To avoid unnecessarily promoting the source vector beyond our native vector
size of 128 bits, I've added some cascading rules that widen based on the
number of elements.
The register class picked will be the RFP80 register class, which has an f80 VT. The code in SelectionDAGBuilder that generates copies around inline assembly doesn't know how to handle an integer and a floating-point type of different bit widths.
The test case is derived from https://godbolt.org/z/sEa659, which gcc accepts but clang crashes on. This patch just gives a more graceful error. I'm not sure if the single-element struct case is special in gcc; adding another field to the struct makes gcc reject it. If we want to support this correctly, I think we need a change in the frontend to give us the true element type. Right now the frontend just realizes the constraint can take a memory argument, so it creates an integer type of the same size and bitcasts.
Differential Revision: https://reviews.llvm.org/D87485
This extends the postinc-distributing code in the load/store optimizer to
also handle the case where there is an existing pre/post-inc instruction,
where subsequent instructions can be modified to use the adjusted
offset from the increment. This can save us from having to keep the old
register live past the increment instruction.
Differential Revision: https://reviews.llvm.org/D83377
For <8 x s32> = fptrunc <8 x s64> the fewerElementsVector action tries to break
down the source vector into the final source vectors of <2 x s64> using unmerge.
This fixes a crash due to using the wrong number of elements for the breakdown
type.
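For reference, a minimal IR reproducer with that shape might look like this (illustrative, not necessarily the committed test):

  define <8 x float> @fptrunc_v8f64_to_v8f32(<8 x double> %x) {
    ; <8 x s64> -> <8 x s32> G_FPTRUNC, broken down via <2 x s64> pieces
    ; produced by G_UNMERGE_VALUES during fewerElementsVector.
    %r = fptrunc <8 x double> %x to <8 x float>
    ret <8 x float> %r
  }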
Also add some legalizer tests explicitly for G_FPTRUNC, which we didn't have.
Differential Revision: https://reviews.llvm.org/D87814
D75689 turns the faddp pattern into a shuffle with vector add.
Match this new pattern in target-specific DAG combine, rather than ISel,
because legalization (for v2f32) turns it into a bit of a mess.
- extended to cover f16, f32, f64 and i64
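For illustration, the rough shape of the pattern being matched, written as IR (a sketch under my reading of the description, not the exact tests):

  define float @faddp_like(<2 x float> %x) {
    ; Pairwise add of the two lanes: fadd of the vector with its lanes
    ; swapped, then an extract of lane 0, which can be selected as faddp.
    %rev = shufflevector <2 x float> %x, <2 x float> undef, <2 x i32> <i32 1, i32 0>
    %sum = fadd <2 x float> %x, %rev
    %r = extractelement <2 x float> %sum, i64 0
    ret float %r
  }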
- Need to lower a COPY from SGPR to VGPR into a real instruction. The
  standard COPY is used where the source and destination are in the same
  register bank, so that we can potentially coalesce them together and
  save one COPY; because of that, backend optimizations such as CSE
  won't handle them. However, a copy from SGPR to VGPR always needs to
  be materialized as a native instruction, so it should be lowered into
  a real one before other backend optimizations run.
Differential Revision: https://reviews.llvm.org/D87556
The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y), p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.
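To make that shape concrete, here is a sketch with the splat hoisted out of the loop; a plain vector add stands in for the predicated MVE intrinsic so the IR stays self-contained, and all names are illustrative:

  define void @sink_splat(i32* %p, i32 %y, i32 %n) {
  entry:
    ; The splat of %y is defined outside the loop...
    %ins = insertelement <4 x i32> undef, i32 %y, i32 0
    %splat = shufflevector <4 x i32> %ins, <4 x i32> undef, <4 x i32> zeroinitializer
    br label %loop

  loop:
    %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
    %gep = getelementptr i32, i32* %p, i32 %iv
    %cast = bitcast i32* %gep to <4 x i32>*
    %x = load <4 x i32>, <4 x i32>* %cast, align 4
    ; ...but only used here, standing in for
    ; llvm.arm.mve.add.predicated(x, splat(y), p); sinking the splat into
    ; the loop lets ISel pick the qr (vector-by-scalar) form.
    %add = add <4 x i32> %x, %splat
    store <4 x i32> %add, <4 x i32>* %cast, align 4
    %iv.next = add i32 %iv, 4
    %done = icmp sge i32 %iv.next, %n
    br i1 %done, label %exit, label %loop

  exit:
    ret void
  }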
Differential Revision: https://reviews.llvm.org/D87693
The instruction combining pass turns a library rotl implementation into llvm.fshl.i16.
In the SelectionDAG the intrinsic is turned into an ISD::ROTL node, which cannot be
selected, so it needs to be expanded into shifts again.
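For illustration, the funnel-shift form produced by instcombine for a 16-bit rotate (a sketch, not necessarily the committed test):

  declare i16 @llvm.fshl.i16(i16, i16, i16)

  define i16 @rotl16(i16 %x, i16 %n) {
    ; instcombine turns the usual (x << n) | (x >> (16 - n)) library
    ; pattern into a funnel shift with both inputs equal, i.e. a rotate.
    %r = call i16 @llvm.fshl.i16(i16 %x, i16 %x, i16 %n)
    ret i16 %r
  }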
Reviewed By: rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D87618
This patch adds new ISD nodes, FCVTZS_MERGE_PASSTHRU &
FCVTZU_MERGE_PASSTHRU, which are used to lower scalable vector
FP_TO_SINT/FP_TO_UINT operations and the following intrinsics:
- llvm.aarch64.sve.fcvtzu
- llvm.aarch64.sve.fcvtzs
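A minimal IR sketch of the scalable conversion this lowers (illustrative):

  define <vscale x 4 x i32> @fcvtzs_nxv4f32(<vscale x 4 x float> %x) {
    ; Scalable FP_TO_SINT, now lowered via FCVTZS_MERGE_PASSTHRU.
    %r = fptosi <vscale x 4 x float> %x to <vscale x 4 x i32>
    ret <vscale x 4 x i32> %r
  }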
Reviewed By: efriedma, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D87232
On Solaris/x86, several hundred 32-bit tests `FAIL`, all in the same way:
env ASAN_OPTIONS=halt_on_error=false ./halt_on_error_suppress_equal_pcs.cpp.tmp
Segmentation Fault (core dumped)
They segfault during startup:
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1 (LWP 1)]
0x080f21f0 in __sanitizer::internal_mmap(void*, unsigned long, int, int, int, unsigned long long) () at /vol/llvm/src/llvm-project/dist/compiler-rt/lib/sanitizer_common/sanitizer_solaris.cpp:65
65 int prot, int flags, int fd, OFF_T offset) {
1: x/i $pc
=> 0x80f21f0 <_ZN11__sanitizer13internal_mmapEPvmiiiy+16>: movaps 0x30(%esp),%xmm0
(gdb) p/x $esp
$3 = 0xfeffd488
The problem is that `movaps` expects 16-byte alignment, while 32-bit Solaris/x86
only guarantees 4-byte alignment following the i386 psABI.
This patch updates `X86Subtarget::initSubtargetFeatures` accordingly,
handles Solaris/x86 in the corresponding testcase, and allows for some
variation in address alignment in
`compiler-rt/test/ubsan/TestCases/TypeCheck/vptr.cpp`.
Tested on `amd64-pc-solaris2.11` and `x86_64-pc-linux-gnu`.
Differential Revision: https://reviews.llvm.org/D87615
When splitting a live interval with subranges, only insert copies for
the lanes that are live at the point of the split. This avoids some
unnecessary copies and fixes a problem where copying dead lanes was
generating MIR that failed verification. The test case for this is
test/CodeGen/AMDGPU/splitkit-copy-live-lanes.mir.
Without this fix, some earlier live range splitting would create %430:
%430 [256r,848r:0)[848r,2584r:1) 0@256r 1@848r L0000000000000003 [848r,2584r:0) 0@848r L0000000000000030 [256r,2584r:0) 0@256r weight:1.480938e-03
...
256B undef %430.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
848B %430.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B %431:vreg_128 = COPY %430:vreg_128
Then RAGreedy::tryLocalSplit would split %430 into %432 and %433 just
before 848B giving:
%432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03
%433 [844r,848r:0)[848r,2584r:1) 0@844r 1@848r L0000000000000030 [844r,2584r:0) 0@844r L0000000000000003 [844r,844d:0)[848r,2584r:1) 0@844r 1@848r weight:2.831776e-03
...
256B undef %432.sub2:vreg_128 = V_LSHRREV_B32_e32 16, %20.sub1:vreg_128, implicit $exec
...
844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128 {
internal %433.sub2:vreg_128 = COPY %432.sub2:vreg_128
848B }
%433.sub0:vreg_128 = V_AND_B32_e32 %92:sreg_32, %20.sub1:vreg_128, implicit $exec
...
2584B %431:vreg_128 = COPY %433:vreg_128
Note that the copy from %432 to %433 at 844B is a curious
bundle-without-a-BUNDLE-instruction that SplitKit creates deliberately.
It includes a copy of .sub0, which is not live at this point, and that
causes it to fail verification:
*** Bad machine code: No live subrange at use ***
- function: zextload_global_v64i16_to_v64i64
- basic block: %bb.0 (0x7faed48) [0B;2848B)
- instruction: 844B undef %433.sub0:vreg_128 = COPY %432.sub0:vreg_128
- operand 1: %432.sub0:vreg_128
- interval: %432 [256r,844r:0) 0@256r L0000000000000030 [256r,844r:0) 0@256r weight:3.066802e-03
- at: 844B
Using real bundles with a BUNDLE instruction might also fix this
problem, but the current fix is less invasive and also avoids some
unnecessary copies.
https://bugs.llvm.org/show_bug.cgi?id=47492
Differential Revision: https://reviews.llvm.org/D87757
2508ef01 fixed a bug about constant removal in negation, but after a
sanitizing check I found there was still an issue with it, so it was
reverted.
Temporary nodes are removed if they turn out to be useless during
negation. Before the removal, they are checked for uses by other nodes,
which is why the removal was moved to after the getNode call. However,
in rare cases the node to be removed is the same as the result of
getNode. We missed that case; this patch fixes it.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D87614
llc would crash for (store (fptosi f128 to i32)) when -mcpu=pwr8; we should
not generate FP_TO_(S|U)INT_IN_VSR for f128 types at this time. This
patch fixes it.
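A sketch of the crashing shape (illustrative, not necessarily the committed test; compiled with llc -mcpu=pwr8 for a powerpc64le triple):

  define void @store_fptosi_f128(fp128 %x, i32* %p) {
    ; store (fptosi f128 -> i32): must not be turned into
    ; FP_TO_SINT_IN_VSR for f128 on pwr8.
    %i = fptosi fp128 %x to i32
    store i32 %i, i32* %p, align 4
    ret void
  }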
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D86686
Summary:
When running a large test in LLVM_ENABLE_EXPENSIVE_CHECKS=ON mode,
the buildbot could hit a timeout.
Disable the test when this mode is on.
Also disable it for debug builds so that the test won't hang for too long.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D87794
If we have an all-ones mask, we can just use a regular unmasked load. InstCombine already gets this in IR, but the all-ones mask can appear after type legalization.
Only avx512 test cases are affected because the X86 backend already looks for element 0 and the last element being 1. It replaces this with an unmasked load and a blend. The all-ones mask is a special case of that where the blend will be removed. That transform is only enabled on avx2 targets; I believe that's because a non-zero passthru on avx2 already requires a separate blend, so it's more profitable to handle mixed constant masks there.
This patch adds dedicated all-ones handling to the target-independent DAG combiner. I've skipped extending, expanding, and indexed loads for now. X86 doesn't use indexed loads, so I don't know much about them. Extending made me nervous because I wasn't sure I could trust that the memory VT had the right element count, due to some weirdness in vector splitting. For expanding I wasn't sure whether we needed different undef handling.
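For reference, the all-ones-mask form in IR (a sketch, not one of the affected tests):

  declare <8 x float> @llvm.masked.load.v8f32.p0v8f32(<8 x float>*, i32 immarg, <8 x i1>, <8 x float>)

  define <8 x float> @load_allones_mask(<8 x float>* %p, <8 x float> %pass) {
    ; With an all-ones mask every lane is loaded, so this is equivalent
    ; to a plain load and the passthru operand is irrelevant.
    %v = call <8 x float> @llvm.masked.load.v8f32.p0v8f32(<8 x float>* %p, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> %pass)
    ret <8 x float> %v
  }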
Differential Revision: https://reviews.llvm.org/D87788
We should be able to turn this into an unmasked load. X86 has an
optimization to detect that the first and last elements aren't masked
and then turn the whole thing into an unmasked load and a blend.
That transform is disabled on avx512 though.
But if we know the blend isn't needed, then the unmasked load by
itself should always be profitable.
https://reviews.llvm.org/D86393
Patch adds five new `GICombinerRules`, one for each of the following unary
FP instrs: `G_FNEG`, `G_FABS`, `G_FPTRUNC`, `G_FSQRT`, and `G_FLOG2`. The
combine rules perform the FP operation on the constant operand and replace
the original instr with the result. Patch additionally adds new combiner
tests for the AArch64 target to test these new combiner rules.
This currently has no impact on code, but prevents sizeable code size
regressions after D52010 by avoiding spilling and reloading all values
inside blocks that loop back. Add a baseline test which would regress
without this patch.
eliminateFrameIndex won't fix up the offset register when the direct
frame index reference is moved to a separate move instruction. Switch
the offset to a base of 0 (which it probably should have been to begin with).
This patch prevents the `llvm.masked.gather` and `llvm.masked.scatter` intrinsics from being scalarized when invoked on scalable vectors.
The change in `Function.cpp` is needed to prevent the warning that is raised when `getNumElements` is used in place of `getElementCount` on `VectorType` instances. The tests guard against regressions of this change.
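A sketch of the scalable form that must no longer be scalarized (illustrative; the intrinsic name follows the usual overload mangling):

  declare <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*>, i32 immarg, <vscale x 4 x i1>, <vscale x 4 x float>)

  define <vscale x 4 x float> @gather_nxv4f32(<vscale x 4 x float*> %ptrs, <vscale x 4 x i1> %mask, <vscale x 4 x float> %pass) {
    ; A gather on scalable vectors cannot be unrolled element by element.
    %v = call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0f32(<vscale x 4 x float*> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> %pass)
    ret <vscale x 4 x float> %v
  }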
The tests make sure that calls to `llvm.masked.[gather|scatter]` are still scalarized when:
# the intrinsics are operating on fixed size vectors, and
# the compiler is not targeting fixed length SVE code generation.
Reviewed By: efriedma, sdesmalen
Differential Revision: https://reviews.llvm.org/D86249
WeakRefDirective should specify a directive that declares "a global as being a weak undefined symbol".
The directive used by AMDGPU was incorrect: ".weakref" is intended for other purposes.
The correct directive is ".weak", and it is already defined as the default for ELF,
so the redefinition was removed.
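For illustration, an extern_weak reference of the kind that should be emitted with ".weak" rather than ".weakref" (a sketch; the names are illustrative, and that this particular path exercises WeakRefDirective is my assumption):

  @gv = extern_weak global i32

  define i32 @read_gv() {
    ; A reference to an extern_weak global; on ELF the symbol should be
    ; declared with ".weak gv", not with ".weakref".
    %v = load i32, i32* @gv, align 4
    ret i32 %v
  }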
Reviewers: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D87762
Fix lowering and instruction selection for v3x16 types
and enable InstCombine to emit them.
This patch only implements it for the SelectionDAG.
GlobalISel tests in GlobalISel/llvm.amdgcn.image.load.1d.d16.ll and
GlobalISel/llvm.amdgcn.image.store.2d.d16.ll still don't work.
Differential Revision: https://reviews.llvm.org/D84420
Pre-gfx10 all MODE-setting instructions were S_SETREG_B32 which is
marked as having unmodeled side effects, which makes the machine
scheduler treat it as a barrier. Now that we have proper implicit $mode
operands we can use a no-side-effects S_SETREG_B32_mode pseudo instead
for setregs that only touch the FP MODE bits, to give the scheduler more
freedom.
Differential Revision: https://reviews.llvm.org/D87446
We've fixed the case where this could return an instruction after the
given instruction, but that also means we can falsely return a
'unique' def when it could actually be one coming from the backedge of
a loop.
Differential Revision: https://reviews.llvm.org/D87751
Clear the CurrentPredicate when we find an instruction which would
completely overwrite the VPR. This fix essentially means we're back
to not really being able to handle VPT instructions when tail
predicating.
Differential Revision: https://reviews.llvm.org/D87610