On RISC-V, the icmp is not sunk, which generates the following suboptimal
branch pattern:
```
core_list_find:
lh a2, 2(a1)
seqz a3, a0 <<
bltz a2, .LBB0_5
bnez a3, .LBB0_9 << should sink the seqz
[...]
j .LBB0_9
.LBB0_5:
bnez a3, .LBB0_9 << should sink the seqz
lh a1, 0(a1)
[...]
```
The blocks after `codegenprepare` look as follows:
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !4
%cmp = icmp sgt i16 %0, -1
%tobool.not37 = icmp eq %struct.list_head_s* %list, null
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
br i1 %tobool.not37, label %return, label %land.rhs11.lr.ph
```
where `%tobool.not37` is the result of the icmp that is not sunk.
Note that it is computed in the basic block that ends in what becomes the
`bltz` instruction, while the `bnez` ends up in a basic block of its own.
Compare this to what happens on AArch64 (where the icmp is correctly sunk):
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !6
%cmp = icmp sgt i16 %0, -1
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
%1 = icmp eq %struct.list_head_s* %list, null
br i1 %1, label %return, label %land.rhs11.lr.ph
```
This is caused by sinkCmpExpression() being skipped when the target
reports support for multiple condition registers.
Given that the check for multiple condition registers affects only
sinkCmpExpression() and shouldNormalizeToSelectSequence(), this change
adjusts the RISC-V target as follows (see the sketch below):
* we no longer signal multiple condition registers (thus changing
  the behaviour of sinkCmpExpression() back to sinking the icmp)
* we override shouldNormalizeToSelectSequence() so that the backend
  always selects its preferred normalisation strategy
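As a rough illustration, here is a minimal, hypothetical sketch of the two
adjustments at the TargetLowering level (the hook names are existing LLVM
hooks, but the class shown and the concrete return value are assumptions for
illustration, not the actual patch):
```
// Illustrative sketch only; not the actual RISCVTargetLowering change.
#include "llvm/CodeGen/TargetLowering.h"

class SketchRISCVTargetLowering : public llvm::TargetLowering {
public:
  explicit SketchRISCVTargetLowering(const llvm::TargetMachine &TM)
      : llvm::TargetLowering(TM) {
    // (1) Stop signalling multiple condition registers, so CodeGenPrepare's
    //     sinkCmpExpression() sinks the icmp into the blocks that use it.
    setHasMultipleConditionRegisters(false);
  }

  // (2) Pin the normalisation strategy so that dropping the flag above does
  //     not change how setcc chains feeding and/or are normalised to selects.
  bool shouldNormalizeToSelectSequence(llvm::LLVMContext &Ctx,
                                       llvm::EVT VT) const override {
    return false; // assumed preference; without the override, select-and.ll
                  // and select-or.ll pick up extra branches
  }
};
```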
With both changes, the test results remain unchanged. Note that without
the target-specific override to shouldNormalizeToSelectSequence(), worse
code (more branches) is generated for select-and.ll and select-or.ll.
The original test case changes as expected:
```
core_list_find:
lh a2, 2(a1)
bltz a2, .LBB0_5
beqz a0, .LBB0_9 <<
[...]
j .LBB0_9
.LBB0_5:
beqz a0, .LBB0_9 <<
lh a1, 0(a1)
[...]
```
Differential Revision: https://reviews.llvm.org/D98932
Patch to fix some of the regressions in D77804.
By folding to rotate/funnel-shift by constant amounts for illegal types, we prevent SimplifyDemandedBits from destroying the patterns prematurely, allowing us to use the rotate/funnel-shift legalization that was added in D112443.
Differential Revision: https://reviews.llvm.org/D113192
These tests were introduced in D109809, which I pushed on behalf of
@tangxingxin1008. I must not have understood the correct Arcanist
workflow for this and as such may have locally tested a stale build.
This patch fixes the issue by re-running update_llc_test_checks.py on
all four tests.
Fixed a vector type issue where getVectorNumElements() should be
replaced by getVectorElementCount() when lowering these
intrinsics.
This is similar to D94149
Signed-off-by: Eric Tang <tangxingxin1008@gmail.com>
Reviewed By: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D109809
If we have a large enough floating point type that can exactly
represent the integer value, we can convert the value to FP and
use the exponent to calculate the leading/trailing zeros.
The exponent will contain log2 of the value plus the exponent bias.
We can then remove the bias and convert from log2 to leading/trailing
zeros.
This doesn't work for zero since the exponent of zero is zero so we
can only do this for CTLZ_ZERO_UNDEF/CTTZ_ZERO_UNDEF. If we need
a value for zero we can use a vmseq and a vmerge to handle it.
We need to be careful to make sure the floating point type is legal.
If it isn't we'll continue using the integer expansion. We could split the vector
and concatenate the results but that needs some additional work and evaluation.
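As a scalar illustration of the trick (a minimal sketch with made-up helper
names, not the vector lowering itself):
```
#include <cassert>
#include <cstdint>
#include <cstring>

// CTTZ_ZERO_UNDEF-style: isolate the lowest set bit (a power of two, so it
// converts to float exactly), then read log2 out of the biased exponent.
unsigned cttzViaFP(uint32_t X) {
  assert(X != 0 && "undefined for zero, as described above");
  uint32_t LowBit = X & -X; // power of two
  float F = static_cast<float>(LowBit);
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  unsigned Exponent = (Bits >> 23) & 0xff;
  return Exponent - 127; // remove the IEEE-754 bias -> log2(LowBit)
}

// CTLZ_ZERO_UNDEF-style: convert the whole value to an FP type wide enough to
// represent it exactly (double covers every uint32_t) and use floor(log2(X)).
unsigned ctlzViaFP(uint32_t X) {
  assert(X != 0 && "undefined for zero, as described above");
  double D = static_cast<double>(X);
  uint64_t Bits;
  std::memcpy(&Bits, &D, sizeof(Bits));
  unsigned Exponent = (Bits >> 52) & 0x7ff;
  return 31 - (Exponent - 1023); // bit width - 1 - floor(log2(X))
}
```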
Differential Revision: https://reviews.llvm.org/D111904
Add test coverage for a problem that was fixed by D113493: when updating
live intervals, fix handling of live ranges that were previously tied to
an early-clobber def but no longer are.
This change make WidenVecRes_SELECT work for scalable vectors.
This patch is split from [D110319](https://reviews.llvm.org/D110319)
Signed-off-by: Eric Tang <tangxingxin1008@gmail.com>
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D110388
These test files are copied directly from AArch64. Some of the cases
may benefit from ANDN with the Zbb extension. Some cases already
use ANDN.
selectcc-to-shiftand.ll also contains tests that exercise the select->and
conversion even when an ANDN isn't needed. I think this improves our
coverage of these optimizations.
Differential Revision: https://reviews.llvm.org/D113935
This handles the case where the mask register instruction input
comes from a Phi of vsetvlis. If the VLMAX is the same as the VLMAX
required by the mask register instruction, we can avoid a vsetvli.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113204
The division by constant optimization often produces constants that
are uimm32, but not simm32. These constants require 3 or 4 instructions
to materialize without Zba.
Since these constants are often used by a multiply with an LHS
that needs to be zero extended with an AND, we can switch the MUL
to a MULHU by shifting both inputs left by 32. Once we shift the
constant left, the upper 32 bits no longer need to be 0, so constant
materialization is free to use LUI+ADDIW. This reduces the constant
materialization from 4 instructions to 3 in some cases, while also
reducing the zero extend of the LHS from 2 shifts to 1.
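A minimal sketch of the identity this relies on (the helper name is made up
and the 128-bit type is a GCC/Clang extension):
```
#include <cassert>
#include <cstdint>

// What the RISC-V MULHU instruction computes: the high 64 bits of the full
// 64x64 -> 128-bit unsigned product.
uint64_t mulhu(uint64_t A, uint64_t B) {
  return static_cast<uint64_t>((static_cast<unsigned __int128>(A) * B) >> 64);
}

int main() {
  uint32_t LHS = 0x12345678u; // value that would otherwise need the zext AND
  uint64_t C = 0xCCCCCCCDu;   // uimm32 (not simm32) divide-by-constant magic
  // MUL with the zero-extended LHS ...
  uint64_t ViaMul = static_cast<uint64_t>(LHS) * C;
  // ... equals MULHU with both inputs shifted left by 32, because
  // (LHS << 32) * (C << 32) == (LHS * C) << 64.
  uint64_t ViaMulhu = mulhu(static_cast<uint64_t>(LHS) << 32, C << 32);
  assert(ViaMul == ViaMulhu);
  return 0;
}
```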
Differential Revision: https://reviews.llvm.org/D113805
Register uses that are MRI->isConstantPhysReg() should not inhibit
the sinking transformation.
Reviewed By: StephenTozer
Differential Revision: https://reviews.llvm.org/D111531
This improves our coverage of soft float libcall lowering.
Remove most of the test cases from rv64i-single-softfloat.ll. They
were duplicated in the test files that now test soft float. Only
a couple of test cases for constrained FP remain. Those should be
removed when we start supporting constrained FP.
This is follow up from D113528.
Many of these had an extra 'f' at the beginning of their name that
caused them to not be treated as intrinsics.
I'm not sure what fpround was supposed to be so I deleted it.
frem was changed from an intrinsic to an instruction.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D113528
Previously these would crash. I don't think these can be generated
directly from C. Not sure if any optimizations can introduce them.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D113527
In TwoAddressInstructionPass::processTiedPairs when updating live
intervals after moving the last use of RegB back to the newly inserted
copy, update any affected subranges as well as the main range.
Differential Revision: https://reviews.llvm.org/D110411
The introduction of this legalization, D111248, forgot to replace the
old chain with the new. This could manifest itself in the old
(illegally-typed) value remaining in the DAG, though the simple test
cases didn't catch this.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113561
Not all scalar element types are allowed in vectors so we may not
be able to bitcast to a 1 element vector to use insert/extract.
This will become a bigger issue when the Zve extensions are committed.
For now, I'm using the ELEN limit to limit the element types.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113219
This patch fixes a compiler crash when widening scalable-vector loads
and stores which end up breaking down to element-wise store operations.
It does so by providing a way for targets with support for
vector-predicated loads and stores to use those instead. By widening the
operation but maintaining the original effective operation length via
the EVL, only the intended vector elements are loaded or stored.
This method should in theory be possible and even preferred for
fixed-length vector types, but all fixed-length types can be broken down
into their elements, and regardless I have observed regressions in the
generated code when doing so. I believe this is simply due to
VP_LOAD/VP_STORE not being up to par with LOAD/STORE in terms of
optimization. It does improve performance on smaller self-contained
examples, however, so the potential is there.
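A toy scalar model of the idea (purely illustrative; the real change lives in
type legalization and emits VP nodes):
```
#include <cstddef>
#include <vector>

// Stand-in for a vector-predicated load: the result has the *widened* number
// of lanes, but the explicit vector length (EVL) caps how many lanes actually
// touch memory, so widening never accesses past the original elements.
std::vector<float> widenedVPLoad(const float *Ptr, std::size_t WidenedLanes,
                                 std::size_t EVL /* original element count */) {
  std::vector<float> Result(WidenedLanes, 0.0f); // lanes >= EVL never load
  for (std::size_t I = 0; I < EVL; ++I)
    Result[I] = Ptr[I];
  return Result;
}
```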
While the only target that benefits from this is RISCV, the legalization
is generic and so was placed centrally.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111248
This patch merges FoldConstantVectorArithmetic back into FoldConstantArithmetic.
Like FoldConstantVectorArithmetic we now handle vector ops with any operand count, but we currently still only handle binops for scalar types - this can be improved in future patches - in particular some common unary/trinary ops still have poor constant folding.
There's one change in functionality causing test changes: FoldConstantVectorArithmetic bails early if the build/splat vector isn't made up of all-constant (with some undef) elements, whereas FoldConstantArithmetic doesn't; it instead attempts to fold the scalar nodes and bails if they fail to regenerate a constant/undef result, allowing some additional identity/undef patterns to be handled.
Differential Revision: https://reviews.llvm.org/D113300
This improves our type coverage. We were previously only testing integer
insert and extract because the FP types were not enabled for
arguments and returns.
Differential Revision: https://reviews.llvm.org/D113217
Similar to D110206, this patch optimizes unmasked vp.load intrinsics to
avoid the need of a vmset instruction to set the mask. It does so by
selecting a riscv_vle intrinsic rather than a riscv_vle_mask intrinsic.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113022
Although this isn't required, it better matches the suggested syntax as
per the documentation work ongoing in D112930.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D112939
Fold (srl (mul (zext i32:$a to i64), i64:c), 32) -> (mulhu $a, c),
if c can be truncated to i32 without loss.
Reviewed By: frasercrmck, craig.topper, RKSimon
Differential Revision: https://reviews.llvm.org/D108129
If the type of a funnel shift needs to be expanded, expand it to two funnel shifts instead of regular shifts. For constant shifts, this doesn't make much difference, but for variable shifts it allows a more optimal lowering.
Also use the optimized funnel shift lowering for rotates.
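A minimal sketch of what the expansion looks like for an illegal i64 funnel
shift split into two i32 halves (helper names are made up; this mirrors the
idea, not the legalizer code):
```
#include <cassert>
#include <cstdint>

// Reference 32-bit funnel shift left: shift the 64-bit concatenation Hi:Lo
// left by Amt (mod 32) and keep the top 32 bits.
uint32_t fshl32(uint32_t Hi, uint32_t Lo, unsigned Amt) {
  Amt &= 31;
  return Amt ? (Hi << Amt) | (Lo >> (32 - Amt)) : Hi;
}

// Expand a 64-bit fshl into two 32-bit funnel shifts.
// X = XHi:XLo, Y = YHi:YLo, Amt already reduced into [0, 64).
void expandFshl64(uint32_t XHi, uint32_t XLo, uint32_t YHi, uint32_t YLo,
                  unsigned Amt, uint32_t &ResHi, uint32_t &ResLo) {
  if (Amt < 32) {
    ResHi = fshl32(XHi, XLo, Amt);
    ResLo = fshl32(XLo, YHi, Amt);
  } else {
    ResHi = fshl32(XLo, YHi, Amt - 32);
    ResLo = fshl32(YHi, YLo, Amt - 32);
  }
}

int main() {
  // Cross-check the expansion against a plain 64-bit funnel shift.
  uint64_t X = 0x0123456789ABCDEFull, Y = 0xFEDCBA9876543210ull;
  for (unsigned Amt = 0; Amt < 64; ++Amt) {
    uint64_t Ref = Amt ? (X << Amt) | (Y >> (64 - Amt)) : X;
    uint32_t Hi, Lo;
    expandFshl64(uint32_t(X >> 32), uint32_t(X), uint32_t(Y >> 32),
                 uint32_t(Y), Amt, Hi, Lo);
    assert(Ref == ((uint64_t(Hi) << 32) | Lo));
  }
  return 0;
}
```
For constant amounts the branch on the amount folds away, which matches the
note above that constants don't change much; the win is for variable amounts.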
Alive2: https://alive2.llvm.org/ce/z/TvHDB- / https://alive2.llvm.org/ce/z/yzPept
(Branched from D108058 as getting this completed should help unlock some other WIP patches).
Original Patch: @efriedma (Eli Friedman)
Differential Revision: https://reviews.llvm.org/D112443
When D105690 changed the mnemonic from vf(w)redsum to vf(w)redusum,
several tests were deleted instead of being renamed.
This commit also consistently renames the other tests that weren't
deleted.
If the VL operand of a mask register instruction comes from an
explicit vsetvli with a different VTYPE, we can still avoid needing
a vsetvli as long as the SEW/LMUL ratio is the same and policy bits
match.
Differential Revision: https://reviews.llvm.org/D112762
If the VL argument for a mask instruction comes from a vsetvli with
an SEW!=8, we will insert an extra vsetvli for the mask instruction
even if the SEW/LMUL ratio is the same. This requires at least one
instruction before the mask instruction that needs the SEW of the
explicit vsetvli. Otherwise, we'll just rewrite the explicit vsetvli.
Sync the order of the Zvlsseg registers with the vector registers to avoid
unnecessary register copies between vector instructions and Zvlsseg
instructions.
Differential Revision: https://reviews.llvm.org/D110250
If we know the source operand of a COPY is defined by a vector instruction
with tail agnostic and the same LMUL, and there is no vsetvli between the
COPY and the defining instruction that changes the vl and vtype, we can use
vmv.v.v or vmv.v.i to copy the vector registers, which performs better than
the whole vector register move instructions.
If the source of the COPY is a vmv.v.i, we can use vmv.v.i for the
COPY.
This patch only considers all these instructions within one basic block.
Case 1:
```
bb.0:
...
VSETVLI # The first VSETVLI before COPY and VOP.
... # Use this VSETVLI to check LMUL and tail agnostic.
...
vy = VOP va, vb # Define vy.
... # There is no vsetvli between VOP and COPY.
vx = COPY vy
```
Case 2:
```
bb.0:
...
VSETVLI # The first VSETVLI before VOP.
... # Use this VSETVLI to check LMUL and tail agnostic.
...
vy = VOP va, vb # Define vy.
... # There is no vsetvli to change vl between VOP and COPY.
...
VSETVLI # The first VSETVLI before COPY.
... # This VSETVLI does not change vl and vtype.
...
vx = COPY vy
```
Co-Authored-by: Zakk Chen <zakk.chen@sifive.com>
Co-Authored-by: Kito Cheng <kito.cheng@sifive.com>
Differential Revision: https://reviews.llvm.org/D103510
Simplify "LUI+SLLI+ADDI+SLLI" and "LUI+ADDIW+SLLI+ADDI+SLLI" to
"LUI+ADDIW+SLLIUW" to reduce total instruction amount.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111933
All but 2 of the vector builtins are only used by clang_builtin_alias.
When using clang_builtin_alias, the type string of the builtin is never
checked. Only the types in the function definition used for the alias
are checked.
This patch takes advantage of this to share a single builtin for
many different types. We already used type overloads on the IR intrinsics,
so the codegen for the builtins that are being merged was already
the same. This extends the type overloading to the builtins.
I had to make a few tweaks to make this work.
- Floating point vector-vector vmerge now uses the vmerge intrinsic
  instead of the vfmerge intrinsic. New isel patterns and tests are
  added to support this.
- The SemaChecking for the immediate of vset_v/vget_v has been removed.
  Determining the valid range is harder now. I've added masking to
  ManualCodegen to ensure valid IR for invalid input.
This reduces the number of builtins from ~25000 to ~1100.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D112102