This patch adds an optimization to splat-like operations where the
splatted value is extracted from an identically-sized vector. On RVV we
can splat that via vrgather.vx/vrgather.vi without dropping to scalar
beforehand.
We do have a similar VECTOR_SHUFFLE-specific optimization, but it only
works on fixed-length vector types and only when the splat lane is a
constant. This patch extends the optimization to work on
scalable-vector types and on unknown extract indices.
It is performed during fixed-vector BUILD_VECTOR lowering and during a
new DAGCombine on SPLAT_VECTOR for scalable vectors.
Reviewed By: craig.topper, khchen
Differential Revision: https://reviews.llvm.org/D118456
rv64izbb has RORW/ROLW instructions that operate on the lower
32-bits of a 64-bit value and sign extend bit 31 of the result.
DAGCombiner won't match rotate idioms because the i32 type isn't Legal
on riscv64.
This patch teaches DAGCombiner to allow it if the type is going to
be promoted and the target has Custom type legalization for ISD::ROTL
or ISD::ROTR. I've restricted this to scalar types. It doesn't appear
that any in-tree targets other than riscv64 have custom type
legalization for rotates.
If this patch isn't acceptable, I guess I can match SRLW, SLLW, and OR
after type legalization, but I'd like to avoid that if possible.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119062
We were only testing rotate idioms on rv32i. DAGCombiner won't
form ISD::ROTL/ROTR unless those operations are Legal or Custom.
They aren't for rv32 so we were only testing shift lowering.
This commit adds i64 idioms and the idioms that mask the shift
amount to avoid UB for a rotate of 0. I've added riscv64 and Zbb
RUN lines to show that we do match rotate for XLen types when
available. We currently miss i32 on rv64izbb.
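For reference, a minimal C sketch (illustrative only, not taken from the
test file) of the masked rotate idiom these tests exercise:

```c
#include <stdint.h>

/* Rotate-left idioms with the shift amount masked so a rotate by 0 does not
   shift by the full bit width (which would be UB). DAGCombiner can recognize
   these as ISD::ROTL when the operation is Legal or Custom for the type. */
uint32_t rotl32(uint32_t x, uint32_t n) {
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

uint64_t rotl64(uint64_t x, uint64_t n) {
    return (x << (n & 63)) | (x >> ((64 - n) & 63));
}
```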
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Differential Revision: https://reviews.llvm.org/D118974
This code tries to replace the pattern with a pair of shifts, but
we were excluding the case where the AND could be a zext.h or zext.w.
The SLLI/SRL pair is more compressible and doesn't come with much
downside.
We do regress one test case in rv64i-exhaustive-w-insts.ll but we
can probably add a narrower exclusion for that case.
Using AArch64's original implementation for reference, this patch
implements a pass to remove unneeded copies of X0. This pass runs
after register allocation and looks to see if a register is implied
to be 0 by a branch in the predecessor basic block.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118160
Add the vslidedown and interleave patterns that I recently implemented.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118952
This avoids a crash for scalable vectors and/or scalarization for
fixed vectors.
The algorithm is different enough that I don't think it makes sense
to merge with ceil/floor/trunc. The algorithm is adapted from gcc's
X86 SSE2 output.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117247
Add support for the 'pause' hint instruction as an alias for
'fence w, 0'. To do this allow the 'fence' operands pred and succ
to be set to 0 (the empty set). This will also allow future hints
to be encoded as 'fence 0, <x>' and 'fence <x>, 0'.
This patch is revised from @mundaym's D93019.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D117789
This adds or reuses ISD opcodes for vwadd.wv, vwaddu.wv, vwadd.vv, vwaddu.vv
and a similar set for sub.
I've included support for narrowing scalar splats that have known
sign/zero bits similar to what was done for MUL_VL.
The conversion to vwadd.vv proceeds in two phases. First we'll form
a vwadd.wv by narrowing one of the operands. Then we'll visit the
vwadd.wv to try to narrow the other operand. This turned out to be
simpler than catching all the cases in one step. The forming of
vwadd.wv can happen for either operand for add, but only the right
hand side for sub since sub isn't commutable.
An interesting quirk is that ADD_VL and VZEXT_VL/VSEXT_VL are formed
during vector op legalization, but VMV_V_X_VL isn't usually formed
until op legalization when BUILD_VECTORS are handled. This leads to
VWADD_W_VL forming in one DAG combine round, and then a later DAG combine
round sees the VMV_V_X_VL and needs to commute the operands to get the
splat in position. This alone necessitated a VWADD_W_VL combine function
which made forming vwadd.vv in two stages an easy choice.
I've left out trying hard to form vwadd.wx instructions for now. It would
only save an extend in the scalar domain which isn't as interesting.
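As a rough scalar illustration (a C sketch, not code from the patch) of the
arithmetic the combine targets, an add of two sign-extended narrow operands
is exactly what vwadd.vv computes:

```c
#include <stdint.h>

/* add(sext(a), sext(b)) with i32 elements and i64 results: the pattern that
   can be selected to vwadd.vv instead of two extends plus a wide add. The
   zero-extended variant corresponds to vwaddu.vv. */
void widening_add(const int32_t *a, const int32_t *b, int64_t *r, int n) {
    for (int i = 0; i < n; ++i)
        r[i] = (int64_t)a[i] + (int64_t)b[i];
}
```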
Might need to review the test coverage a bit. Most of the vwadd.wv
instructions are coming from vXi64 tests on rv64. The tests were
copy-pasted from the existing multiply tests.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117954
Previously, all children would be checked to see if any were an
explicit Register. If any were, no commutable patterns would be
generated. This patch loosens the restriction to only check the
children that are being commuted.
Digging back through history, this code predates the existence of
commutable intrinsics and commutable SDNodes with more than 2
operands. At that time the loop would count the number of children that
weren't registers and if that was equal to 2 it would allow commuting.
I don't think this loop was re-considered when commutable
intrinsics were added or when we allowed SDNodes with more than 2
operands.
This is important for RISCV where our isel patterns have a V0 mask
operand after the commutable operands on some RISCVISD opcodes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D117955
The first phase of the analysis can avoid a vsetvli if an earlier
instruction in the block used an SEW and LMUL that when combined with
the EEW of the load/store would produce the desired EMUL. If we
avoided a vsetvli this will affect the global analysis we do in the
second phase.
The third phase where we really insert the vsetvlis needs to agree
with the first phase. If it doesn't we can insert vsetvlis that
invalidate the global analysis.
In the test case there is a VSETVLI in the preheader that sets
SEW=64 and LMUL=1. Inside the loop there is a VADD with SEW=64 and LMUL=1.
This VADD is followed by a store that wants SEW=32 LMUL=1/2.
Because it has EEW=32 as part of the opcode, the SEW=64 LMUL=1 from the
VADD can become EMUL=1 for the store. So the first phase determines no
vsetvli is needed.
The third phase manages CurInfo differently than BBInfo.Change from the
first phase. CurInfo is only updated when we see a vsetvli or insert
a vsetvli. This was done to allow predecessor block information from
the global analysis to be applied to multiple instructions. Since the
loop body has no vsetvli we won't update CurInfo for either the VADD
or the VSE. This prevented us from checking the store vsetvli elision
for the VSE resulting in a vsetvli SEW=32 LMUL=1/2 being emitted which
invalidated the global analysis.
To mitigate this, I've added a BBLocalInfo variable that more closely
matches the first phase propagation. This gets updated based on the
VADD and prevents emitting a vsetvli for the store like we did in the
first phase.
I wonder if we should do an earlier phase to handle the load/store case
by adding more pseudo opcodes and changing the SEW/LMUL for those
instructions before the insertion analysis. That might be more robust
than trying to guarantee two phases make the same decision.
Fixes the test from D118629.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118667
Inspired by a recent Discourse post on undef vs. poison usage, this
series of patches should reduce the number of undefs in LLVM tests by
around 10%.
Only undef vector operands to insertelement/shufflevector have been
handled, which are by far the most common we've got.
The switchover is split into 3 fairly arbitrary clusters to make it
slightly more manageable: vector predication, fixed-length vectors,
scalable vectors.
This test shows a loop, whose preheader uses a SEW=64, LMUL=1 vector
operation. The loop body starts off with another SEW=64, LMUL=1 VADD
vector operation, before switching to a SEW=32, LMUL=1/2 vector store
instruction.
We can see that the VSETVLI insertion pass omits a VSETVLI before the
VADD (thinking it inherits its configuration from the preheader) but
does place a SEW=32, LMUL=1/2 VSETVLI before the store. This results in
a miscompilation as when the loop comes back around, the VADD is
incorrectly configured with SEW=32, LMUL=1/2.
It appears to be a bad load/store optimization, as replacing the vector
store with an SEW=32, LMUL=1/2 VADD does correctly insert a VSETVLI. The
issue is therefore possibly arising from canSkipVSETVLIForLoadStore.
Differential Revision: https://reviews.llvm.org/D118629
I have updated TargetLowering::isConstTrueVal to also consider
SPLAT_VECTOR nodes with constant integer operands. This allows the
optimisation to also work for targets that support scalable vectors.
Differential Revision: https://reviews.llvm.org/D117210
The spec doesn't seem to be written as if Zfh implies Zfhmin. They
seem to be separate extensions.
This patch moves the instructions from Zfhmin to be enabled with
either the Zfh or Zfhmin extensions.
Reviewed By: achieveartificialintelligence
Differential Revision: https://reviews.llvm.org/D118581
We can use the RISCVISD::GREV encoding that swaps the bits in
each byte. This allows it to use the existing computeKnownBits
support for RISCVISD::GREV.
Currently the backend promotes the mask vector to an i8 vector and extracts an element from that. Instead, we can bitcast to a vector with wider elements, extract that element to a GPR, and then use a base I-extension instruction to extract the desired bit.
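A scalar C sketch (hypothetical helper, not from the patch) of the idea: once
the mask bits sit in a GPR as a wider integer, the requested bit falls out
with base-ISA shift/and instructions:

```c
#include <stdint.h>

/* After bitcasting the mask vector to a wider-element vector and moving one
   element to a GPR, extracting the desired mask bit is a shift plus an AND. */
static inline int extract_mask_bit(uint64_t mask_word, unsigned idx) {
    return (mask_word >> idx) & 1;  /* srl + andi on RISC-V */
}
```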
Differential Revision: https://reviews.llvm.org/D117389
We were creating a truncate with the default VL for the type, but for
VP intrinsics we have an explicit VL that we should use instead.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118406
`__gnu_h2f_ieee` and `__gnu_f2h_ieee` were introduced by ARM and set as
the default names for the fp16/fp32 conversion routines in LLVM.
However, RISC-V GCC uses its default naming scheme for these routines,
which is `__extendhfsf2` and `__truncsfhf2`, and that causes a runtime
ABI incompatibility.
Although we don't have a formal runtime ABI spec for those naming
conventions yet, I think it would be great to fix the incompatibility
first.
I plan to create a runtime ABI spec under the psABI spec this year.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118207
These splats -- whether BUILD_VECTOR or SPLAT_VECTOR -- are formed by
first extracting a value from a vector and splatting it to all elements
of the destination vector. These could be performed more optimally,
avoiding the drop to scalar, using RVV's vrgather, for example.
The LMULMAX check names didn't match the options we were passing to llc
(they were swapped around) and we were silently missing coverage for one
test which differs between RV32 and RV64.
According to riscv-v-spec-1.0, widening signed(vs2)-unsigned integer multiply
vwmulsu.vv vd, vs2, vs1, vm # vector-vector
vwmulsu.vx vd, vs2, rs1, vm # vector-scalar
It is worth noting that the signed operand is always vs2.
For vwmulsu.vv we can swap the two operands and it doesn't matter which one
is the sign extension, but for vwmulsu.vx the sign-extended operand cannot
be the vector extended from the scalar (rs1).
I specifically added two functions ending with _swap in the test case.
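For reference, a scalar C sketch (illustrative, not from the patch) of the
semantics: the signed-by-unsigned product is commutative as a value, which is
why the .vv operands can be swapped, while the encoding fixes the unsigned
operand as rs1 for the .vx form:

```c
#include <stdint.h>

/* vwmulsu: signed (vs2) x unsigned (vs1/rs1) widening multiply.
   For 32-bit inputs the exact product always fits in int64_t. */
int64_t wmulsu(int32_t vs2, uint32_t vs1) {
    return (int64_t)vs2 * (int64_t)vs1;  /* both conversions are value-preserving */
}
```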
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118215
This patch adds support for expanding VP_MERGE through a sequence of
vector operations producing a full-length mask setting up the elements
past EVL/pivot to be false, combining this with the original mask, and
culminating in a full-length vector select.
This expansion should work for any data type, though the only use for
RVV is for boolean vectors, which themselves rely on an expansion for
the VSELECT.
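A scalarized C sketch of the expansion described above (illustrative names,
not the actual legalization code):

```c
#include <stdbool.h>

/* vp.merge expansion: lanes at or past EVL behave as if the mask were false,
   then a full-length select picks between the two source operands. */
void vp_merge_expand(const bool *mask, const int *on_true, const int *on_false,
                     int *result, int vlen, int evl) {
    for (int i = 0; i < vlen; ++i) {
        bool m = mask[i] && (i < evl);            /* combine mask with "lane < EVL" */
        result[i] = m ? on_true[i] : on_false[i]; /* full-length vector select */
    }
}
```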
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118058
This matches what the spec uses for the vncvt.x.x.w assembly
pseudoinstruction.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D118295
This is the unsigned variant of D111976, where we convert a clamped
fptoui to a fptoui.sat. Because we are unsigned, the condition this time
is only UMIN of UINT_MAX. Similarly to D111976 it handles ISD::UMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D114964
The goal is to support tail and mask policy in RVV builtins.
We focus on the IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Co-Authored-by: Hsiangkai Wang <Hsiangkai@gmail.com>
Reviewers: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117647
This patch adds widening support for ISD::VP_MERGE, which widens
identically to VP_SELECT and similarly to other select-like nodes.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118030
This patch adds splitting support for ISD::VP_MERGE, which splits
identically to VP_SELECT and similarly to other select-like nodes.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118032
Split these nodes in a similar way as their masked versions.
Reviewed By: frasercrmck, craig.topper
Differential Revision: https://reviews.llvm.org/D117760
If the bitreverse gets expanded, it will introduce a new bswap. By
putting a bswap before the bitreverse, we can ensure it gets cancelled
out when this happens.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118012
This can show up when bitreverse is expanded to bswap and a
swap of bits within each byte. If the input is already a bswap, we
should cancel them out before we further transform them in a way
that makes it harder to see the redundancy.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D118007
Add might be faster than shift. We can't do this earlier without
using a Freeze instruction.
This is the intrinsic version of D106689.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118013
This patch introduces new intrinsics that enable the use of vsetvli in
contexts where only the returned vector length is of interest. The
pre-existing intrinsics are marked with side-effects, which prevents
even trivial optimizations on/across them.
These intrinsics are intended to be used in situations where the vector
length is fed in turn to RVV intrinsics or to vector-predication
intrinsics during loop vectorization, for example. Those codegen paths
ensure that instructions are generated with their own implicit vsetvli,
so the vector length and vtype can be relied upon to be correct.
No corresponding C builtins are planned at this stage, though that is a
possibility for the future if the need arises.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117910
This patch adds MC layer support for the Zbkx extension from the K extension (v1.0.0).
Instructions with the same functionality and the same encoding are already defined in the bitmanip extension.
It defines {Xperm8, Xperm4} as instruction aliases for xperm.* in the Zbp extension. When Zbkx is enabled but Zbp is not, xperm.h will not be available. When both Zbkx and Zbp are enabled, the instructions will be decoded in Zbp format.
[[ https://reviews.llvm.org/D94999 | D94999 ]] is the patch that introduced the xperm.* instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117889
This patch adds lowering of the llvm.vp.merge.* intrinsic
(ISD::VP_MERGE) to RVV vmerge/vfmerge instructions. It introduces a
special pseudo form of vmerge which allows a tied merge operand,
allowing us to specify the tail elements as being equal to the "on
false" operand, using a tied-def constraint and a "tail undisturbed"
policy.
While this strategy allows us to often lower the intrinsic to just one
instruction, it may be less efficient in fixed-vector types as the
number of tail elements may extend far beyond the length of the fixed
vector. Another strategy could be to use a vmerge/vfmerge instruction
with an AVL equal to the length of the vector type, and manipulate the
condition operand such that mask elements greater than the operation's
EVL are false.
I've also observed inefficient codegen in which our 'VF' patterns don't
match raw floating-point SPLAT_VECTORs, which occur in scalable-vector
code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117561
This patch follows up on D117697 to help the simple binary operations
behave similarly in the presence of masks.
It also enables CGP sinking support for vp.fdiv and vp.fsub intrinsics,
now that VFRDIV and VFRSUB are consistently matched with a LHS splat for
masked and unmasked variants.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117783
According to the spec, there are some differences between V and Zve64d. For example, the vmulh integer multiply variants that return the high word of the product (vmulh.vv, vmulh.vx, vmulhu.vv, vmulhu.vx, vmulhsu.vv, vmulhsu.vx) are not included for EEW=64 in Zve64*, but the V extension does support these instructions. So we should decouple the Zve* extensions and the V extension.
Differential Revision: https://reviews.llvm.org/D117854
This commit implements support for the scalar cryptography extension in LLVM according to version v1.0.0 of the [K Ext specification](https://github.com/riscv/riscv-crypto/releases) (scalar crypto has already been ratified). Currently we are implementing the MC (Machine Code) layer of the extension, and the majority of the work is done under the `llvm/lib/Target/RISCV` directory. There are also some test files in the `llvm/test/MC/RISCV` directory.
The Zbk* subfeatures, which conflict with the B extensions, are removed to reduce the size of the patch
(Zbk* will be resubmitted after this patch has been merged).
**Co-authors:** @ksyx & @VincentWu & @lihongliang & @achieveartificialintelligence
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D98136
We don't optimize this as well as we could. Bitreverse is always
expanded to bswap and a shift/and/or sequence to swap bits within a
byte. The newly created bswap will either become a shift/and/or
sequence or a rev8 instruction. We don't always realize the bswap is
redundant with another bswap before or after the bitreverse.
Found while thinking about the brev8 instruction from the
Cryptography extension. It's equivalent to bswap(bitreverse(x)) or
bitreverse(bswap(x)).
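A small C sketch (assumed helper names, not from the patch) of the
equivalence: brev8 reverses the bits within each byte, which is a full
bitreverse composed with a bswap:

```c
#include <stdint.h>

/* Reverse the bits within each byte of a 32-bit word (what brev8 computes).
   None of the shifts cross a byte boundary. */
static uint32_t brev8_32(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u); /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u); /* swap bit pairs */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu); /* swap nibbles */
    return x;
}
/* A full 32-bit bitreverse is the same three swaps plus a byte swap, so
   brev8_32(x) == bswap32(bitreverse32(x)) == bitreverse32(bswap32(x)). */
```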
Rename to include bitreverse. Add additional tests and Zbb command lines.
There's some overlapping tests with rv32zbb.ll and rv64zbb.ll. Maybe
I'll clean that up in a future patch.
Instead of having a test for i32 XLen and i64 XLen, use sed to
replace iXLen with i32/i64 before running llc.
This change covers all of the floating point tests.
Instead of having a test for i32 XLen and i64 XLen, use sed to
replace iXLen with i32/i64 before running llc.
This change updates tests for intrinsics that operate exclusively
on mask values. It removes over 4000 lines worth of test content.
More merging will come in future changes.
Differential Revision: https://reviews.llvm.org/D117968
This patch brings better splat-matching to our VP support, by sinking
splat operands of VP intrinsics back into the same block as the VP
operation. The list of VP intrinsics we are interested in matches that
of the regular instructions.
Some optimization is still lacking. For instance, our VL nodes aren't
recognized as commutative, so splats must be on the RHS. Because of
this, we limit our sinking of splats to just the RHS operand for now.
Improvement in this regard can come in another patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117703
After D86836, we can define multiple cost values for
different cost models. So here we set CostPerUse to
1 iff RVC is enabled to avoid potential impact on RA.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117741
Zbkb makes some encodings of the general grevi, shfli, and
unshfli instructions legal, so we added separate instructions for
those encodings to improve the diagnostics for the assembler and
disassembler. To be consistent we should always use these separate
instructions whenever those specific encodings of grevi/shfli/unshfli
occur, so this patch adds specific isel patterns to override the generic
isel patterns for these cases. Similar work was done previously for
rev8 and zext.h in Zbb.
This commit adds instruction support for `zbkb`, which is defined in the scalar cryptography extension version v1.0.0 (already ratified).
Most of the Zbkb definitions reuse parts of the Zbp and Zbb definitions, so this patch just modifies some of the instruction aliases and predicates.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117640
Originally, hasRVVFrameObject() scanned all the stack objects to check
whether there was any scalable vector object on the stack.
However, this caused errors in the register allocator. In issue 53016, it
returns false before RA because there are no RVV stack objects, but it
returns true after RA because spill slots for RVV values are created during RA.
Because of this inconsistent behavior, the compiler does not reserve BP during
register allocation but then generates BP accesses in the PEI pass.
The function is changed to use hasStdExtV() as the return value. It is
not precise, but it makes register allocation correct.
Refer to https://github.com/llvm/llvm-project/issues/53016.
Differential Revision: https://reviews.llvm.org/D117663
This is needed to properly limit fractional LMULs for Zve32.
Add new Zve32 RUN lines to the existing tests for the
-riscv-v-fixed-length-vector-elen-max command line option.
RISCV only has a unary shuffle that requires placing indices in a
register. For interleaving two vectors this means we need at least
two vrgathers and a vmerge to do a shuffle of two vectors.
This patch teaches shuffle lowering to use a widening addu followed
by a widening vmaccu to implement the interleave. First we extract
the low half of both V1 and V2. Then we implement
(zext(V1) + zext(V2)) + (zext(V2) * zext(2^eltbits - 1)) which
simplifies to (zext(V1) + (zext(V2) * 2^eltbits)). This further
simplifies to (zext(V1) + (zext(V2) << eltbits)). Then we bitcast the
result back to the original type splitting the wide elements in half.
We can only do this if we have a type with wider elements available.
Because we're using extends we also have to be careful with fractional
lmuls. Floating point types are supported by bitcasting to/from integer.
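A scalar worked example (illustrative C, not from the patch) of how the
widening arithmetic packs one element of each source into a double-width
element:

```c
#include <stdint.h>

/* One lane of the interleave: zext(v1) + (zext(v2) << eltbits) puts v1 in the
   low half and v2 in the high half of the widened element, which is the
   interleaved pair after bitcasting back to the narrow element type. */
uint32_t interleave_lane(uint16_t v1, uint16_t v2) {
    /* Equivalently: zext(v1) + zext(v2) + zext(v2) * 0xFFFF, as formed by the
       widening add followed by the widening multiply-accumulate. */
    return (uint32_t)v1 + ((uint32_t)v2 << 16);
}
```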
The tests test a varied combination of LMULs split across VLEN>=128 and
VLEN>=512 tests. There are a few tests with shuffle indices commuted as
well as tests for undef indices. There's one test for a vXi64/vXf64
vector which we can't optimize, but it verifies we don't crash.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117743
This string no longer appears in the Vector Extension specification.
The segment load/store instructions are just part of the vector
instruction set.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D117724
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This is similar to D116771, but for the saturating conversions.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
I'm only handling saturating to i64 or i32. This could be extended
to other sizes in the future.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116864
This patch adds a variety of tests checking that we can match
vector/scalar instructions against masked VP intrinsics when the splat
is on the LHS. At this stage, we can't, despite us having
ostensibly-commutable ISel patterns for them. The use of V0 as the mask
operand interferes with the auto-generated ISel table.
This brings floating-point RVV vector/scalar support more in line with
the integer vector patterns, which can already match '.vx' instructions
with masked operations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117697
This idea has come up in several reviews -- D115978 and D105902 -- so I
can't take any credit for the idea. Instead of using a constant pool to
lower -0.0, we can emit a sequence of two instructions:
fmv.[hwd].x freg, zero
fsgnjn.[hsd] freg, freg, freg
This is only done when the floating-point type is legal.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117687
`zve` is the new standard vector extension that specifies varying degrees of
vector support for embedded processors. The `zve` extension is related
to the `zvl` extension and other updates that are added in v1.0.
According to https://github.com/riscv-non-isa/riscv-c-api-doc/pull/21,
Clang defines the macros `__riscv_v_max_elen` and `__riscv_v_max_elen_fp` for
`zve`, and they can be used by applications that use the vector extension.
Authored by: Zakk Chen <zakk.chen@sifive.com> @khchen
Co-Authored by: Eop Chen <eop.chen@sifive.com> @eopXD
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D112408
For instructions without operands, the final `AsmToken::EndOfStatement`
wasn't being consumed. In the context of inline assembly, the resulting
empty statements would cause extraneous empty lines to be emitted. Fix
the issue by consuming the `EndOfStatement` token.
Differential Revision: https://reviews.llvm.org/D117565
We may not be allowed to use vXiXLen vectors. Consult ELEN to
determine what is allowed. This will become even more important
when Zve32 is added.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D117518
When widening these intrinsics, we do not have to insert neutral
elements at the end of the vector as when widening vector.reduce.*
intrinsics, thanks to vector predication semantics.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117467
Split vp.reduction.* intrinsics by splitting the vector to reduce in
two halves, perform the reduction operation in each one of them and
accumulate the results of both operations.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117469
Currently, users expect VL to be the last operand. However, since some
intrinsics have a tail policy in the last operand, this rule cannot be
used anymore.
Reviewed By: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117452
For fixed vectors, the undef will get expanded to an all zeros
build_vector. We don't want that so suppress creating the
insert_subvector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117379
We were considering this legal, but later the undef would become an all
zeros vector. This would cause us to need to re-legalize the insert later
into a vslideup with zero vector.
This patch catches the case and directly legalizes it to a scalable
insert.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117377
Original patch by @hussainjk.
This patch was split off from D109377 to keep vector legalization
(widening/splitting) separate from vector element legalization
(promoting).
While the original patch added a third overload of
SelectionDAG::getVPStore, this patch takes the liberty of collapsing
those all down to 1, as three overloads seem excessive for a
little-used node.
The original patch also used ModifyToType in places, but that method
still crashes on scalable vector types. Seeing as the other VP
legalization methods only work when all operands need identical
widening, this patch follows in that vein.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117235
It seems that D115709 mis-deleted the testcases for the vector
extension. This commit adds them back.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117353
`zvl` is the new standard vector extension that specifies the minimum vector length of the vector extension.
The `zvl` extension is related to the `zve` extension and other updates that are added in v1.0.
According to https://github.com/riscv-non-isa/riscv-c-api-doc/pull/21,
Clang defines the macro `__riscv_v_min_vlen` for `zvl`, and it can be used by applications that use the vector extension.
LLVM checks whether the option `riscv-v-vector-bits-min` (if specified) matches the `zvl*` extension specified.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108694
Specifically the unary shuffle case where the elements being
shifted in are undef. This handles the shuffles produced by expanding
llvm.reduce.mul.
I did not reduce the VL which would increase the number of vsetvlis,
but may improve the execution speed. We'd also want to narrow the
multiplies so we could share vsetvlis between the vslidedown.vi and
the next multiply.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117239
It appears the code here was written for the inline asm clobbering
a specific register, but it also gets used for named input and
output registers.
For the input and output case, we should honor the VT so we
don't insert conversion instructions around the inline assembly.
For the clobber case, we need to pick the largest register class.
Reviewed By: asb, jrtc27
Differential Revision: https://reviews.llvm.org/D117279
We can use vmv.v.i/vmv.v.x with EEW=32 to lower an i64 splat vector if the i64 constant scalar can be split into two identical i32 scalars.
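A C sketch of the check (hypothetical helper, not from the patch): the i64
constant qualifies when its two 32-bit halves are identical:

```c
#include <stdbool.h>
#include <stdint.h>

/* An i64 splat constant can be materialized with an EEW=32 vmv.v.x/vmv.v.i
   when the low and high 32-bit halves of the constant are the same. */
static bool splattable_with_eew32(int64_t c) {
    return (uint32_t)c == (uint32_t)((uint64_t)c >> 32);
}
```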
Differential Revision: https://reviews.llvm.org/D117079
Using named registers as input or output constraints creates fcvt.d.s
and fcvt.s.d instructions around the inline assembly.
This makes the data unusable by the inline assembly and corrupts
the results of the inline assembly.
Agreed policy is that RISC-V extensions that have not yet been ratified
should be marked as experimental, and enabling them requires the use of
the -menable-experimental-extensions flag when using clang, along with
the extension version number. These extensions have now been ratified, so this is no
longer necessary, and the target feature names can be renamed to no
longer be prefixed with "experimental-".
Differential Revision: https://reviews.llvm.org/D117131
This adds support for STRICT_FSETCC(quiet) and STRICT_FSETCCS(signaling).
FEQ matches well to STRICT_FSETCC oeq.
FLT/FLE matches well to STRICT_FSETCCS olt/ole.
Others require commuting operands or multiple instructions.
STRICT_FSETCC olt/ole/ogt/oge/ult/ule/ugt/uge uses FLT/FLE,
but we need to save/restore FFLAGS around them to avoid spurious
exceptions. I've implemented pseudo instructions with a
CustomInserter to insert the save/restore CSR instructions.
Unfortunately, this doesn't honor exceptions for signaling NANs
but I'm not sure if signaling nans are really supported by the
constrained intrinsics.
STRICT_FSETCC one and ueq expand to a pair of FLT instructions
with a save/restore of fflags around each. This could be improved
in the future.
There may be some opportunities to generate better code for strict
comparisons mixed with nonans fast math flags. I've left FIXMEs in
the .td files for that.
Co-Authored-by: ShihPo Hung <shihpo.hung@sifive.com>
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D116694
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
We only handle XLen types except i32 on RV64. i32 will be type
legalized to a RISCVISD node. All other types will be type legalized
to XLen and maintain the FP_TO_SINT/UINT ISD opcode.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116771
The code can only address the whole RV32 address space or the lower 2 GiB
of the RV64 address space in the small code model, so 32-bit entries are enough.
Cache hit ratio and code size both see some improvement.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116435
When `Zbt` is enabled, we can generate SELECT for division by power
of 2, so that there is no data dependency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D114856
For vmsgeu.vi with 0, we know this is always true. So we can replace
it with vmset.m (unmasked) or vmset.m+vmand.mm (masked).
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D116584
Currently the backend materializes a VL of ~0 into a register with a load-immediate instruction; we can use X0 instead.
Differential Revision: https://reviews.llvm.org/D116798
Currently AND is used for zero extension when neither Zbb nor Zbp is enabled.
It may be better to use shift operations if the trailing-ones mask exceeds simm12.
This patch optimizes LUI+ADDI+AND to SLLI+SRLI.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D116720
When we want to create a vector in which only the first element is initialized, we can use vmv.s.x or vfmv.s.f to build it.
Differential Revision: https://reviews.llvm.org/D116277
These tests are interested in the FP instructions being used, not
the conversions needed to pass the arguments/returns in GPRs.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116869
This can be generalized to (srl (and X, C2), C) ->
(srli (slli X, XLen-C3), (XLen-C3) + C), where C2 is a mask with
C3 trailing ones.
This can avoid constant materialization for C2. This is beneficial
even when C2 can be selected to ANDI because the SLLI can become
C.SLLI, but C.ANDI cannot cover all the immediates of ANDI.
This also enables CSE in some cases of i8 sdiv by constant codegen.
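A worked C example (illustrative, with concrete constants) of the identity:
with C2 a mask of C3 trailing ones, masking then shifting right equals
shifting left by XLen-C3 and logically shifting right by (XLen-C3)+C:

```c
#include <stdint.h>

/* XLen = 64, C2 = 0x3FFFF (C3 = 18 trailing ones), C = 5:
   (x & 0x3FFFF) >> 5  ==  (x << (64 - 18)) >> ((64 - 18) + 5).
   The SLLI+SRLI form avoids materializing the mask constant. */
uint64_t and_then_srl(uint64_t x)   { return (x & 0x3FFFFu) >> 5; }
uint64_t slli_then_srli(uint64_t x) { return (x << 46) >> 51; }
```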
Similar for (sra (sext_inreg X, i8), C).
With Zbb, sext_inreg of i8 and i16 are legal for sext.b and sext.h.
This transform makes the Zbb codegen the same as without Zbb. The
shifts are more compressible. This also exposes an opportunity for
CSE with another slli in the i16 sdiv by constant codegen.
These nodes should saturate to their saturating VT. We can use this
information to know the bits past the VT are all zeros or all sign bits.
I think we might only have test coverage for the unsigned case. I'll
verify and add tests.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D116870
Split vp.select in a similar way as vselect, splitting also the length
parameter.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D116651
The 0 immediate can't be selected to vmsgtu.vi/vmsleu.vi by decrementing
the immediate. To prevent this we had special patterns that provided
alternate lowering for the 0 cases. This relied on tablegen prioritizing
the 0 pattern over the simm5_plus1 range.
This patch introduces simm5_plus1_nonzero that excludes 0. It also
excludes the special case for vmsltu.vi since we can just use
vmsltu.vx and let the 0 be selected to X0.
This is an alternative to some of the changes in D116584.
Reviewed By: Chenbing.Zheng, asb
Differential Revision: https://reviews.llvm.org/D116723
Function calls and compare instructions tend to cause sext.w
instructions to be inserted. If we make good use of W instructions,
these operations can often end up being redundant. We don't always
detect these during SelectionDAG due to things like phis. There are also
some cases caused by a failure to turn an extload into a sextload in
SelectionDAG. The extload selects to LW, allowing later sext.w
instructions to become redundant.
This patch adds a pass that examines the input of sext.w instructions trying
to determine if it is already sign extended. Either by finding a
W instruction, other instructions that produce a sign extended result,
or looking through instructions that propagate sign bits. It uses
a worklist and visited set to search as far back as necessary.
Reviewed By: asb, kito-cheng
Differential Revision: https://reviews.llvm.org/D116397
The zextload hook is only used to determine whether to insert a
zero_extend or any_extend for narrow types leaving a basic block.
Returning true from this hook tends to cause any load whose output
leaves the basic block to become an LWU instead of an LW.
Since we tend to prefer sexts for i32 compares on RV64, this can
cause extra sext.w instructions to be created in other basic blocks.
If we use LW instead of LWU this gives the MIR pass from D116397
a better chance of removing them.
Another option might be to teach getPreferredExtendForValue in
FunctionLoweringInfo.cpp about our preference for sign_extend of
i32 compares. That would cause SIGN_EXTEND to be chosen for any
value used by a compare instead of using the isZExtFree heuristic.
That will require code to convert from the llvm::Type* to EVT/MVT
as well as querying the type legalization actions to get the
promoted type in order to call TargetLowering::isSExtCheaperThanZExt.
That seemed like many extra steps when no other target wants it.
Though it would avoid us needing to lean on the MIR pass in some cases.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116567
Unsigned compares work with either zero extended or sign extended
inputs just like equality comparisons. I didn't allow this when
I refactored the code in D116421 due to lack of tests. But I've
since found a simple C test case that demonstrates when this can be
useful.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D116617
Previously we only recognized strided loads/stores when the initial
value for the phi was a strided constant vector.
This patch extends the support to a strided_constant added to a
splatted value. The rewritten loop will add the splat value to the
first element of the strided constant vector to use as the scalar
start value. The stride is unaffected.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D115958
This is similar to what is done for targets that prefer zero extend
where we avoid using a zero extend if the promoted values are sign
extended.
We'll also check for zero extended operands for ugt, ult, uge, and ule when the
target prefers sign extend. This is different than preferring zero extend, where
we only check for sign bits on equality comparisons.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D116421
For large integers (for example, magic numbers generated by
TargetLowering::BuildSDIV when dividing by constant), we may
need about 4~8 instructions to build them.
At the same time, it takes just two instructions to load a
constant (with extra cycles to access memory), so it may be
profitable to put these integers into the constant pool.
Reviewed By: asb, craig.topper
Differential Revision: https://reviews.llvm.org/D114950
This patch adds isel support for STRICT_LRINT/LLRINT/LROUND/LLROUND.
It also adds test cases for f32 and f64 constrained intrinsics that
correspond to the intrinsics in float-intrinsics.ll and
double-intrinsics.ll. Support for promoting the integer argument of
STRICT_FPOWI was added.
I've skipped adding tests for f16 intrinsics, since we don't have libcalls
for them and we have inconsistent support for promoting them in LegalizeDAG.
This will need to be examined more closely.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116323