Our actual lowering for i1 reductions uses ctpop combined with possibly a vector negate and possibly a logic op afterwards. I believe ctpop to be low cost on all reasonable hardware.
The default costing implementation here was returning quite inconsistent costs. and/or were returning very high costs (because we seem to think moving into scalar registers is very expensive?) and others were returning lower but still too high (because of the assumed tree reduce strategy). While we should probably improve the generic costing strategy for i1 vectors, let's start by fixing the immediate problem.
Differential Revision: https://reviews.llvm.org/D127511
Apparently, update_analyze_test_checks.sh does *not* warn on conflicting CHECKs, it just silently drops those lines from the generated test. That is.. less than helpful.
This prevents them from being assumed legal by the cost model.
This matches what is done for AArch64 SVE.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D123799
This reverts commit 1fbdbb5595.
All known issues surfaced by this patch should have been fixed now.
The fixes included fixing issues with SCEV expansion in LV and DA's
reliance on LCSSA phis.
To represent various loop levels within a nest, DA implements a special
numbering scheme (see comment atop establishNestingLevels). The goal of
this numbering scheme appears to be representing each unique loop
distinctively by using as little memory as possible. This numbering
scheme is simple when the source and destination of the dependence are
in the same loop. In such cases the level is simply the depth of the
loop in which src and dst reside. When the src and dst are not in the
same loop, we could run into the following situation exposed by
https://reviews.llvm.org/D71539. This patch fixes this by detecting
such cases in checkSubscripts and treating them as non-linear/non-affine.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D110973
The MVE shuffle costing for VREV instructions was making incorrect
assumptions as to legalized vector types remaining as vectors. Add a
quick check to ensure they are indeed vectors before attempting to get
the number of elements.
LTO code may end up mixing bitcode files from various sources varying in
their use of opaque pointer types. The current strategy to decide
between opaque / typed pointers upon the first bitcode file loaded does
not work here, since we could be loading a non-opaque bitcode file first
and would then be unable to load any files with opaque pointer types
later.
So for LTO this:
- Adds an `lto::Config::OpaquePointer` option and enforces an upfront
decision between the two modes.
- Adds `-opaque-pointers`/`-no-opaque-pointers` options to the gold
plugin; disabled by default.
- `--opaque-pointers`/`--no-opaque-pointers` options with
`-plugin-opt=-opaque-pointers`/`-plugin-opt=-no-opaque-pointers`
aliases to lld; disabled by default.
- Adds an `-lto-opaque-pointers` option to the `llvm-lto2` tool.
- Changes the clang driver to pass `-plugin-opt=-opaque-pointers` to
the linker in LTO modes when clang was configured with opaque
pointers enabled by default.
This fixes https://github.com/llvm/llvm-project/issues/55377
Differential Revision: https://reviews.llvm.org/D125847
Now that SimpleLoopUnswitch and other transforms no longer introduce
branch on poison, enable the -branch-on-poison-as-ub option by
default. The practical impact of this is mostly better flag
preservation in SCEV, and some freeze instructions no longer being
necessary.
Differential Revision: https://reviews.llvm.org/D125299
Add Value Tracking support to deduce induction variable being a power of 2, allowing urem optimizations
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D126018
Also collect conditions from assume up-front in applyLoopGuards.
This allows re-using the logic to handle logical ANDs as assume
conditions.
It should should pave the road for a fix for #55645.
In this patch we change test cases from using "CHECK" to using
"CHECK-NEXT", which is to ensure the order of loops output by
loop cache analysis is correct. After D124725 we fixed the
non-deterministic output order hence we did not use "CHECK-DAG"
anymore, and now we should really use "CHECK-NEXT" to make sure
the loops in the output loop vector follow the right order.
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124984
We were using the default getScalarizationOverhead expansion for extraction costs, which adds up all the individual element extraction costs.
This is fine for 128-bit vectors, but for 256/512-bit vectors each element extraction also has to account for extracting the upper 128-bit subvector extraction before it can handle the element. For scalarization costs we only need to extract each demanded subvector once.
Differential Revision: https://reviews.llvm.org/D125527
Building on top of D125665, this adds MVE costs for fptosi.sat and
fptoui.sat, providing MVE is available and the types are legal.
Differential Revision: https://reviews.llvm.org/D125666
Similar to D124357, this adds some cost modelling for fptoi_sat for Arm
targets. Where VFP2 is available (and FP64/FP16 for the relevant types),
the operations are legal as the Arm instructions naturally saturate.
Otherwise they will need an extra smin/smax clamp, similar to AArch64.
Differential Revision: https://reviews.llvm.org/D125665
Similar to D123386, this adds D-Movs to the AArch64 perfect shuffle
tables, slightly lowering the costs a little more. This is a rough
improvement in general, especially if you ignore mov v0.16b, v2.16b type
moves that are often artefacts of the calling convention.
The D register movs are encoded as (0x4 | LaneIdx), and to generate a D
register move we are required to bitcast into a higher type, but it is
otherwise very similar to the S-lane mov's already supported.
Differential Revision: https://reviews.llvm.org/D125477
Scaffolding support for generating runtime checks for multiple SCEV expressions
per pointer. The initial version just adds support for looking through
a single pointer select.
The more sophisticated logic for analyzing forks is in D108699
Reviewed By: huntergr
Differential Revision: https://reviews.llvm.org/D114487
Even if the minimum number of elements is 1 and the length doesn't change,
we don't know what vscale is so we can't classify it as identity mask. Instead it
is a zero element splat.
For reverse, we shouldn't classify it as a reverse unless there are at least 2 elements
in the mask. This applies to both fixed and scalable vectors. For fixed vectors, a single
element would be an identity shuffle. For scalable vector it's a zero elt splat.
Reviewed By: sdesmalen, liaolucy
Differential Revision: https://reviews.llvm.org/D124655
Fold %x umin_seq %y to %x if %x ule %y. This also subsumes the
special handling for constant operands, as if %y is constant this
folds to umin via implied poison reasoning, and if %x is constant
then either %x is not zero and it folds to umin, or it is known
zero, in which case it is ule anything.
Fold %x umin_seq %y to %x umin %y if %x cannot be zero. They only
differ in semantics for %x==0.
More generally %x *_seq %y folds to %x * %y if %x cannot be the
saturation fold (though currently we only have umin_seq).
This adds some extra costs for reverse shuffles under AArch64, filling
in the i16/f16/i8 gaps in the cost model.
Differential Revision: https://reviews.llvm.org/D124786
Based off the script from D103695, on AVX1, Jaguar/Bulldozer both have low throughput for ymm select patterns (BLENDV + OR(AND,ANDN))), and even on AVX2 Haswell still struggles with BLENDV ops
The new flag -aarch64-insert-extract-base-cost can be used to
set the value of AArch64Subtarget::getVectorInsertExtractBaseCost(),
for the purposes of experimentation.
Differential Revision: https://reviews.llvm.org/D124835
Similar to how we convert logical and/or to bitwise and/or, we should
also convert umin_seq to umin based on implied poison reasoning. In
%x umin_seq %y, if %y being poison implies %x being poison, then we
don't need the sequential evaluation: Having %y contribute towards
the result will never make the result more poisonous. An important
corollary of this is that if %y is never poison, we also don't need
the sequential evaluation.
This avoids some of the regressions in D124910.
Differential Revision: https://reviews.llvm.org/D124921
Added a motivating test case for D123400 where the loopnest has a
suboptimal loop order j-i-k. After D123400 we ensure that the order
of loop cache analysis output is loop i-j-k, despite the suboptimal
order in the original loopnest.
Reviewed By: bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D122776