SCEV makes a logical mistake when handling EitherMayExit in the
case when both conditions must be met to exit the loop. The
mistake is the following: "if condition `A` fails within at most `X` first
iterations, and `B` fails within at most `Y` first iterations, then `A & B`
fails within at most `min(X, Y)` first iterations". This is wrong, because
both of them must fail at the same time.
A simple example illustrating this: we have an IV with step 1,
condition `A` = "IV is even", condition `B` = "IV is odd". Both `A` and `B`
will fail within the first two iterations. But that does not mean that both of
them will fail at the same time within the first two iterations, which would
mean that the IV is neither even nor odd at some point within the first two
iterations.
We can only do so for known exact BE counts, but not for max.
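A minimal sketch of that scenario (hypothetical code, not taken from the
patch): each condition individually fails within any two consecutive
iterations, but they never fail together, so a loop that keeps running while
either holds never exits, even though the min-of-max reasoning would report a
bound of 2.

  // Hypothetical illustration only. A = "iv is even", B = "iv is odd".
  // The loop continues while (A || B) and exits only when both fail at once.
  #include <cassert>

  int main() {
    for (unsigned iv = 0; iv < 10; ++iv) {
      bool A = (iv % 2 == 0);
      bool B = (iv % 2 == 1);
      // Each of A and B fails somewhere within the first two iterations,
      // but the combined exit condition (!A && !B) never becomes true.
      assert(A || B);
    }
    return 0;
  }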
Differential Revision: https://reviews.llvm.org/D91942
Reviewed By: nikic
This might be a regression for some ARM targets, but that should
be changed in the target-specific overrides.
There is apparently still no default lowering for these nodes,
so I am assuming these intrinsics are not in common use.
X86, PowerPC, and RISC-V, for example, just crash given the most
basic IR.
Test a few more variations:
* NoAlias with different strides
* MustAlias without loop
* MustAlias with same stride
* MustAlias base pointers with different stride
This is re-applying a combination of f7eac51b9b and 8ec7ea3ddc as one patch
to avoid regressions now that we have better testing in place.
Those were reverted with 32dd5870ee because of crashes on experimental intrinsics.
That bug should be fixed with 7ae346434.
Paraphrased original commit messages:
This is the last step in removing cost-kind as a consideration in the
basic class model for intrinsics.
See D89461 for the start of that.
Subsequent commits dealt with each of the special-case intrinsics that
had customization here in the basic class. This should remove a barrier
to retrying D87188 (canonicalization to the abs intrinsic).
The ARM and x86 cost diffs seen here may be wrong because the
target-specific overrides have their own bugs, but we hope this is
less wrong - if something has a significant throughput cost, then it
should have a significant size / blended cost too by default.
The only behavioral diff in current regression tests is shown in the
x86 scatter-gather test (which is misplaced or broken because it runs
the entire -O3 pipeline) - we unrolled less, and we assume that is
an improvement.
Exception: in general, we want the *size* cost for a scalar call to be
cheap even if the other costs are expensive - we expect it to just be
a branch with some optional stack manipulation.
It is likely that we will want to carve out some
exceptions/overrides to this rule as follow-up patches for
calls that have some general and/or target-specific difference
to the expected lowering.
This was noticed as a regression in unrolling, so we have a test
for that now along with a couple of direct cost model tests.
If the assumed scalarization costs for the oversized vector
calls are not realistic, that would be another follow-up
refinement of the cost models.
Differential Revision: https://reviews.llvm.org/D90554
The constrained intrinsics have metadata arguments, so the
tests here were crashing as noted in D90554 (and that was
reverted even though this bug exists independently of that
change).
This is a partial un-revert of 32dd5870ee (originally df09f82599).
I'm adding back the baseline tests first, so we don't have
to back-track as much in case there are still problems.
as it's causing crashes in the optimizer. A reduced testcase has been posted as a follow-up.
This reverts commit f7eac51b9b.
Temporarily Revert "[CostModel] make default size cost for libcalls small (again)" as it depends upon the primary revert.
This reverts commit 8ec7ea3ddc.
Temporarily Revert "[CostModel] add tests for math library calls; NFC" as it depends upon the primary revert.
This reverts commit df09f82599.
Temporarily Revert "[LoopUnroll] add test for full unroll that is sensitive to cost-model; NFC" as it depends upon the primary revert.
This reverts commit 618d555e8d.
Similarly to assumes and guards, deoptimize intrinsics are
marked as writing to ensure proper control dependencies,
but they never modify any particular memory location.
Differential Revision: https://reviews.llvm.org/D91658
This is achieved through a substitution of FileCheck in lit.cfg.py,
where we explicitly set -allow-unused-prefixes to false.
We also introduce a %FileCheckWithUnusedPrefixes% substitution that can
be used in those cases where we want to allow unused prefixes, even if
the folder policy is to disallow them.
Differential Revision: https://reviews.llvm.org/D91275
The GEP aliasing implementation currently has two pieces of code
that solve two different subsets of the same basic problem: If you
have GEPs with offsets 4*x + 0 and 4*y + 1 (assuming access size 1),
then they do not alias regardless of whether x and y are the same.
One implementation is in aliasSameBasePointerGEPs(), which looks at
this in a limited structural way. It requires both GEP base pointers
to be exactly the same, then (optionally) a number of equal indexes,
then an unknown index, then a non-equal index into a struct. This
set of limitations works, but it's overly restrictive and hides the
core property we're trying to exploit.
The second implementation is part of aliasGEP() itself and tries to
find a common modulus in the scales, so it can then check that the
constant offset doesn't overlap under modular arithmetic. The second
implementation has the right idea of what the general problem is,
but effectively only considers power-of-two factors in the scales
(while aliasSameBasePointerGEPs also works with non-power-of-two struct sizes).
What this patch does is to adjust the aliasGEP() implementation to
instead find the largest common factor in all the scales (i.e. the GCD)
and use that as the modulus.
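As a rough sketch of the modular-arithmetic argument (illustrative code with
invented names, not the actual BasicAA implementation): all variable terms are
multiples of the GCD of the scales, so the two addresses differ by the constant
offset modulo that GCD, and the access ranges can be compared within one
GCD-sized window.

  // Illustrative sketch only; function and parameter names are invented.
  #include <cassert>
  #include <cstdint>
  #include <numeric>
  #include <vector>

  // Access A sits at residue 0 within each Modulus-sized window, access B at
  // residue (OffsetDiff % Modulus). Returns true if they provably cannot
  // overlap for any values of the variable indices.
  bool noAliasViaGCD(const std::vector<uint64_t> &Scales, uint64_t OffsetDiff,
                     uint64_t SizeA, uint64_t SizeB) {
    uint64_t Modulus = 0;
    for (uint64_t S : Scales)
      Modulus = std::gcd(Modulus, S);
    if (Modulus == 0)
      return false; // no variable indices; handled by other logic
    uint64_t Rem = OffsetDiff % Modulus;
    // B must start at or after A's end, and must not wrap back into A.
    return Rem >= SizeA && Modulus - Rem >= SizeB;
  }

  int main() {
    // The example from the text: offsets 4*x + 0 and 4*y + 1, access size 1.
    assert(noAliasViaGCD({4, 4}, 1, 1, 1));
    return 0;
  }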
Differential Revision: https://reviews.llvm.org/D91027
We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.
We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW.
I think we might be able to use this for scalars too by moving into a vector and back, but I'll save that for a follow-up as it's a little more involved.
Reviewed By: RKSimon, pengfei
Differential Revision: https://reviews.llvm.org/D91515
Constant hoisting may hide the constant value behind a bitcast of the And's
operand. Track down the constant to make the BFI result consistent
regardless of hoisting.
Differential Revision: https://reviews.llvm.org/D91450
aliasGEP() currently implements some special handling for the case
where all variable offsets are positive, in which case the constant
offset can be taken as the minimal offset. However, it does not
perform the same handling for the all-negative case. This means that
the alias-analysis result between two GEPs is asymmetric:
If GEP1 - GEP2 is all-positive, then GEP2 - GEP1 is all-negative,
and the first will result in NoAlias, while the second will result
in MayAlias.
Apart from producing sub-optimal results for one order, this also
violates our caching assumption. In particular, if BatchAA is used,
the cached result depends on the order of the GEPs in the first query.
This results in an inconsistency in BatchAA and AA results, which
is how I noticed this issue in the first place.
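A tiny sketch of the intended symmetry (invented helper, not the actual
BasicAA code): with GEP1 = GEP2 + ConstOffset + (variable offsets), the
all-non-negative and all-non-positive cases are mirror images of each other.

  // Illustrative only; names and signature are invented for this sketch.
  #include <cstdint>

  // ConstOffset is GEP1's constant offset relative to GEP2; Size1/Size2 are
  // the access sizes at GEP1/GEP2.
  bool noAliasViaSign(int64_t ConstOffset, bool AllVarsNonNegative,
                      bool AllVarsNonPositive, uint64_t Size1, uint64_t Size2) {
    // All variable offsets >= 0: GEP1 starts at least ConstOffset bytes past
    // GEP2, so it cannot reach back into GEP2's access.
    if (AllVarsNonNegative && ConstOffset >= (int64_t)Size2)
      return true;
    // Mirror image: all variable offsets <= 0 and GEP1's access ends at or
    // before GEP2 begins.
    if (AllVarsNonPositive && -ConstOffset >= (int64_t)Size1)
      return true;
    return false;
  }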
Differential Revision: https://reviews.llvm.org/D91383
This patch fixes the function isWideningInstruction for scalable vectors.
Now the cost model can check the widening pattern for SVE.
Differential Revision: https://reviews.llvm.org/D91260
In an effort to make code around flag determination more readable, and (possibly) prepare for a follow up change, factor out some of the flag detection logic. In the process, reduce the number of locations we mutate wrap flags by a couple.
Note that this isn't NFC. The old code tried for NSW xor (NUW || NW). That is, two different paths computed different sets of wrap flags. The new code will try for all three. The result is that some expressions end up with a few extra flags set.
This was changed recently with D90554 / f7eac51b9b
...because we had a regression-testing blind spot for intrinsics
that are expected to be lowered to libcalls.
In general, we want the *size* cost for a scalar call to be cheap
even if the other costs are expensive - we expect it to just be
a branch with some optional stack manipulation.
It is likely that we will want to carve out some
exceptions/overrides to this rule as follow-up patches for
calls that have some general and/or target-specific difference
to the expected lowering.
This was noticed as a regression in unrolling, so we have a test
for that now along with a couple of direct cost model tests.
If the assumed scalarization costs for the oversized vector
calls are not realistic, that would be another follow-up
refinement of the cost models.
The SCEV code for constructing GEP expressions currently assumes
that the addition of the base and all the offsets is nsw if the GEP
is inbounds. While the addition of the offsets is indeed nsw, the
addition to the base address is not, as the base address is
interpreted as an unsigned value.
Fix the GEP expression code to not assume nsw for the base+offset
calculation. However, do assume nuw if we know that the offset is
non-negative. With this, we use the same behavior as the
construction of GEP addrecs does. (Modulo the fact that we
disregard SCEV unification, as the pre-existing FIXME points out).
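To illustrate why the base does not participate in the nsw reasoning (toy
numbers, not from the patch): the base address is an unsigned quantity, so an
inbounds GEP may wrap in the signed sense even though the unsigned addition is
fine.

  // Toy illustration only (32-bit addresses for brevity).
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t Base = 0x7ffffffcu; // hypothetical object address, near INT32_MAX
    uint32_t Offset = 8;         // non-negative offset from an inbounds GEP
    uint32_t Addr = Base + Offset;
    // Unsigned: 0x7ffffffc + 8 = 0x80000004, no wrap, so nuw is justified
    // when the offset is known non-negative.
    // Signed: 2147483644 + 8 overflows int32_t, so nsw must not be assumed
    // for the base + offset addition.
    printf("0x%08x\n", Addr);
    return 0;
  }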
Differential Revision: https://reviews.llvm.org/D90648
The GEP aliasing code currently checks for the GEP decomposition
limit being reached (i.e., we did not reach the "final" underlying
object). As far as I can see, these checks are not necessary. It is
perfectly fine to work with a GEP whose base can still be further
decomposed.
Looking back through the commit history, these checks were originally
introduced in 1a444489e9. However, I
believe that the problem this was intended to address was later
properly fixed with 1726fc698c, and
the checks are no longer necessary since then (and were not the
right fix in the first place).
Differential Revision: https://reviews.llvm.org/D91010
Summary:
Expand the print-memoryssa and print<memoryssa> passes with a new hidden
option -cfg-dot-mssa that names a file. When set, a dot-cfg style file
will be generated into the named file with the memoryssa comments retained
and those blocks containing them shown in light pink. The option does
nothing in isolation.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea), dblaikie (David Blaikie)
Differential Revision: https://reviews.llvm.org/D90638
Summary:
Expand the print-memoryssa and print<memoryssa> passes with a new hidden
option -cfg-dot-mssa that names a file. When set, a dot-cfg style file
will be generated into the named file with the memoryssa comments retained
and those blocks containing them shown in light pink. The option does
nothing in isolation.
Author: Jamie Schmeiser <schmeise@ca.ibm.com>
Reviewed By: asbirlea (Alina Sbirlea), dblaikie (David Blaikie)
Differential Revision: https://reviews.llvm.org/D90638
This is the last step in removing cost-kind as a consideration in the basic class model for intrinsics.
See D89461 for the start of that.
Subsequent commits dealt with each of the special-case intrinsics that had customization here in the
basic class. This should remove a barrier to retrying
D87188 (canonicalization to the abs intrinsic).
The ARM and x86 cost diffs seen here may be wrong because the target-specific overrides have their own
bugs, but we hope this is less wrong - if something has a significant throughput cost, then it should
have a significant size / blended cost too by default.
The only behavioral diff in current regression tests is shown in the x86 scatter-gather test (which is
misplaced or broken because it runs the entire -O3 pipeline) - we unrolled less, and we assume that is
an improvement.
Differential Revision: https://reviews.llvm.org/D90554
Our range computation methods benefit from no-wrap flags. But if the ranges
were first computed before the flags were set, the cached range will be too
pessimistic.
We need to drop cached ranges whenever we sharpen AddRec's no wrap flags.
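A schematic of the caching hazard (invented names, not the ScalarEvolution
code): if the range for an expression is memoized before stronger flags are
proven, the memoized value stays pessimistic unless it is explicitly dropped
when the flags are updated.

  // Schematic only; the real logic lives in ScalarEvolution's range caches.
  #include <map>
  #include <utility>

  struct Expr { unsigned NoWrapFlags = 0; };
  using Range = std::pair<long, long>; // stand-in for ConstantRange

  struct RangeCache {
    std::map<const Expr *, Range> Memo;

    Range compute(const Expr *E) {
      // Placeholder: stronger no-wrap flags allow a tighter range.
      return E->NoWrapFlags ? Range{0, 100} : Range{-1000, 1000};
    }

    Range getRange(const Expr *E) {
      auto It = Memo.find(E);
      if (It != Memo.end())
        return It->second; // stale if flags were sharpened after memoization
      return Memo[E] = compute(E);
    }

    void sharpenFlags(Expr *E, unsigned NewFlags) {
      E->NoWrapFlags |= NewFlags;
      Memo.erase(E); // drop the now-too-pessimistic cached range
    }
  };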
Differential Revision: https://reviews.llvm.org/D89847
Reviewed By: fhahn
This is the cmp/sel sibling to D90692.
Again, the reasoning is: the throughput cost is number of instructions/uops,
so size/blended costs are identical except in special cases (for example,
fdiv or other known-expensive machine instructions or things like MVE that
may require cracking into >1 uops).
We need to check for a valid (non-null) condition type parameter because
SimplifyCFG may pass nullptr for that (and so we will crash multiple
regression tests without that check). I'm not sure if passing nullptr makes
sense, but other code in the cost model does appear to check if that param
is set or not.
Differential Revision: https://reviews.llvm.org/D90781
This is based on the same idea that I am using for the basic model implementation
and what I have partly already done for x86: throughput cost is number of
instructions/uops, so size/blended costs are identical except in special cases
(for example, fdiv or other known-expensive machine instructions or things like
MVE that may require cracking into >1 uop).
Differential Revision: https://reviews.llvm.org/D90692
As noted in D90554, there's an opcode typo in using an easily
misused cost model API: getCmpSelInstrCost(). Beyond that, the
assumed sequence of ops is questionable, but that would be
another patch.
My guess is that the x86 test diffs show that we are probably
wrong both before and after this change, so there will be no
practical difference.
As an example, I tried this test which shows a cost of '7'
either way:
define <4 x i32> @sadd(<4 x i32> %va, <4 x i32> %vb) {
%V4I32 = call {<4 x i32>, <4 x i1>} @llvm.sadd.with.overflow.v4i32(<4 x i32> %va, <4 x i32> %vb)
%ov = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 1
%r = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 0
%z = select <4 x i1> %ov, <4 x i32> <i32 42, i32 42, i32 42, i32 42>, <4 x i32> %r
ret <4 x i32> %z
}
declare { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)
$ llc -o - sadd.ll -mattr=avx
vpaddd %xmm1, %xmm0, %xmm2
vpcmpgtd %xmm2, %xmm0, %xmm0
vpxor %xmm0, %xmm1, %xmm0
vblendvps %xmm0, LCPI0_0(%rip), %xmm2, %xmm0
Differential Revision: https://reviews.llvm.org/D90681