Commit Graph

3272 Commits

Author SHA1 Message Date
David Green 255ad73424 [ARM] Make MVE v2i1 predicates legal
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the
two halves. This was never treated as a legal type in llvm in the past
as there are not many 64bit instructions and no 64bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatter and long multiplies and VCTP64
instructions.

This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in a way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way that expensive
things remain expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
where it comes up in practice.

The intrinsics currently remain using the v4i1 they previously did to
emulate a v2i1. This will be changed in a followup patch but this one
was already large enough.

Differential Revision: https://reviews.llvm.org/D114449
2021-12-03 14:05:41 +00:00
Nikita Popov 49d040ac97 [SCEV] Fix ValuesAtScopesUsers consistency
Fixes verification failure reported at:
https://reviews.llvm.org/rGc9f9be0381d1

The issue is that getSCEVAtScope() might compute a result without
inserting it in the ValuesAtScopes map in degenerate cases,
specifically if the ValuesAtScopes entry is invalidated during the
calculation. Arguably we should still insert the result if no
existing placeholder is found, but for now just tweak the logic
to only update ValuesAtScopesUsers if ValuesAtScopes is updated.
2021-12-03 10:03:10 +01:00
Florian Hahn 829b29b619
[MemoryLocation] strcat/strncat/strcpy read/write after their args.
strcpy/strcat/strncat access memory starting from the passed in
pointers. Construct memory locations for their args using getAfter.
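
A minimal sketch of what constructing such a location can look like (assuming
the usual getForArgument shape, with Call, ArgIdx and AATags in scope; not the
exact patch):

  // The accessed region starts at the passed-in pointer and has unknown
  // extent, so describe it as "everything after the pointer".
  return MemoryLocation::getAfter(Call->getArgOperand(ArgIdx), AATags);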

Discussed in D114872.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D114969
2021-12-03 08:48:23 +00:00
Daniil Fukalov ab05ab59a7 [CostModel][AMDGPU] Fix instructions costs estimation for vector types.
1. Fixed an inconsistency in vector instruction cost estimation - removed the
   different logic for "not simple types", since it biases costs for these types.
2. Fixed the legalization penalty for vectors too big for the target: changed from
   overwriting the default legalization cost estimate to adding a penalty.
3. Fixed a few typos in tests.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114893
2021-12-03 03:08:08 +03:00
Philip Reames 740057d185 [funcattrs] Infer writeonly argument attribute
This change extends the current logic for inferring readonly and readnone argument attributes to also infer writeonly.
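
For illustration, an argument the extended inference can now mark writeonly:

  // The pointer argument is only ever stored through, never read.
  void init(int *p) { *p = 0; }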

This change is deliberately minimal; there's a couple of areas for follow up.
* I left out all call handling and thus any benefit from the SCC walk. When examining the test changes, I realized the existing code is imprecise, and am going to fix that in its own revision before adding in the writeonly handling. (Mostly because updating the tests is hard when I, the human, can't figure out whether the result is correct.)
* I left out handling for storing a value (as opposed to storing to a pointer). This should benefit readonly/readnone as well, and applies to a bunch of other instructions. Seemed worth having as a separate review.

Differential Revision: https://reviews.llvm.org/D114963
2021-12-02 13:04:09 -08:00
Florian Hahn 222442ec2d
[BasicAA] Add tests for strcat/strncat/strcpy. 2021-12-02 17:38:07 +00:00
Florian Hahn 639a78a4bf
[MemoryLocation] Support strncpy in getForArgument.
The size argument of strncpy can be used as a bound for the size of
its pointer arguments.

strncpy is guaranteed to write N bytes and reads up to N bytes.
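
A rough sketch of the idea (names such as Call, ArgIdx and AATags are assumed
for illustration; not the exact patch): a constant length argument gives a
precise size for the destination, which is written in full, and an upper bound
for the source.

  if (const auto *Len = dyn_cast<ConstantInt>(Call->getArgOperand(2))) {
    LocationSize Size = ArgIdx == 0
                            ? LocationSize::precise(Len->getZExtValue())
                            : LocationSize::upperBound(Len->getZExtValue());
    return MemoryLocation(Call->getArgOperand(ArgIdx), Size, AATags);
  }
  return MemoryLocation::getAfter(Call->getArgOperand(ArgIdx), AATags);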

Reviewed By: xbolva00

Differential Revision: https://reviews.llvm.org/D114871
2021-12-02 14:18:05 +00:00
Florian Hahn 9f9e8ba114
[MemoryLocation] Support memset_chk in getForArgument.
The size argument for memset_chk is an upper bound for the size of the
pointer argument. memset_chk may write less than the specified length,
if it exceeds the specified max size and aborts.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D114870
2021-12-02 13:45:58 +00:00
Florian Hahn 47616c8855
[BasicAA] Add tests for memset_pattern{4,8,16}.
This also removes the existing memset_pattern.ll test, which was relying
on GVN. It is also covered by the new test directly.
2021-12-02 11:50:32 +00:00
Florian Hahn 524ad6babb
[BasicAA] Add memset_chk libfunc tests. 2021-12-01 14:15:46 +00:00
Florian Hahn c6bd63803f
[BasicAA] Add strncpy libfunc tests. 2021-12-01 14:15:40 +00:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector address,
and then multiply it by the vector width. This doesn't make any sense:
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR; we perform scalar GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that the cost of vector address computation is `10`, as compared to `1` for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Nikita Popov 77dd579827 [SCEV] Remove incorrect assert
Fix assertion failure reported on D113349 by removing the assert.
While the produced expression should be equivalent, it may not
be strictly the same, e.g. due to lazy nowrap flag updates. Similar
to what the main createSCEV() code does, simply retain the old
value map entry if one already exists.
2021-11-29 17:09:12 +01:00
Roman Lebedev 7e73c2a66a
[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle
The mask on the shuffle is for the output, not the input.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114697
2021-11-29 18:37:07 +03:00
Roman Lebedev 5e96553608
[NFC][X86][LV][Costmodel] Add most basic test for masked interleaved load 2021-11-29 16:46:19 +03:00
Roman Lebedev cffe3a084f
[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()`
... to actually ask about i1-elt-wide mask, since that is what will probably be used on AVX512.
This unblocks D111460.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114316
2021-11-29 14:41:48 +03:00
Nikita Popov 2b160e95c8 Reland [SCEV] Fix and validate ValueExprMap/ExprValueMap consistency
Relative to the previous landing attempt, this introduces an additional
flag on forgetMemoizedResults() to not remove SCEVUnknown phis from
the value map. The invalidation after BECount calculation wants to
leave these alone and skips them in its own use-def walk, but we can
still end up invalidating them via forgetMemoizedResults() if there
is another IR value with the same SCEV. This is intended as a temporary
workaround only, and the need for this should go away once the
getBackedgeTakenInfo() invalidation is refactored in the spirit of
D114263.

-----

This adds validation for consistency of ValueExprMap and
ExprValueMap, and fixes identified issues:

* Addrec construction directly wrote to ValueExprMap in a few places,
  without updating ExprValueMap. Add a helper to ensure they stay
  consistent (see the sketch after this list). The adjustment in
  forgetSymbolicName() explicitly drops the old value from the map,
  so that we don't rely on it being overwritten.
* forgetMemoizedResultsImpl() was dropping the SCEV from
  ExprValueMap, but not dropping the corresponding entries from
  ValueExprMap.
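
A minimal sketch of such a helper (the name and exact shape are assumptions for
illustration; the in-tree version may differ): route every insertion through
one place so the forward and reverse maps cannot get out of sync.

  void ScalarEvolution::insertValueToMap(Value *V, const SCEV *S) {
    auto It = ValueExprMap.insert({SCEVCallbackVH(V, this), S});
    if (It.second)
      ExprValueMap[S].insert(V);
  }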

Differential Revision: https://reviews.llvm.org/D113349
2021-11-27 12:37:15 +01:00
Zarko Todorovski 7f7dac7126 [NFC][llvm] Inclusive language: reword uses of sanity test and check
Part of continuing work to use more inclusive language. Reworded uses
of sanity check and sanity test in llvm/test/
2021-11-25 07:21:42 -05:00
Graham Hunter dee810e117 [NFC][LAA] Precommit tests for forked pointers
Precommit for https://reviews.llvm.org/D108699
2021-11-24 16:20:35 +00:00
Peter Waller 787b66eb5f [LoopAccessAnalysis][SVE] Bail out for scalable vectors
The supplied test case, reduced from real world code, crashes with an
'Invalid size request on a scalable vector.' error.

Since it's similar in spirit to an existing LAA test, rename the file to
generalize it to both.

Differential Revision: https://reviews.llvm.org/D114155
2021-11-24 15:52:20 +00:00
Roman Lebedev cd8d219536
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ
I believe this effectively completes `X86TTIImpl::getReplicationShuffleCost()`
for AVX512, other than the question of handling plain AVX512F,
where we end up with some really ugly "shuffles",
but then, are there any CPUs that support AVX512 but not AVX512DQ/AVX512BW?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114315
2021-11-24 17:23:15 +03:00
Florian Mayer 6c06d8e310 [stack-safety] Check SCEV constraints at memory instructions.
Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D113160
2021-11-23 15:29:23 -08:00
Roman Lebedev 704d92607d
[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions
Apparently my methodology was suboptimal: not only did I miss all the +VL tuples,
I also missed some plain tuples. I believe this adds everything that was missing.
Indeed, these manual cost models are just not okay long-term.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114334
2021-11-22 14:31:34 +03:00
Roman Lebedev 8d09dd61c3
[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions
Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW,
these either sign-extend the mask register into a vector,
or pack the mask from a vector register.

Apparently, we didn't even have MCA tests for these,
added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05,
so I'm just guessing that their perf characteristics
are optimal.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114314
2021-11-22 14:31:34 +03:00
Sjoerd Meijer 4d21b64464 [BPI] Look-up tables for non-loop branches. NFC.
This adds and uses look-up tables for non-loop branch probabilities, which have
probabilities directly encoded into the tables for the different condition
codes. Compared to having this logic inlined in different functions, as used to
be the case, I think this is more compact and thus also easier to check and
cross-reference. This also adds a test for pointer heuristics that was missing.
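
Illustrative sketch only (the table type and the probability values shown are
assumptions, not taken from the patch): probabilities keyed by condition code
rather than hard-coded in each heuristic.

  using ProbabilityTable = std::map<CmpInst::Predicate, BranchProbability>;
  static const ProbabilityTable PointerTable = {
      {CmpInst::ICMP_NE, BranchProbability(20, 32)}, // p != q: likely taken
      {CmpInst::ICMP_EQ, BranchProbability(12, 32)}, // p == q: likely not taken
  };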

Differential Revision: https://reviews.llvm.org/D114009
2021-11-22 10:30:42 +00:00
Roman Lebedev df70cf5e14
[NFC][X86][Costmodel] Actually test +prefer-256-bit in replication-shuffle-related tests :(
While the -prefer-256-bit coverage indeed becomes complete with D114314,
the real-world coverage (the one with +prefer-256-bit) is lacking.

Hilarious.
2021-11-21 01:25:49 +03:00
Roman Lebedev da47a63e03
[NFC][X86][Costmodel] Add AVX512DQ runlines to trunc.ll/extend.ll 2021-11-20 13:55:13 +03:00
Roman Lebedev 049799c311
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI
If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles),
we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114071
2021-11-19 15:58:10 +03:00
Roman Lebedev a751084bb4
[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1`
Note that there are many other missing costs; I'm *only* adding the ones that are queried
from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114070
2021-11-19 15:57:32 +03:00
Roman Lebedev a50fdd3fc9
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW
Here we get pretty lucky. AVX512F does not provide any instructions
to convert between a `k` vector mask and a vector,
but AVX512BW adds `{k}<->nX{i8,i16}` conversions,
and, as it happens, with AVX512BW we have an i16 shuffle.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113915
2021-11-19 15:55:41 +03:00
Philip Reames ea12c2cb9c [SCEV] Move mustprogress based no-self-wrap logic so it applies to all exit conditions
This change moves logic which we'd added specifically for less than tests so that it applies to equalities and greater than tests as well. The basic idea is that if we can show an IV cycles infinitely through the same series on self-wrap, and that the exit condition must be taken to prevent UB, we can conclude that it must be taken before self-wrap and thus infer said flag.

The motivation here is simple loops with unsigned induction variables w/non-one steps and inequality tests. A toy example would be:
for (unsigned i = 0; i != N; i += 2) { body; }

If the body contains no side effects, and this is a mustprogress function, we can assume that this must be a finite loop and thus that the exit count is N/2.

Differential Revision: https://reviews.llvm.org/D103991
2021-11-18 10:07:44 -08:00
Philip Reames 100df68496 [SCEV] Add test coverage for invertible functions of IVs 2021-11-18 08:56:45 -08:00
Florian Hahn da9f2ba3b1
[SCEV] Reorder operands checks in collectConditions.
The initial two cases require a SCEVConstant as RHS. Pull up the condition
to check and swap SCEVConstants from below. Also remove a redundant
check & swap if RHS is SCEVUnknown.
2021-11-18 09:36:16 +00:00
Florian Hahn dd6281c4c1
[SCEV] Add additional guard tests with swapped condition ops. 2021-11-18 09:35:19 +00:00
Philip Reames 0623f52a46 Autogen a test for ease of update 2021-11-17 17:20:57 -08:00
Philip Reames ad69402f3e [SCEVAA] Avoid forming malformed pointer diff expressions
This solves the same crash as in D104503, but with a different approach.

The test case test_non_dom demonstrates a case where scev-aa crashes today. (If exercised either by -eval-aa or -licm.) The basic problem is that SCEV-AA expects to be able to compute a pointer difference between two SCEVs for any two pair of pointers we do an alias query on. For (valid, but out of scope) reasons, we can end up asking whether expressions in different sub-loops can alias each other. This results in a subtraction expression being formed where neither operand dominates the other.

The approach this patch takes is to leverage the "defining scope" notion we introduced for flag semantics to detect and disallow the formation of the problematic SCEV. This ends up being relatively straightforward on that new infrastructure. This change does hint that we should probably be verifying a similar property for all SCEVs somewhere, but I'll leave that to a follow-on change.

Differential Revision: D114112
2021-11-17 12:38:04 -08:00
Florian Hahn e8b55cf7b7
[SCEV] Apply loop guards when computing max BTC for arbitrary steps.
As in other cases in the current function (e.g. when the step is 1 or
-1), applying loop guards can lead to tighter upper bounds for the
backedge-taken counts.

Fixes PR52464.

Reviewed By: reames, nikic

Differential Revision: https://reviews.llvm.org/D113578
2021-11-17 11:00:49 +00:00
Roman Lebedev 496ccb543e
[NFC][X86][Costmodel] Improve test coverage for i32->i64 vector *ext 2021-11-17 12:02:50 +03:00
Roman Lebedev 2037ec725f
[X86][Costmodel] `*ext v64i1 to v32i16` can appear after legalization, cost is same as for `*ext v32i1 to v32i16`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113914
2021-11-17 12:02:50 +03:00
Roman Lebedev 23b194bf18
[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113913
2021-11-17 12:02:50 +03:00
Philip Reames 8d85e945b2 [SCEV] Canonicalize X - urem X, Y patterns
There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing, and thus, depending on which you analyzed, you could get different results. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.

The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert to the andn form which SCEV recognizes, so essentially, this is just fixing a nasty pass ordering dependency.
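
For reference, the equivalent forms involved, written out in C (all three
compute the same value when Y is a non-zero power of two):

  unsigned rounded_sub(unsigned X, unsigned Y) { return X - X % Y; }         // sub form
  unsigned rounded_and(unsigned X, unsigned Y) { return X - (X & (Y - 1)); } // unroller output
  unsigned rounded_mul(unsigned X, unsigned Y) { return X / Y * Y; }         // canonical Y * X/Y form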

The ARM loop hardware interaction in the test diff is opaque to me, but the comments in the review from others with knowledge of the infrastructure appear to indicate these are improvements in loop recognition, not regressions.

Differential Revision: https://reviews.llvm.org/D114018
2021-11-16 11:59:21 -08:00
Philip Reames 3dd6d5b628 [tests] Add coverage for different forms of X - urem X, Y 2021-11-16 09:26:34 -08:00
Philip Reames 56ae2cfecf autogen a SCEV test file 2021-11-16 09:26:34 -08:00
Florian Hahn b7aec4f08e
[SCEV] Support rewriting ZExt expressions with loop guard info.
So far, applying loop guard information has been restricted to
SCEVUnknown. In a few cases, like PR40961 and PR52464, this leads to
SCEV failing to determine tight upper bounds for the backedge taken
count.

This patch adjusts SCEVLoopGuardRewriter and applyLoopGuards to support
rewriting ZExt expressions.

This is a first step towards fixing PR40961 and PR52464.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D113577
2021-11-16 11:16:07 +00:00
David Green 309f1e4ac8 [ARM] Add datalayout to costmodel tests. NFC
This adds a sensible datalayout to the ARM cost model tests, to prevent
the costs reported being incorrect for the size of pointers.
2021-11-16 09:49:42 +00:00
Roman Lebedev 7114c60e8f
[NFC][X86][Costmodel] Improve test coverage for {i8,i16,i32,i64}->i1 vector trunc 2021-11-15 20:46:48 +03:00
Roman Lebedev 949103dc36
[NFC][X86][Costmodel] Improve test coverage for i1->{i8,i16,i32,i64} vector *ext 2021-11-15 20:46:48 +03:00
Roman Lebedev bc35d5fe2f
[NFC][X86][Costmodel] Add i1 replication shuffle costmodel test coverage 2021-11-15 20:02:52 +03:00
Roman Lebedev 5c7255fe3a
[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI
Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type,
so this change does affect vectorization. In the end, it will ask about i1.

We should also try to promote to i16 if we have AVX512BW; I'll do that in a follow-up.
All costs here look good; I've added the missing truncation costs in preparatory patches.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113853
2021-11-15 19:04:02 +03:00
Roman Lebedev a468c39c90
[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8`
Some of the costs get larger here,
but I suppose that makes sense, since we'd previously query
scalarization costs that may not really be representative of reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev 9e57d9b09d
[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that I'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113851
2021-11-15 19:04:02 +03:00
Roman Lebedev 0116c708c6
[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that I'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113850
2021-11-15 19:04:02 +03:00
Roman Lebedev f86b57e37c
[NFC][X86][Costmodel] Improve test coverage for {i16,i32,i64}->i8 vector trunc 2021-11-14 20:25:40 +03:00
Roman Lebedev f0da329f93
[NFC][X86][Costmodel] Improve test coverage for i8->{i16,i32,i64} vector *ext 2021-11-14 20:25:33 +03:00
Roman Lebedev 4dd2f0446c
[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW
The basic idea is simple: if we don't have a native shuffle for this element type,
then we must have a native shuffle for a wider element type,
so promote, replicate, demote.

I believe asking `getCastInstrCost(Instruction::Trunc, ...)` is semantically correct;
case in point, `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in
2 * XMM, which will then be packed into 1 * YMM,
and it should count the cost of said packing,
not just the truncations.
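
A sketch of the resulting cost computation (type and variable names here are
assumptions for illustration, not the exact code): promote, do the replication
shuffle at the wider element type, then demote back.

  InstructionCost Cost =
      getCastInstrCost(Instruction::ZExt, WideSrcVecTy, SrcVecTy,
                       TTI::CastContextHint::None, CostKind) +
      getReplicationShuffleCost(WideEltTy, ReplicationFactor, VF,
                                DemandedDstElts, CostKind) +
      getCastInstrCost(Instruction::Trunc, DstVecTy, WideDstVecTy,
                       TTI::CastContextHint::None, CostKind);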

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113609
2021-11-14 20:01:38 +03:00
Roman Lebedev b283961012
[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16`
Same as D113842, but for i64

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113843
2021-11-14 18:41:38 +03:00
Roman Lebedev a5f2fdca99
[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16`
This was noticed in D113609, hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Roman Lebedev 17a3df87ff
[NFC][X86][Costmodel] Improve test coverage for {i32,i64}->i16 vector *ext
See https://reviews.llvm.org/D113609 - some of these costs seem wrong.
2021-11-14 16:07:30 +03:00
Roman Lebedev fd24446ba5
[NFC][X86][Costmodel] Improve test coverage for i16->{i32,i64} vector *ext 2021-11-14 16:06:45 +03:00
Florian Hahn 69c1cbe20f
[SCEV] Add test case where applying zext info pessimizes BTC.
Add an additional test case for D113578.
2021-11-12 12:19:35 +00:00
Florian Hahn 5dfe60d171
[SCEV] Add tests where guards limit both %n and (zext %n).
Suggested in D113577.
2021-11-12 10:31:35 +00:00
Florian Hahn 8d2a1994c8
[AArch64] Add some fp16 cast cost-model tests.
This adds initial tests for cost-modeling {u,s}itofp for fp16 vectors.
At the moment, they are under-estimated in a couple of cases.
2021-11-11 18:21:44 +00:00
Roman Lebedev a70d74323e
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI
VBMI introduced VPERMB, so cost-model i8 replication shuffle using it.
Note that we can still model i8 replication shuffle without VBMI,
by promoting to i16/i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113479
2021-11-10 22:52:40 +03:00
Roman Lebedev c6e894b9b2
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW
BWI introduced VPERMW, so cost-model i16 replication shuffle using it.
Note that we can still model i16 replication shuffle without BWI,
by promoting to i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113478
2021-11-10 22:52:39 +03:00
Roman Lebedev 4101c7bf19
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F
This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`,
which take a single input vector and a single index vector,
and are cross-lane. So far I haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.

This results in *shockingly* lower costs :)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113350
2021-11-10 22:52:33 +03:00
Florian Hahn fcf2ae9923
[SCEV] Add tests that require rewriting zexts when applying guards.
Precommit tests inspired by PR40961 and PR52464.
2021-11-10 16:58:27 +00:00
Roman Lebedev 3bdf738d1b
[NFC][X86][Costmodel] Add i16 replication shuffle costmodel test coverage 2021-11-09 14:19:44 +03:00
Nikita Popov a8c318b50e [BasicAA] Use index size instead of pointer size
When accumulating the GEP offset in BasicAA, we should use the
pointer index size rather than the pointer size.
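
A minimal sketch of the change's shape (variable names assumed):

  // Ask DataLayout for the index width of the pointer's address space and
  // accumulate GEP offsets at that width, not at the pointer width.
  unsigned IndexSize = DL.getIndexSizeInBits(PtrTy->getPointerAddressSpace());
  APInt Offset(IndexSize, 0);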

Differential Revision: https://reviews.llvm.org/D112370
2021-11-07 18:56:11 +01:00
Roman Lebedev 23566f18c6
[NFC][X86][Costmodel] Add tests for i32/i64 replication shuffles
While this isn't what we eventually need (i8 or i1),
approaching from this end is more straightforward.
2021-11-06 17:14:56 +03:00
Roman Lebedev a30ec4778a
[TTI][CostModel] `getUserCost()`: recognize replication shuffles and query their cost
This finally creates proper test coverage for replication shuffles,
which are used by LV for conditional loads, and will allow adding a
proper cost model, at least for AVX512.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113324
2021-11-06 16:45:15 +03:00
Philip Reames d24a0e8857 [SCEV] Use constant range of RHS to prove NUW on narrow IV in trip count logic
The basic idea here is that given a zero extended narrow IV, we can prove the inner IV to be NUW if we can prove there's a value the inner IV must take before overflow which must exit the loop.

Differential Revision: https://reviews.llvm.org/D109457
2021-11-05 15:36:47 -07:00
Roman Lebedev 04fa7cbf55
[NFC][CostModel] Add exhaustive test coverage for replication shuffles
This coverage has been brought to you by https://godbolt.org/z/nfc3cY1za
2021-11-06 00:53:28 +03:00
Roman Lebedev ad617183bb
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1
Even though AVX512's masked mem ops (unlike AVX1/2) have a mask
that is a `VF x i1`, replication of said masks happens after
promotion of it to `VF x i8`, so we should use `i8`, not `i1`,
when calculating the cost of mask replication.
2021-11-05 17:27:02 +03:00
Roman Lebedev 34b903d8b0
[NFC] Add forgotten `REQUIRES: asserts` into the new costmodel test 2021-11-03 19:40:23 +03:00
Roman Lebedev c65e2ac405
[NFC] Rewrite runlines in interleaved-store-accesses-with-gaps.ll once again
https://lab.llvm.org/buildbot/#/builders/98/builds/8198 is still failing,
and I really don't understand how runlines in this test differ
from the ones in other nearby tests...
2021-11-03 19:15:33 +03:00
Roman Lebedev df93c8a919
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask
I don't really buy that masked interleaved memory loads/stores are supported on X86.
There is zero costmodel test coverage, no actual cost modelling for the generation
of the mask repetition, and basically only two LV tests.
Additionally, I'm not very interested in AVX512.

I don't know if this really helps the "soft" block over at
https://reviews.llvm.org/D111460#inline-1075467,
but I think it can't make things worse, at least.

When we are being told that there is a masking, instead of
completely giving up and falling back to
fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`,
let's correctly query the cost of masked memory ops,
keep all the pretty shuffle cost modelling,
but scalarize the cost computation for the mask replication.

I think not scalarizing the shuffles themselves
may adjust the computed costs a bit,
maybe hopefully just enough to hide the "regressions"
at https://reviews.llvm.org/D111460#inline-1075467.
I do mean hide, because the test coverage is non-existent.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112873
2021-11-03 18:14:35 +03:00
Roman Lebedev f3d1ddfe71
[NFC] Use single-dash-prefixed options in newly-added test
https://lab.llvm.org/buildbot/#/builders/98/builds/8195 complains,
and this is the only guess I have.
2021-11-03 18:12:40 +03:00
Roman Lebedev a4b64f7727
[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used
As can be seen in `InnerLoopVectorizer::vectorizeInterleaveGroup()`,
in some cases (reported by `UseMaskForGaps`), the gaps in the interleaved load/store group
will be masked away by another constant mask, so there is no need to
account for the cost of replication of the mask for these.

Differential Revision: https://reviews.llvm.org/D112877
2021-11-03 17:33:28 +03:00
Roman Lebedev c6b3da1d66
[NFC][X86] Duplicate LV test into a costmodel test
Copied from llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
As discussed in D111460 / D112877 / D112873 we have basically no test coverage
for this part of cost model.
2021-11-03 17:25:18 +03:00
Nikita Popov 51e9f33603 [BasicAA] Use saturating multiply on range if nsw
If we know that the var * scale multiplication is nsw, we can use
a saturating multiplication on the range (as a good approximation
of an nsw multiply). This recovers some cases where the fix from
D112611 is unnecessarily strict. (This can be further strengthened
by using a saturating add, but we currently don't track all the
necessary information for that.)

This exposes an issue in our NSW tracking for multiplies. The code
was assuming that (X +nsw Y) *nsw Z results in
(X *nsw Z) +nsw (Y *nsw Z) -- however, it is possible that the
distributed multiplications overflow, even if the non-distributed
one does not. We should discard the nsw flag if the offset is
non-zero. If we just have (X *nsw Y) *nsw Z then concluding
X *nsw (Y *nsw Z) is fine.
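
A concrete i8 illustration of why distributing can lose nsw:

  // X = 100, Y = -50, Z = 2
  // (X + Y) * Z  =   50 * 2 = 100   -- fits in i8, both ops can be nsw
  //  X * Z       =  100 * 2 = 200   -- overflows i8, so the distributed X*Z is not nsw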

Differential Revision: https://reviews.llvm.org/D112848
2021-11-02 20:27:39 +01:00
Arthur Eubanks 029f1a5344 [LazyCallGraph] Skip blockaddresses
Blockaddresses do not participate in the call graph, since the only
instructions that use them must all return to someplace within the
current function. And passes cannot retrieve a function address from a
blockaddress.

This was suggested by efriedma in D58260.

Fixes PR50881.

Reviewed By: nickdesaulniers

Differential Revision: https://reviews.llvm.org/D112178
2021-11-01 13:10:24 -07:00
Nikita Popov 7cf7378a9d [BasicAA] Don't treat non-inbounds GEP as nsw
The scale multiplication is only guaranteed to be nsw if the GEP
is inbounds (or the multiplication is trivial). Previously we were
only considering explicit muls in GEP indices.
2021-10-29 22:30:44 +02:00
Nikita Popov 4dd540d9c8 [BasicAA] Add missing inbounds to tests (NFC)
Add missing inbounds to tests that are not correct without it due
to possibility of offset overflow.

inbounds: https://alive2.llvm.org/ce/z/LC8G9_
w/o inbounds: https://alive2.llvm.org/ce/z/ErrJVW
2021-10-29 19:05:39 +02:00
Nikita Popov 36b22f7845 [BasicAA] Add range test with nsw (NFC) 2021-10-29 18:00:25 +02:00
Nikita Popov fbc0c308d5 [BasicAA] Handle known bits as ranges
BasicAA currently tries to determine that the offset is positive by
checking whether all variable indices are positive based on known
bits, multiplied by a positive scale. However, this is incorrect
if the scale multiplication might overflow. In the modified test
case the original value is positive, but may be negative after a
left shift.

Fix this by converting known bits into a constant range and reusing
the range-based logic, which handles overflow correctly.
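
Sketch of the conversion step (shape assumed, not the exact code):

  // Derive a signed range from the known bits and hand it to the existing
  // range-based logic, which models overflow correctly.
  ConstantRange Range = ConstantRange::fromKnownBits(Known, /*IsSigned=*/true);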

Differential Revision: https://reviews.llvm.org/D112611
2021-10-27 14:41:31 +02:00
Nikita Popov 9bc7e543b4 [BasicAA] Make range check more precise
Make the range check more precise by calculating the range of
potentially accessed bytes for both accesses and checking whether
their intersection is empty. In that case there can be no overlap
between the accesses and the result is NoAlias.

This is more powerful than the previous approach, because it can
deal with sign-wrapped ranges. In the test case the original range
is [-1, INT_MAX] but becomes [0, INT_MIN] after applying the offset.
This is a wrapping range, so getSignedMin/getSignedMax will treat
it as a full range. However, the range excludes the elements
[INT_MIN+1, -1], which is enough to prove NoAlias with an access
at offset -1.
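
The core idea as a sketch (variable names assumed): compute the set of bytes
each access may touch and check whether the two sets can intersect at all.

  ConstantRange Bytes1 =
      OffsetRange1.add(ConstantRange(APInt(BW, 0), APInt(BW, Size1)));
  ConstantRange Bytes2 =
      OffsetRange2.add(ConstantRange(APInt(BW, 0), APInt(BW, Size2)));
  if (Bytes1.intersectWith(Bytes2).isEmptySet())
    return AliasResult::NoAlias;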

Differential Revision: https://reviews.llvm.org/D112486
2021-10-27 12:40:58 +02:00
Roman Lebedev db848fbf67
[NFC][LV][X86] Improve test coverage for masked mem ops 2021-10-27 13:36:04 +03:00
David Green 74b2a4edcc [AArch64] Add a costmodel test for overflowing arithmatic. NFC 2021-10-26 10:35:12 +01:00
Nikita Popov 721569cc36 [BasicAA] Add test for benign range overflow (NFC) 2021-10-25 22:09:39 +02:00
Nikita Popov 7e97347409 [BasicAA] Add test for incorrect non-negative logic (NFC) 2021-10-25 18:02:41 +02:00
Nikita Popov 0d20ebf686 [BasicAA] Use ranges for more than one index
D109746 made BasicAA use range information to determine the
minimum/maximum GEP offset. However, it was limited to the case of
a single variable index. This patch extends support to multiple
indices by adding all the ranges together.

Differential Revision: https://reviews.llvm.org/D112378
2021-10-25 15:30:50 +02:00
Nikita Popov 2ae67c9684 [BasicAA] Add range test with multiple indices (NFC) 2021-10-24 16:13:03 +02:00
Nikita Popov 61cfdf636d [BasicAA] Model implicit trunc of GEP indices
GEP indices larger than the GEP index size are implicitly truncated
to the index size. BasicAA currently doesn't model this, resulting
in incorrect alias analysis results.

Fix this by explicitly modelling truncation in CastedValue in the
same way we do zext and sext. Additionally we need to disable a
number of optimizations for truncated values, in particular
"non-zero" and "non-equal" may no longer hold after truncation.
I believe the constant offset heuristic is also not necessarily
correct for truncated values, but wasn't able to come up with a
test for that one.

A possible followup here would be to use the new mechanism to
model explicit trunc as well (which should be much more common,
as it is the canonical form). This is straightforward, but omitted
here to separate the correctness fix from the analysis improvement.

(Side note: While I say "index size" above, BasicAA currently uses
the pointer size instead. Something for another day...)

Differential Revision: https://reviews.llvm.org/D110977
2021-10-22 23:47:02 +02:00
Roman Lebedev 8fac9e95ad
[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does is naively scale the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may be an over- or under-estimate.
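
The scaling itself, as a sketch (parameter names assumed):

  // Scale the cost of the fully-interleaved group by the fraction of members
  // that are actually demanded (live).
  InstructionCost scaleByLiveMembers(InstructionCost FullGroupCost,
                                     unsigned NumDemandedMembers,
                                     unsigned Factor) {
    return FullGroupCost * NumDemandedMembers / Factor;
  }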

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307
2021-10-22 16:33:58 +03:00
David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.
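
A sketch of that initialization rule (the getVScaleForCPU helper is
hypothetical, named here only for illustration):

  StringRef EffectiveTuneCPU = !TuneCPU.empty() ? TuneCPU : CPU;
  VScaleForTuning = (EffectiveTuneCPU.empty() || EffectiveTuneCPU == "generic")
                        ? 2
                        : getVScaleForCPU(EffectiveTuneCPU);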

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Simon Pilgrim 5b395bd633 [CostModel][X86] Add costs for multiply-by-pow2 constants
These are folded to left shifts in the backend.
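
For illustration, the kind of folding being costed:

  unsigned mul_by_8(unsigned x) { return x * 8; }  // the backend emits x << 3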

We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436
2021-10-20 13:11:21 +01:00
David Green 862e8d7e55 [AArch64] Improve div and rem costmodel tests. NFC
Copied from the X86 tests, these give a better test coveraged than the
existing tests.
2021-10-20 09:58:35 +01:00
Bjorn Pettersson 08619006a0 [SCEV] Avoid compile time explosion in ScalarEvolution::isImpliedCond
As seen in PR51869 the ScalarEvolution::isImpliedCond function might
end up spending lots of time when doing the isKnownPredicate checks.

Calling isKnownPredicate, for example, results in isKnownViaInduction
being called, which might result in isLoopBackedgeGuardedByCond being
called, and then we might get one or more new calls to isImpliedCond.
Even if the scenario described here isn't an infinite loop, using
some randomly generated C programs as input indicates that those
isKnownPredicate checks quite often return true. On the other hand,
the third condition that needs to be fulfilled in order to "prove
implications via truncation", i.e. the isImpliedCondBalancedTypes
check, is rarely fulfilled.
I also made some similar experiments to look at how often we would
get the same result when using isKnownViaNonRecursiveReasoning instead
of isKnownPredicate. So far I haven't seen a single case when codegen
is negatively impacted by using isKnownViaNonRecursiveReasoning. On
the other hand, it seems like we get rid of the compile time explosion
seen in PR51869 that way. Hence this patch.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D112080
2021-10-19 21:37:57 +02:00
Kirill Stoimenov 62627c7217 [Sanitizers] Replaced getMaxPointerSizeInBits with getPointerSizeInBits, which was causing failures for 32bit x86.
Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D111829
2021-10-18 09:31:14 -07:00
Simon Pilgrim f041338153 [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:46:10 +01:00
Simon Pilgrim c850d5c5c8 [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:15:14 +01:00
Simon Pilgrim dc3382dc2c [CostModel][X86] Add mul by positive/negative power-of-2 constants tests
We have backend optimizations for these, but currently the costmodel doesn't match them
2021-10-17 20:34:17 +01:00
Simon Pilgrim dbf5dc8930 [CostModel][X86] Add div/rem by negative power-of-2 constants
We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them
2021-10-17 18:51:15 +01:00
Roman Lebedev 91373bf12e
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So could pick cost of `40`

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick cost of `40`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
2021-10-17 17:28:10 +03:00
Roman Lebedev 3274ce3a28
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
2021-10-17 17:28:10 +03:00
Roman Lebedev 3a6a9f74d3
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So could pick cost of `68`

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
2021-10-17 17:28:09 +03:00
Roman Lebedev 4b76a74b42
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
2021-10-17 17:28:09 +03:00
Roman Lebedev 887acf6842
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So could pick cost of `212`

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
2021-10-17 17:28:09 +03:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`
and tells LV to try to vectorize the addresses that lead to loads.
But X86 generally cannot deal with vectors of addresses;
the only instructions that support that are GATHER/SCATTER,
and even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.
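
A sketch of what such an override can look like (condition spelled out per the
description above; the exact in-tree form may differ):

  bool X86TTIImpl::prefersVectorizedAddressing() const {
    return ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather());
  }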

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While I've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of the last ~70 patches (wow),
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus -
it is usually almost an order of magnitude overestimation -
so I would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then I'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Nikita Popov 5f05ff081f [BasicAA] Improve scalable vector handling
Currently, DecomposeGEP() bails out on the whole decomposition if
it encounters a scalable GEP type anywhere. However, it is fine to
still analyze other GEPs that we look through before hitting the
scalable GEP. This does mean that the decomposed GEP base is no
longer required to be the same as the underlying object. However,
I don't believe this property is necessary for correctness anymore.

This allows us to compute slightly more precise aliasing results
for GEP chains containing scalable vectors, though my primary
interest here is simplifying the code.

Differential Revision: https://reviews.llvm.org/D110511
2021-10-14 20:23:50 +02:00
Simon Pilgrim 77dcdc2f50 [CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32
Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))
2021-10-14 12:17:40 +01:00
Roman Lebedev cb41efb5f4
[NFC][Costmodel][X86] Fix broken `CHECK-NOT`'s in interleave costmodel tests 2021-10-13 22:44:57 +03:00
Roman Lebedev 18eef13dad
[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()`
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of a sequence of `insertelement`/`extractelement`.
  If we are operating not on XMM (but on YMM/ZMM),
  this widely overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
Florian Hahn 4cd6cc64ed [SCEV] Add test for propagating poison through select condition.
Precommit a test for D111643.
2021-10-13 17:14:35 +01:00
Arthur Eubanks 259390de9a [LCG] Don't skip invalidation of LazyCallGraph if CFG analyses are preserved
The CFG being changed and the overall call graph are not related; we can introduce/remove calls without changing the CFG.

Resolves one of the issues in PR51946.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D111275
2021-10-11 13:30:47 -07:00
Clement Courbet 342d7b654c [BasicAA][NFC] Improve comment. 2021-10-11 10:42:59 +02:00
Clement Courbet 83ded5d323 re-land "[AA] Teach BasicAA to recognize basic GEP range information."
Now that PR52104 is fixed.
2021-10-11 10:04:22 +02:00
David Green adec922361 [AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.

The idea is that in-order CPUs require the most help in instruction
scheduling, whereas out-of-order CPUs can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.

On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.

When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.

A lot of existing tests have been updated. This is a summary of the important
differences:
 - Most changes are the same instructions in a different order.
 - Sometimes this leads to very minor inefficiencies, such as requiring
   an extra mov to move variables into r0/v0 for the return value of a test
   function.
 - misched-fusion.ll was no longer fusing the pairs of instructions it
   should, as per D110561. I've changed the schedule used in the test
   for now.
 - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
   the different latencies. This seems fine to me.
 - Some SVE tests do not always remove movprfx where they did before due
   to different register allocation giving different destructive forms.
 - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
   produce two LDR where they previously produced an LDP due to
   store-pair-suppress kicking in.
 - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
 - Some tests such as arm64-neon-mul-div.ll and
   ragreedy-local-interval-cost.ll have more, less or just different
   spilling.
 - In aarch64_generated_funcs.ll.generated.expected one part of the
   function is no longer outlined. Interestingly, if I switch this to use
   any other schedule, even less is outlined.

Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.

Differential Revision: https://reviews.llvm.org/D110830
2021-10-09 15:58:31 +01:00
Simon Pilgrim b6426d5211 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT/UGT for generic abs/min/max cost expansion
Split off ABS cost handling from MIN/MAX and use explicit predicates for each

Our generic expansion of ABS doesn't use NEG+CMP+SELECT any more (it's now ASHR+ADD+XOR), so this needs to be updated.
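
The ASHR+ADD+XOR expansion referenced above, written out for a 32-bit int:

  int abs_expanded(int x) {
    int m = x >> 31;      // ashr: all ones if x is negative, zero otherwise
    return (x + m) ^ m;   // add then xor negates x when m is all ones
  }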
2021-10-08 12:41:58 +01:00
Simon Pilgrim 716883736b [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT for generic sadd/ssub sat cost expansion
The comparison always checks for negative values, so we know the icmp predicate will be ICMP_SGT
2021-10-07 15:42:45 +01:00
Philip Reames 1183d65b4d [SCEV] Search operand tree for scope bound when inferring flags from IR
When checking to see if we can apply IR flags to a SCEV, we need to identify a bound on the defining scope of the SCEV to be produced.  We'd previously added support for a couple SCEVExpr types which trivially imply bounds, but hadn't handled types such as umax where the bounds come from the bounds of the operands.  This does the obvious thing, and recurses through operands searching for a tighter bound on the defining scope.

I'm honestly surprised by how little this seems to matter on existing tests, but it's worth doing for completeness' sake alone.

Differential Revision: https://reviews.llvm.org/D111191
2021-10-06 15:10:02 -07:00
Philip Reames 2b3d913cc5 [tests] precommit test changes for D111191 2021-10-06 12:12:49 -07:00
Philip Reames 67896f494e Returning poison from a function w/ noundef return attribute is UB
This does for returns within said function what we already do on the caller side when reasoning about what might be poison.

Differential Revision: https://reviews.llvm.org/D111180
2021-10-06 11:52:18 -07:00
Philip Reames 0658bab870 [SCEV] Infer flags from add/gep in any block
This patch removes a compile time restriction from isSCEVExprNeverPoison. We've strengthened our ability to reason about flags on scopes other than addrecs, and this bailout prevents us from using it. The comment is also suspect as well in that we're in the middle of constructing a SCEV for I. As such, we're going to visit all operands *anyways*.

Differential Revision: https://reviews.llvm.org/D111186
2021-10-06 11:11:54 -07:00
Simon Pilgrim 2ced9a42be [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion
Match the predicate used in TargetLowering::expandMULO to detect overflow
2021-10-06 19:11:33 +01:00
Simon Pilgrim 7bd097fd1e [CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion
Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detect signed values.

The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply.

More closely matches the default expansion from TargetLowering::expandMULO
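
The shape of the generic unsigned-multiply-overflow check this models, as a
sketch (assuming 32-bit unsigned): widen, multiply, shift the high half down,
and compare it against zero - the ICMP_NE mentioned above.

  bool umul32_overflows(unsigned a, unsigned b, unsigned *res) {
    unsigned long long wide = (unsigned long long)a * b; // multiply in the extended type
    *res = (unsigned)wide;
    return (wide >> 32) != 0;                            // lshr of the hi bits, then compare != 0
  }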
2021-10-06 19:11:32 +01:00
Simon Pilgrim 81b5da8c97 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion
Match the predicates used in TargetLowering::expandUADDSUBO
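
A minimal sketch of those predicates (plain C++ on unsigned 32-bit values
as an assumption; function names invented): the add overflowed iff the sum
is unsigned-less-than an operand, and the sub overflowed iff the RHS is
unsigned-greater-than the LHS.

  // Unsigned add/sub overflow checks in the ICMP_ULT / ICMP_UGT shape.
  #include <cassert>
  #include <cstdint>

  bool uaddOverflows(uint32_t A, uint32_t B) {
    return A + B < A; // sum ULT lhs => the addition wrapped
  }

  bool usubOverflows(uint32_t A, uint32_t B) {
    return B > A;     // rhs UGT lhs => the subtraction would wrap
  }

  int main() {
    assert(uaddOverflows(0xFFFFFFFFu, 1u));
    assert(!uaddOverflows(40u, 2u));
    assert(usubOverflows(1u, 2u));
    assert(!usubOverflows(2u, 1u));
    return 0;
  }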
2021-10-06 19:11:32 +01:00
Nikita Popov 1301a8b473 [BasicAA] Don't unnecessarily extend pointer size
BasicAA GEP decomposition currently performs all calculations on the
maximum pointer size, but at least 64-bit, with an option to double
the size. The code comment claims that this improves analysis power
when working with uint64_t indices on 32-bit systems. However, I don't
see how this can be, at least while maintaining correctness:

When working on canonical code, the GEP indices will have GEP index
size. If the original code worked on uint64_t with a 32-bit size_t,
then there will be truncs inserted before use as a GEP index. Linear
expression decomposition does not look through truncs, so this will
be an opaque value as far as GEP decomposition is concerned. Working
on a wider pointer size does not help here (or have any effect at all).
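
As a hypothetical source-level illustration (not taken from the patch or
its tests; names invented): with a uint64_t index on a 32-bit target,
canonicalization inserts a trunc to the 32-bit index width before the GEP,
and the decomposition stops at that trunc.

  // Hypothetical example: on an ILP32 target the 64-bit index below is
  // narrowed to the 32-bit GEP index width before being used, and BasicAA's
  // linear expression decomposition does not look through that trunc.
  #include <cstdint>

  char Buf[64];

  char *elementAt(uint64_t I) {
    return &Buf[I]; // index effectively truncated to the pointer index size
  }

  int main() {
    return *elementAt(3); // Buf is zero-initialized, so this returns 0
  }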

When working on non-canonical code (before first InstCombine), the
GEP indices are implicitly truncated to GEP index size. The BasicAA
code currently just ignores this fact completely, and pretends that
this truncation doesn't happen. This is incorrect and will be
addressed by D110977.

I believe that for correctness reasons, it is important to work on
the actual GEP index size to properly model potential overflow.
BasicAA tries to patch over the fact that it uses the wrong size
(see adjustToPointerSize), but it only does that in limited cases
(only for constant values, and not all of them either). I'd like to
move this code towards always working on the correct size, and
dropping these artificial pointer size adjustments is the first step
towards that.

Differential Revision: https://reviews.llvm.org/D110657
2021-10-06 18:40:21 +02:00
Simon Pilgrim 3dda247e18 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_EQ for generic funnel shift cost expansion
The comparison always checks for a zero value, so we know the icmp predicate will be ICMP_EQ.
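
A minimal sketch of the generic expansion's shape (plain C++, assuming a
32-bit fshl; the function name is invented): the shift amount is reduced
modulo the bit width, and the compare against zero that guards the
otherwise out-of-range complementary shift is the ICMP_EQ in question.

  // Generic fshl on i32: concatenate Hi:Lo, shift left by Amt (mod 32), and
  // keep the top 32 bits. Amt == 0 must be special-cased (the ICMP_EQ plus
  // select) because Lo >> (32 - 0) would be an out-of-range shift.
  #include <cassert>
  #include <cstdint>

  uint32_t fshl32(uint32_t Hi, uint32_t Lo, uint32_t Shamt) {
    uint32_t Amt = Shamt & 31u;              // shift amount modulo bit width
    if (Amt == 0)                            // the compare against zero
      return Hi;
    return (Hi << Amt) | (Lo >> (32 - Amt)); // SHL + LSHR + OR
  }

  int main() {
    assert(fshl32(0x000000FFu, 0x80000000u, 8) == 0x0000FF80u);
    assert(fshl32(0x12345678u, 0x9ABCDEF0u, 0) == 0x12345678u);
    assert(fshl32(0xDEADBEEFu, 0xDEADBEEFu, 4) == 0xEADBEEFDu); // rotl by 4
    return 0;
  }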
2021-10-06 16:39:16 +01:00
Clement Courbet ff41fc07b1 Revert "[AA] Teach BasicAA to recognize basic GEP range information."
We have found a miscompile with this change, reverting while working on a
reproducer.

This reverts commit 455b60ccfb.
2021-10-06 16:49:10 +02:00
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Philip Reames e64ed3c8df [test] autogen a couple of additional tests 2021-10-05 18:58:08 -07:00
Philip Reames c59c32caa0 [test] factor out reliance on noundef return value 2021-10-05 14:45:48 -07:00
Philip Reames 5020e104a1 [test] rework recently added SCEV tests
These are meant to check a future patch which recurses through the operands of SCEVs, but because all SCEVs are trivially bounded by function entry, we need to arrange for the trivial scope not to be valid (i.e. we specifically need a lower defining scope).
2021-10-05 14:42:53 -07:00
Philip Reames 94c1c56cc5 [tests] Cover cases we could infer SCEV flags, but don't 2021-10-05 13:16:16 -07:00
Roman Lebedev f92961d238
[NFC] Fixup newly-added costmodel tests to actually test what they should 2021-10-05 21:35:47 +03:00
Roman Lebedev 200edc152b
[NFC][X86][LV] Add basic costmodel test coverage for not-fully-interleaved i32 loads
The coverage could explode combinatorially here,
so I'm adding only the most basic cases,
and hoping it's enough, though more can be added if needed.
2021-10-05 19:39:50 +03:00
Roman Lebedev 3f9b235482
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `36`.

For store we have:
https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `30`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111094
2021-10-05 16:58:58 +03:00
Roman Lebedev e2784c5d8c
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `18`.

For store we have:
https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `15`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111093
2021-10-05 16:58:58 +03:00
Roman Lebedev 3960693048
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111092
2021-10-05 16:58:58 +03:00
Roman Lebedev 79d6d12d95
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs
This one required quite a bit of assembly surgery, but I think it's in the right ballpark.

The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

For store we have:
https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5`
So we could pick cost of `66`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111091
2021-10-05 16:58:58 +03:00
Roman Lebedev 2996a2b50f
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0`
So we could pick cost of `31`.

For store we have:
https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8`
So we could pick cost of `33`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111089
2021-10-05 16:58:57 +03:00
Roman Lebedev d51532d8aa
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8`
So we could pick cost of `15`.

For store we have:
https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0`
So we could pick cost of `12`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111087
2021-10-05 16:58:57 +03:00
Roman Lebedev 764fd5f463
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0`
So we could pick cost of `9`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111083
2021-10-05 16:58:57 +03:00
Roman Lebedev c800119c46
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick cost of `20`.

For store we have:
https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick cost of `20`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111076
2021-10-05 16:58:57 +03:00
Roman Lebedev 000ce0bfd5
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `8`.

For store we have:
https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111075
2021-10-05 16:58:57 +03:00
Roman Lebedev dcc2b0d933
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `6`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111073
2021-10-05 16:58:57 +03:00