Commit Graph

3272 Commits

Author SHA1 Message Date
David Green 255ad73424 [ARM] Make MVE v2i1 predicates legal
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16bit VPR.P0 register, with v2i1 holding two 8 bit values for the
two halves. This was never treated as a legal type in llvm in the past
as there are not many 64bit instructions and no 64bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatter and long multiplies and VCTP64
instructions.

This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in a way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way that expensive
things remain expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
where it comes up in practice.

The intrinsics currently remain using the v4i1 they previously did to
emulate a v2i1. This will be changed in a followup patch but this one
was already large enough.

Differential Revision: https://reviews.llvm.org/D114449
2021-12-03 14:05:41 +00:00
Nikita Popov 49d040ac97 [SCEV] Fix ValuesAtScopesUsers consistency
Fixes verification failure reported at:
https://reviews.llvm.org/rGc9f9be0381d1

The issue is that getSCEVAtScope() might compute a result without
inserting it in the ValuesAtScopes map in degenerate cases,
specifically if the ValuesAtScopes entry is invalidated during the
calculation. Arguably we should still insert the result if no
existing placeholder is found, but for now just tweak the logic
to only update ValuesAtScopesUsers if ValuesAtScopes is updated.
2021-12-03 10:03:10 +01:00
Florian Hahn 829b29b619
[MemoryLocation] strcat/strncat/strcpy read/write after their args.
strcpy/strcat/strncat access memory starting from the passed in
pointers. Construct memory locations for their args using getAfter.
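
A minimal sketch of what constructing such a location can look like (assuming
the usual getForArgument shape, with Call, ArgIdx and AATags in scope; not the
exact patch):

  // The accessed region starts at the passed-in pointer and has unknown
  // extent, so describe it as "everything after the pointer".
  return MemoryLocation::getAfter(Call->getArgOperand(ArgIdx), AATags);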

Discussed in D114872.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D114969
2021-12-03 08:48:23 +00:00
Daniil Fukalov ab05ab59a7 [CostModel][AMDGPU] Fix instructions costs estimation for vector types.
1. Fixed an inconsistency in vector instruction cost estimation - removed the
   different logic for "not simple types", since it biases costs for these types.
2. Fixed the legalization penalty for vectors too big for the target: changed from
   overwriting the default legalization cost estimate to adding a penalty.
3. Fixed a few typos in tests.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114893
2021-12-03 03:08:08 +03:00
Philip Reames 740057d185 [funcattrs] Infer writeonly argument attribute
This change extends the current logic for inferring readonly and readnone argument attributes to also infer writeonly.
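
For illustration, an argument the extended inference can now mark writeonly:

  // The pointer argument is only ever stored through, never read.
  void init(int *p) { *p = 0; }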

This change is deliberately minimal; there's a couple of areas for follow up.
* I left out all call handling and thus any benefit from the SCC walk. When examining the test changes, I realized the existing code is imprecise, and am going to fix that in its own revision before adding in the writeonly handling. (Mostly because updating the tests is hard when I, the human, can't figure out whether the result is correct.)
* I left out handling for storing a value (as opposed to storing to a pointer). This should benefit readonly/readnone as well, and applies to a bunch of other instructions. Seemed worth having as a separate review.

Differential Revision: https://reviews.llvm.org/D114963
2021-12-02 13:04:09 -08:00
Florian Hahn 222442ec2d
[BasicAA] Add tests for strcat/strncat/strcpy. 2021-12-02 17:38:07 +00:00
Florian Hahn 639a78a4bf
[MemoryLocation] Support strncpy in getForArgument.
The size argument of strncpy can be used as a bound for the size of
its pointer arguments.

strncpy is guaranteed to write N bytes and reads up to N bytes.
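
A rough sketch of the idea (names such as Call, ArgIdx and AATags are assumed
for illustration; not the exact patch): a constant length argument gives a
precise size for the destination, which is written in full, and an upper bound
for the source.

  if (const auto *Len = dyn_cast<ConstantInt>(Call->getArgOperand(2))) {
    LocationSize Size = ArgIdx == 0
                            ? LocationSize::precise(Len->getZExtValue())
                            : LocationSize::upperBound(Len->getZExtValue());
    return MemoryLocation(Call->getArgOperand(ArgIdx), Size, AATags);
  }
  return MemoryLocation::getAfter(Call->getArgOperand(ArgIdx), AATags);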

Reviewed By: xbolva00

Differential Revision: https://reviews.llvm.org/D114871
2021-12-02 14:18:05 +00:00
Florian Hahn 9f9e8ba114
[MemoryLocation] Support memset_chk in getForArgument.
The size argument for memset_chk is an upper bound for the size of the
pointer argument. memset_chk may write less than the specified length,
if it exceeds the specified max size and aborts.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D114870
2021-12-02 13:45:58 +00:00
Florian Hahn 47616c8855
[BasicAA] Add tests for memset_pattern{4,8,16}.
This also removes the existing memset_pattern.ll test, which was relying
on GVN. It is also covered by the new test directly.
2021-12-02 11:50:32 +00:00
Florian Hahn 524ad6babb
[BasicAA] Add memset_chk libfunc tests. 2021-12-01 14:15:46 +00:00
Florian Hahn c6bd63803f
[BasicAA] Add strncpy libfunc tests. 2021-12-01 14:15:40 +00:00
Roman Lebedev 8cd782487f
[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
We ask `TTI.getAddressComputationCost()` about the cost of computing a vector address,
and then multiply it by the vector width. This doesn't make any sense:
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR; we perform scalar GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that the cost of vector address computation is `10`, as compared to `1` for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Nikita Popov 77dd579827 [SCEV] Remove incorrect assert
Fix assertion failure reported on D113349 by removing the assert.
While the produced expression should be equivalent, it may not
be strictly the same, e.g. due to lazy nowrap flag updates. Similar
to what the main createSCEV() code does, simply retain the old
value map entry if one already exists.
2021-11-29 17:09:12 +01:00
Roman Lebedev 7e73c2a66a
[X86][Costmodel] `getInterleavedMemoryOpCostAVX512()`: masked load can not be folded into a shuffle
The mask on the shuffle is for the output, not the input.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114697
2021-11-29 18:37:07 +03:00
Roman Lebedev 5e96553608
[NFC][X86][LV][Costmodel] Add most basic test for masked interleaved load 2021-11-29 16:46:19 +03:00
Roman Lebedev cffe3a084f
[X86][Costmodel] Now that `getReplicationShuffleCost()` is good, update `getInterleavedMemoryOpCostAVX512()`
... to actually ask about i1-elt-wide mask, since that is what will probably be used on AVX512.
This unblocks D111460.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114316
2021-11-29 14:41:48 +03:00
Nikita Popov 2b160e95c8 Reland [SCEV] Fix and validate ValueExprMap/ExprValueMap consistency
Relative to the previous landing attempt, this introduces an additional
flag on forgetMemoizedResults() to not remove SCEVUnknown phis from
the value map. The invalidation after BECount calculation wants to
leave these alone and skips them in its own use-def walk, but we can
still end up invalidating them via forgetMemoizedResults() if there
is another IR value with the same SCEV. This is intended as a temporary
workaround only, and the need for this should go away once the
getBackedgeTakenInfo() invalidation is refactored in the spirit of
D114263.

-----

This adds validation for consistency of ValueExprMap and
ExprValueMap, and fixes identified issues:

* Addrec construction directly wrote to ValueExprMap in a few places,
  without updating ExprValueMap. Add a helper to ensure they stay
  consistent (see the sketch after this list). The adjustment in
  forgetSymbolicName() explicitly drops the old value from the map,
  so that we don't rely on it being overwritten.
* forgetMemoizedResultsImpl() was dropping the SCEV from
  ExprValueMap, but not dropping the corresponding entries from
  ValueExprMap.
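
A minimal sketch of such a helper (the name and exact shape are assumptions for
illustration; the in-tree version may differ): route every insertion through
one place so the forward and reverse maps cannot get out of sync.

  void ScalarEvolution::insertValueToMap(Value *V, const SCEV *S) {
    auto It = ValueExprMap.insert({SCEVCallbackVH(V, this), S});
    if (It.second)
      ExprValueMap[S].insert(V);
  }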

Differential Revision: https://reviews.llvm.org/D113349
2021-11-27 12:37:15 +01:00
Zarko Todorovski 7f7dac7126 [NFC][llvm] Inclusive language: reword uses of sanity test and check
Part of continuing work to use more inclusive language. Reworded uses
of sanity check and sanity test in llvm/test/
2021-11-25 07:21:42 -05:00
Graham Hunter dee810e117 [NFC][LAA] Precommit tests for forked pointers
Precommit for https://reviews.llvm.org/D108699
2021-11-24 16:20:35 +00:00
Peter Waller 787b66eb5f [LoopAccessAnalysis][SVE] Bail out for scalable vectors
The supplied test case, reduced from real world code, crashes with an
'Invalid size request on a scalable vector.' error.

Since it's similar in spirit to an existing LAA test, rename the file to
generalize it to both.

Differential Revision: https://reviews.llvm.org/D114155
2021-11-24 15:52:20 +00:00
Roman Lebedev cd8d219536
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 32 bit when have AVX512DQ
I believe this effectively completes `X86TTIImpl::getReplicationShuffleCost()`
for AVX512, other than the question of handling plain AVX512F,
where we end up with some really ugly "shuffles",
but then, are there any CPUs that support AVX512 but not AVX512DQ/AVX512BW?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114315
2021-11-24 17:23:15 +03:00
Florian Mayer 6c06d8e310 [stack-safety] Check SCEV constraints at memory instructions.
Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D113160
2021-11-23 15:29:23 -08:00
Roman Lebedev 704d92607d
[X86][TTI] Finish costmodel for AVX512BW's VPMOVM2[BW] / VPMOV[BW]2M instructions
Apparently my methodology was suboptimal: not only did I miss all the +VL tuples,
I also missed some plain tuples. I believe this adds everything that was missing.
Indeed, these manual cost models are just not okay long-term.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114334
2021-11-22 14:31:34 +03:00
Roman Lebedev 8d09dd61c3
[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions
Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW,
these either sign-extend the mask register into a vector,
or pack the mask from a vector register.

Apparently, we didn't even have MCA tests for these,
added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05,
so I'm just guessing that their perf characteristics
are optimal.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114314
2021-11-22 14:31:34 +03:00
Sjoerd Meijer 4d21b64464 [BPI] Look-up tables for non-loop branches. NFC.
This adds and uses look-up tables for non-loop branch probabilities, which have
probabilities directly encoded into the tables for the different condition
codes. Compared to having this logic inlined in different functions, as used to
be the case, I think this is more compact and thus also easier to check and
cross-reference. This also adds a test for pointer heuristics that was missing.
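
Illustrative sketch only (the table type and the probability values shown are
assumptions, not taken from the patch): probabilities keyed by condition code
rather than hard-coded in each heuristic.

  using ProbabilityTable = std::map<CmpInst::Predicate, BranchProbability>;
  static const ProbabilityTable PointerTable = {
      {CmpInst::ICMP_NE, BranchProbability(20, 32)}, // p != q: likely taken
      {CmpInst::ICMP_EQ, BranchProbability(12, 32)}, // p == q: likely not taken
  };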

Differential Revision: https://reviews.llvm.org/D114009
2021-11-22 10:30:42 +00:00
Roman Lebedev df70cf5e14
[NFC][X86][Costmodel] Actually test +prefer-256-bit in replication-shuffle-related tests :(
While the -prefer-256-bit coverage indeed becomes complete with D114314,
the real-world coverage (the one with +prefer-256-bit) is lacking.

Hilarious.
2021-11-21 01:25:49 +03:00
Roman Lebedev da47a63e03
[NFC][X86][Costmodel] Add AVX512DQ runlines to trunc.ll/extend.ll 2021-11-20 13:55:13 +03:00
Roman Lebedev 049799c311
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 8 bit when have AVX512BW+AVX512VBMI
If in addition to AVX512BW (that provides `{k}<->{i8,i16}` casts and i16 shuffles),
we have AVX512VBMI, which provides i8 shuffles, we are in an optimal situation.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114071
2021-11-19 15:58:10 +03:00
Roman Lebedev a751084bb4
[X86][Costmodel] `trunc v16i8 to v8i1` can appear after legalization, cost is same as for `trunc v8i8 to v8i1`
Note that there are many other missing costs; I'm *only* adding the ones that are queried
from `getReplicationShuffleCost()` for the existing (quite exhaustive) test coverage.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114070
2021-11-19 15:57:32 +03:00
Roman Lebedev a50fdd3fc9
[X86][Costmodel] `getReplicationShuffleCost()`: promote 1 bit-wide elements to 16 bit when have AVX512BW
Here we get pretty lucky. AVX512F does not provide any instructions
to convert between a `k` vector mask and a vector,
but AVX512BW adds `{k}<->nX{i8,i16}` conversions,
and, as it happens, with AVX512BW we have an i16 shuffle.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113915
2021-11-19 15:55:41 +03:00
Philip Reames ea12c2cb9c [SCEV] Move mustprogress based no-self-wrap logic so it applies to all exit conditions
This change moves logic which we'd added specifically for less than tests so that it applies to equalities and greater than tests as well. The basic idea is that if we can show an IV cycles infinitely through the same series on self-wrap, and that the exit condition must be taken to prevent UB, we can conclude that it must be taken before self-wrap and thus infer said flag.

The motivation here is simple loops with unsigned induction variables w/non-one steps and inequality tests. A toy example would be:
for (unsigned i = 0; i != N; i += 2) { body; }

If the body contains no side effects, and this is a mustprogress function, we can assume that this must be a finite loop and thus that the exit count is N/2.

Differential Revision: https://reviews.llvm.org/D103991
2021-11-18 10:07:44 -08:00
Philip Reames 100df68496 [SCEV] Add test coverage for invertible functions of IVs 2021-11-18 08:56:45 -08:00
Florian Hahn da9f2ba3b1
[SCEV] Reorder operands checks in collectConditions.
The initial two cases require a SCEVConstant as RHS. Pull up the condition
to check and swap SCEVConstants from below. Also remove a redundant
check & swap if RHS is SCEVUnknown.
2021-11-18 09:36:16 +00:00
Florian Hahn dd6281c4c1
[SCEV] Add additional guard tests with swapped condition ops. 2021-11-18 09:35:19 +00:00
Philip Reames 0623f52a46 Autogen a test for ease of update 2021-11-17 17:20:57 -08:00
Philip Reames ad69402f3e [SCEVAA] Avoid forming malformed pointer diff expressions
This solves the same crash as in D104503, but with a different approach.

The test case test_non_dom demonstrates a case where scev-aa crashes today. (If exercised either by -eval-aa or -licm.) The basic problem is that SCEV-AA expects to be able to compute a pointer difference between two SCEVs for any two pair of pointers we do an alias query on. For (valid, but out of scope) reasons, we can end up asking whether expressions in different sub-loops can alias each other. This results in a subtraction expression being formed where neither operand dominates the other.

The approach this patch takes is to leverage the "defining scope" notion we introduced for flag semantics to detect and disallow the formation of the problematic SCEV. This ends up being relatively straightforward on that new infrastructure. This change does hint that we should probably be verifying a similar property for all SCEVs somewhere, but I'll leave that to a follow-on change.

Differential Revision: D114112
2021-11-17 12:38:04 -08:00
Florian Hahn e8b55cf7b7
[SCEV] Apply loop guards when computing max BTC for arbitrary steps.
As in other cases in the current function (e.g. when the step is 1 or
-1), applying loop guards can lead to tighter upper bounds for the
backedge-taken counts.

Fixes PR52464.

Reviewed By: reames, nikic

Differential Revision: https://reviews.llvm.org/D113578
2021-11-17 11:00:49 +00:00
Roman Lebedev 496ccb543e
[NFC][X86][Costmodel] Improve test coverage for i32->i64 vector *ext 2021-11-17 12:02:50 +03:00
Roman Lebedev 2037ec725f
[X86][Costmodel] `*ext v64i1 to v32i16` can appear after legalization, cost is same as for `*ext v32i1 to v32i16`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113914
2021-11-17 12:02:50 +03:00
Roman Lebedev 23b194bf18
[X86][Costmodel] `trunc v32i16 to v64i1` can appear after legalization, cost is same as for `trunc v32i16 to v32i1`
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113913
2021-11-17 12:02:50 +03:00
Philip Reames 8d85e945b2 [SCEV] Canonicalize X - urem X, Y patterns
There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing, and thus, depending on which you analyzed, you could get different results. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.

The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert to the andn form which SCEV recognizes, so essentially, this is just fixing a nasty pass ordering dependency.
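
For reference, the equivalent forms involved, written out in C (all three
compute the same value when Y is a non-zero power of two):

  unsigned rounded_sub(unsigned X, unsigned Y) { return X - X % Y; }         // sub form
  unsigned rounded_and(unsigned X, unsigned Y) { return X - (X & (Y - 1)); } // unroller output
  unsigned rounded_mul(unsigned X, unsigned Y) { return X / Y * Y; }         // canonical Y * X/Y form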

The ARM loop hardware interaction in the test diff is opaque to me, but the comments in the review from others with knowledge of the infrastructure appear to indicate these are improvements in loop recognition, not regressions.

Differential Revision: https://reviews.llvm.org/D114018
2021-11-16 11:59:21 -08:00
Philip Reames 3dd6d5b628 [tests] Add coverage for different forms of X - urem X, Y 2021-11-16 09:26:34 -08:00
Philip Reames 56ae2cfecf autogen a SCEV test file 2021-11-16 09:26:34 -08:00
Florian Hahn b7aec4f08e
[SCEV] Support rewriting ZExt expressions with loop guard info.
So far, applying loop guard information has been restricted to
SCEVUnknown. In a few cases, like PR40961 and PR52464, this leads to
SCEV failing to determine tight upper bounds for the backedge taken
count.

This patch adjusts SCEVLoopGuardRewriter and applyLoopGuards to support
rewriting ZExt expressions.

This is a first step towards fixing PR40961 and PR52464.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D113577
2021-11-16 11:16:07 +00:00
David Green 309f1e4ac8 [ARM] Add datalayout to costmodel tests. NFC
This adds a sensible datalayout to the ARM cost model tests, to prevent
the costs reported being incorrect for the size of pointers.
2021-11-16 09:49:42 +00:00
Roman Lebedev 7114c60e8f
[NFC][X86][Costmodel] Improve test coverage for {i8,i16,i32,i64}->i1 vector trunc 2021-11-15 20:46:48 +03:00
Roman Lebedev 949103dc36
[NFC][X86][Costmodel] Improve test coverage for i1->{i8,i16,i32,i64} vector *ext 2021-11-15 20:46:48 +03:00
Roman Lebedev bc35d5fe2f
[NFC][X86][Costmodel] Add i1 replication shuffle costmodel test coverage 2021-11-15 20:02:52 +03:00
Roman Lebedev 5c7255fe3a
[X86][Costmodel] `getReplicationShuffleCost()`: promote 8 bit-wide elements to 32 bit when no AVX512VBMI
Currently `X86TTIImpl::getInterleavedMemoryOpCostAVX512()` asks about i8 elt type,
so this change does affect vectorization. In the end, it will ask about i1.

We should also try to promote to i16 if we have AVX512BW; I'll do that in a follow-up.
All costs here look good; I've added the missing truncation costs in preparatory patches.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113853
2021-11-15 19:04:02 +03:00
Roman Lebedev a468c39c90
[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8`
Some of the costs get larger here,
but I suppose that makes sense, since we'd previously query
scalarization costs that may not really be representative of reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev 9e57d9b09d
[X86][Costmodel] `trunc v8i64 to v16i8/v32i8/v64i8` can appear after legalization, cost is same as for `trunc v8i64 to v8i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that I'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113851
2021-11-15 19:04:02 +03:00
Roman Lebedev 0116c708c6
[X86][Costmodel] `trunc v16i32 to v32i8/v64i8` can appear after legalization, cost is same as for `trunc v16i32 to v16i8`
While this one is trivial and identical to the previous patch,
there is a weird cost change in a follow-up patch that I'm not sure about.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113850
2021-11-15 19:04:02 +03:00
Roman Lebedev f86b57e37c
[NFC][X86][Costmodel] Improve test coverage for {i16,i32,i64}->i8 vector trunc 2021-11-14 20:25:40 +03:00
Roman Lebedev f0da329f93
[NFC][X86][Costmodel] Improve test coverage for i8->{i16,i32,i64} vector *ext 2021-11-14 20:25:33 +03:00
Roman Lebedev 4dd2f0446c
[X86][Costmodel] `getReplicationShuffleCost()`: promote 16 bit-wide elements to 32 bit when no AVX512BW
The basic idea is simple: if we don't have a native shuffle for this element type,
then we must have a native shuffle for a wider element type,
so promote, replicate, demote.

I believe asking `getCastInstrCost(Instruction::Trunc, ...)` is semantically correct;
case in point, `trunc <32 x i32> to <32 x i8>` aka 2 * ZMM will naively result in
2 * XMM, which will then be packed into 1 * YMM,
and it should count the cost of said packing,
not just the truncations.
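
A sketch of the resulting cost computation (type and variable names here are
assumptions for illustration, not the exact code): promote, do the replication
shuffle at the wider element type, then demote back.

  InstructionCost Cost =
      getCastInstrCost(Instruction::ZExt, WideSrcVecTy, SrcVecTy,
                       TTI::CastContextHint::None, CostKind) +
      getReplicationShuffleCost(WideEltTy, ReplicationFactor, VF,
                                DemandedDstElts, CostKind) +
      getCastInstrCost(Instruction::Trunc, DstVecTy, WideDstVecTy,
                       TTI::CastContextHint::None, CostKind);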

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113609
2021-11-14 20:01:38 +03:00
Roman Lebedev b283961012
[X86][Costmodel] `trunc v8i64 to v16i16/v32i16` can appear after legalization, cost is same as for `trunc v8i64 to v8i16`
Same as D113842, but for i64

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113843
2021-11-14 18:41:38 +03:00
Roman Lebedev a5f2fdca99
[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16`
This was noticed in D113609, hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Roman Lebedev 17a3df87ff
[NFC][X86][Costmodel] Improve test coverage for {i32,i64}->i16 vector *ext
See https://reviews.llvm.org/D113609 - some of these costs seem wrong.
2021-11-14 16:07:30 +03:00
Roman Lebedev fd24446ba5
[NFC][X86][Costmodel] Improve test coverage for i16->{i32,i64} vector *ext 2021-11-14 16:06:45 +03:00
Florian Hahn 69c1cbe20f
[SCEV] Add test case where applying zext info pessimizes BTC.
Add an additional test case for D113578.
2021-11-12 12:19:35 +00:00
Florian Hahn 5dfe60d171
[SCEV] Add tests where guards limit both %n and (zext %n).
Suggested in D113577.
2021-11-12 10:31:35 +00:00
Florian Hahn 8d2a1994c8
[AArch64] Add some fp16 cast cost-model tests.
This adds initial tests for cost-modeling {u,s}itofp for fp16 vectors.
At the moment, they are under-estimated in a couple of cases.
2021-11-11 18:21:44 +00:00
Roman Lebedev a70d74323e
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 8 bit-wide elements with AVX512VBMI
VBMI introduced VPERMB, so cost-model i8 replication shuffle using it.
Note that we can still model i8 replication shuffle without VBMI,
by promoting to i16/i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113479
2021-11-10 22:52:40 +03:00
Roman Lebedev c6e894b9b2
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 16 bit-wide elements with AVX512BW
BWI introduced VPERMW, so cost-model i16 replication shuffle using it.
Note that we can still model i16 replication shuffle without BWI,
by promoting to i32. That will be done in follow-ups.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113478
2021-11-10 22:52:39 +03:00
Roman Lebedev 4101c7bf19
[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F
This models lowering to `vpermd`/`vpermq`/`vpermps`/`vpermpd`,
which take a single input vector and a single index vector,
and are cross-lane. So far I haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.

This results in *shockingly* lower costs :)

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113350
2021-11-10 22:52:33 +03:00
Florian Hahn fcf2ae9923
[SCEV] Add tests that require rewriting zexts when applying guards.
Precommit tests inspired by PR40961 and PR52464.
2021-11-10 16:58:27 +00:00
Roman Lebedev 3bdf738d1b
[NFC][X86][Costmodel] Add i16 replication shuffle costmodel test coverage 2021-11-09 14:19:44 +03:00
Nikita Popov a8c318b50e [BasicAA] Use index size instead of pointer size
When accumulating the GEP offset in BasicAA, we should use the
pointer index size rather than the pointer size.
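
A minimal sketch of the change's shape (variable names assumed):

  // Ask DataLayout for the index width of the pointer's address space and
  // accumulate GEP offsets at that width, not at the pointer width.
  unsigned IndexSize = DL.getIndexSizeInBits(PtrTy->getPointerAddressSpace());
  APInt Offset(IndexSize, 0);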

Differential Revision: https://reviews.llvm.org/D112370
2021-11-07 18:56:11 +01:00
Roman Lebedev 23566f18c6
[NFC][X86][Costmodel] Add tests for i32/i64 replication shuffles
While this isn't what we eventually need (i8 or i1),
approaching from this end is more straightforward.
2021-11-06 17:14:56 +03:00
Roman Lebedev a30ec4778a
[TTI][CostModel] `getUserCost()`: recognize replication shuffles and query their cost
This finally creates proper test coverage for replication shuffles,
which are used by LV for conditional loads, and will allow adding a
proper cost model, at least for AVX512.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113324
2021-11-06 16:45:15 +03:00
Philip Reames d24a0e8857 [SCEV] Use constant range of RHS to prove NUW on narrow IV in trip count logic
The basic idea here is that given a zero extended narrow IV, we can prove the inner IV to be NUW if we can prove there's a value the inner IV must take before overflow which must exit the loop.

Differential Revision: https://reviews.llvm.org/D109457
2021-11-05 15:36:47 -07:00
Roman Lebedev 04fa7cbf55
[NFC][CostModel] Add exhaustive test coverage for replication shuffles
This coverage has been brought to you by https://godbolt.org/z/nfc3cY1za
2021-11-06 00:53:28 +03:00
Roman Lebedev ad617183bb
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: mask is i8 not i1
Even though AVX512's masked mem ops (unlike AVX1/2) have a mask
that is a `VF x i1`, replication of said masks happens after
promotion of it to `VF x i8`, so we should use `i8`, not `i1`,
when calculating the cost of mask replication.
2021-11-05 17:27:02 +03:00
Roman Lebedev 34b903d8b0
[NFC] Add forgotten `REQUIRES: asserts` into the new costmodel test 2021-11-03 19:40:23 +03:00
Roman Lebedev c65e2ac405
[NFC] Rewrite runlines in interleaved-store-accesses-with-gaps.ll once again
https://lab.llvm.org/buildbot/#/builders/98/builds/8198 is still failing,
and I really don't understand how runlines in this test differ
from the ones in other nearby tests...
2021-11-03 19:15:33 +03:00
Roman Lebedev df93c8a919
[X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask
I don't really buy that masked interleaved memory loads/stores are supported on X86.
There is zero costmodel test coverage, no actual cost modelling for the generation
of the mask repetition, and basically only two LV tests.
Additionally, I'm not very interested in AVX512.

I don't know if this really helps the "soft" block over at
https://reviews.llvm.org/D111460#inline-1075467,
but I think it can't make things worse, at least.

When we are being told that there is a masking, instead of
completely giving up and falling back to
fully scalarizing `BasicTTIImplBase::getInterleavedMemoryOpCost()`,
let's correctly query the cost of masked memory ops,
keep all the pretty shuffle cost modelling,
but scalarize the cost computation for the mask replication.

I think not scalarizing the shuffles themselves
may adjust the computed costs a bit,
maybe hopefully just enough to hide the "regressions"
at https://reviews.llvm.org/D111460#inline-1075467.
I do mean hide, because the test coverage is non-existent.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112873
2021-11-03 18:14:35 +03:00
Roman Lebedev f3d1ddfe71
[NFC] Use single-dash-prefixed options in newly-added test
https://lab.llvm.org/buildbot/#/builders/98/builds/8195 complains,
and this is the only guess I have.
2021-11-03 18:12:40 +03:00
Roman Lebedev a4b64f7727
[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used
As can be seen in `InnerLoopVectorizer::vectorizeInterleaveGroup()`,
in some cases (reported by `UseMaskForGaps`), the gaps in the interleaved load/store group
will be masked away by another constant mask, so there is no need to
account for the cost of replication of the mask for these.

Differential Revision: https://reviews.llvm.org/D112877
2021-11-03 17:33:28 +03:00
Roman Lebedev c6b3da1d66
[NFC][X86] Duplicate LV test into a costmodel test
Copied from llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
As discussed in D111460 / D112877 / D112873 we have basically no test coverage
for this part of cost model.
2021-11-03 17:25:18 +03:00
Nikita Popov 51e9f33603 [BasicAA] Use saturating multiply on range if nsw
If we know that the var * scale multiplication is nsw, we can use
a saturating multiplication on the range (as a good approximation
of an nsw multiply). This recovers some cases where the fix from
D112611 is unnecessarily strict. (This can be further strengthened
by using a saturating add, but we currently don't track all the
necessary information for that.)

This exposes an issue in our NSW tracking for multiplies. The code
was assuming that (X +nsw Y) *nsw Z results in
(X *nsw Z) +nsw (Y *nsw Z) -- however, it is possible that the
distributed multiplications overflow, even if the non-distributed
one does not. We should discard the nsw flag if the offset is
non-zero. If we just have (X *nsw Y) *nsw Z then concluding
X *nsw (Y *nsw Z) is fine.
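
A concrete i8 illustration of why distributing can lose nsw:

  // X = 100, Y = -50, Z = 2
  // (X + Y) * Z  =   50 * 2 = 100   -- fits in i8, both ops can be nsw
  //  X * Z       =  100 * 2 = 200   -- overflows i8, so the distributed X*Z is not nsw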

Differential Revision: https://reviews.llvm.org/D112848
2021-11-02 20:27:39 +01:00
Arthur Eubanks 029f1a5344 [LazyCallGraph] Skip blockaddresses
Blockaddresses do not participate in the call graph, since the only
instructions that use them must all return to someplace within the
current function. And passes cannot retrieve a function address from a
blockaddress.

This was suggested by efriedma in D58260.

Fixes PR50881.

Reviewed By: nickdesaulniers

Differential Revision: https://reviews.llvm.org/D112178
2021-11-01 13:10:24 -07:00
Nikita Popov 7cf7378a9d [BasicAA] Don't treat non-inbounds GEP as nsw
The scale multiplication is only guaranteed to be nsw if the GEP
is inbounds (or the multiplication is trivial). Previously we were
only considering explicit muls in GEP indices.
2021-10-29 22:30:44 +02:00
Nikita Popov 4dd540d9c8 [BasicAA] Add missing inbounds to tests (NFC)
Add missing inbounds to tests that are not correct without it due
to possibility of offset overflow.

inbounds: https://alive2.llvm.org/ce/z/LC8G9_
w/o inbounds: https://alive2.llvm.org/ce/z/ErrJVW
2021-10-29 19:05:39 +02:00
Nikita Popov 36b22f7845 [BasicAA] Add range test with nsw (NFC) 2021-10-29 18:00:25 +02:00
Nikita Popov fbc0c308d5 [BasicAA] Handle known bits as ranges
BasicAA currently tries to determine that the offset is positive by
checking whether all variable indices are positive based on known
bits, multiplied by a positive scale. However, this is incorrect
if the scale multiplication might overflow. In the modified test
case the original value is positive, but may be negative after a
left shift.

Fix this by converting known bits into a constant range and reusing
the range-based logic, which handles overflow correctly.
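
Sketch of the conversion step (shape assumed, not the exact code):

  // Derive a signed range from the known bits and hand it to the existing
  // range-based logic, which models overflow correctly.
  ConstantRange Range = ConstantRange::fromKnownBits(Known, /*IsSigned=*/true);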

Differential Revision: https://reviews.llvm.org/D112611
2021-10-27 14:41:31 +02:00
Nikita Popov 9bc7e543b4 [BasicAA] Make range check more precise
Make the range check more precise by calculating the range of
potentially accessed bytes for both accesses and checking whether
their intersection is empty. In that case there can be no overlap
between the accesses and the result is NoAlias.

This is more powerful than the previous approach, because it can
deal with sign-wrapped ranges. In the test case the original range
is [-1, INT_MAX] but becomes [0, INT_MIN] after applying the offset.
This is a wrapping range, so getSignedMin/getSignedMax will treat
it as a full range. However, the range excludes the elements
[INT_MIN+1, -1], which is enough to prove NoAlias with an access
at offset -1.
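
The core idea as a sketch (variable names assumed): compute the set of bytes
each access may touch and check whether the two sets can intersect at all.

  ConstantRange Bytes1 =
      OffsetRange1.add(ConstantRange(APInt(BW, 0), APInt(BW, Size1)));
  ConstantRange Bytes2 =
      OffsetRange2.add(ConstantRange(APInt(BW, 0), APInt(BW, Size2)));
  if (Bytes1.intersectWith(Bytes2).isEmptySet())
    return AliasResult::NoAlias;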

Differential Revision: https://reviews.llvm.org/D112486
2021-10-27 12:40:58 +02:00
Roman Lebedev db848fbf67
[NFC][LV][X86] Improve test coverage for masked mem ops 2021-10-27 13:36:04 +03:00
David Green 74b2a4edcc [AArch64] Add a costmodel test for overflowing arithmatic. NFC 2021-10-26 10:35:12 +01:00
Nikita Popov 721569cc36 [BasicAA] Add test for benign range overflow (NFC) 2021-10-25 22:09:39 +02:00
Nikita Popov 7e97347409 [BasicAA] Add test for incorrect non-negative logic (NFC) 2021-10-25 18:02:41 +02:00
Nikita Popov 0d20ebf686 [BasicAA] Use ranges for more than one index
D109746 made BasicAA use range information to determine the
minimum/maximum GEP offset. However, it was limited to the case of
a single variable index. This patch extends support to multiple
indices by adding all the ranges together.

Differential Revision: https://reviews.llvm.org/D112378
2021-10-25 15:30:50 +02:00
Nikita Popov 2ae67c9684 [BasicAA] Add range test with multiple indices (NFC) 2021-10-24 16:13:03 +02:00
Nikita Popov 61cfdf636d [BasicAA] Model implicit trunc of GEP indices
GEP indices larger than the GEP index size are implicitly truncated
to the index size. BasicAA currently doesn't model this, resulting
in incorrect alias analysis results.

Fix this by explicitly modelling truncation in CastedValue in the
same way we do zext and sext. Additionally we need to disable a
number of optimizations for truncated values, in particular
"non-zero" and "non-equal" may no longer hold after truncation.
I believe the constant offset heuristic is also not necessarily
correct for truncated values, but wasn't able to come up with a
test for that one.

A possible followup here would be to use the new mechanism to
model explicit trunc as well (which should be much more common,
as it is the canonical form). This is straightforward, but omitted
here to separate the correctness fix from the analysis improvement.

(Side note: While I say "index size" above, BasicAA currently uses
the pointer size instead. Something for another day...)

Differential Revision: https://reviews.llvm.org/D110977
2021-10-22 23:47:02 +02:00
Roman Lebedev 8fac9e95ad
[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does is naively scale the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may be an over- or under-estimate.
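
The scaling itself, as a sketch (parameter names assumed):

  // Scale the cost of the fully-interleaved group by the fraction of members
  // that are actually demanded (live).
  InstructionCost scaleByLiveMembers(InstructionCost FullGroupCost,
                                     unsigned NumDemandedMembers,
                                     unsigned Factor) {
    return FullGroupCost * NumDemandedMembers / Factor;
  }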

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307
2021-10-22 16:33:58 +03:00
David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.
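
A sketch of that initialization rule (the getVScaleForCPU helper is
hypothetical, named here only for illustration):

  StringRef EffectiveTuneCPU = !TuneCPU.empty() ? TuneCPU : CPU;
  VScaleForTuning = (EffectiveTuneCPU.empty() || EffectiveTuneCPU == "generic")
                        ? 2
                        : getVScaleForCPU(EffectiveTuneCPU);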

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Simon Pilgrim 5b395bd633 [CostModel][X86] Add costs for multiply-by-pow2 constants
These are folded to left shifts in the backend.
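
For illustration, the kind of folding being costed:

  unsigned mul_by_8(unsigned x) { return x * 8; }  // the backend emits x << 3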

We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436
2021-10-20 13:11:21 +01:00
David Green 862e8d7e55 [AArch64] Improve div and rem costmodel tests. NFC
Copied from the X86 tests, these give a better test coveraged than the
existing tests.
2021-10-20 09:58:35 +01:00
Bjorn Pettersson 08619006a0 [SCEV] Avoid compile time explosion in ScalarEvolution::isImpliedCond
As seen in PR51869 the ScalarEvolution::isImpliedCond function might
end up spending lots of time when doing the isKnownPredicate checks.

Calling isKnownPredicate, for example, results in isKnownViaInduction
being called, which might result in isLoopBackedgeGuardedByCond being
called, and then we might get one or more new calls to isImpliedCond.
Even if the scenario described here isn't an infinite loop, using
some randomly generated C programs as input indicates that those
isKnownPredicate checks quite often return true. On the other hand,
the third condition that needs to be fulfilled in order to "prove
implications via truncation", i.e. the isImpliedCondBalancedTypes
check, is rarely fulfilled.
I also made some similar experiments to look at how often we would
get the same result when using isKnownViaNonRecursiveReasoning instead
of isKnownPredicate. So far I haven't seen a single case when codegen
is negatively impacted by using isKnownViaNonRecursiveReasoning. On
the other hand, it seems like we get rid of the compile time explosion
seen in PR51869 that way. Hence this patch.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D112080
2021-10-19 21:37:57 +02:00
Kirill Stoimenov 62627c7217 [Sanitizers] Replaced getMaxPointerSizeInBits with getPointerSizeInBits, which was causing failures for 32bit x86.
Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D111829
2021-10-18 09:31:14 -07:00
Simon Pilgrim f041338153 [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:46:10 +01:00
Simon Pilgrim c850d5c5c8 [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:15:14 +01:00
Simon Pilgrim dc3382dc2c [CostModel][X86] Add mul by positive/negative power-of-2 constants tests
We have backend optimizations for these, but currently the costmodel doesn't match them
2021-10-17 20:34:17 +01:00
Simon Pilgrim dbf5dc8930 [CostModel][X86] Add div/rem by negative power-of-2 constants
We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them
2021-10-17 18:51:15 +01:00
Roman Lebedev 91373bf12e
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So could pick cost of `40`

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick cost of `40`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
2021-10-17 17:28:10 +03:00
Roman Lebedev 3274ce3a28
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
2021-10-17 17:28:10 +03:00
Roman Lebedev 3a6a9f74d3
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So could pick cost of `68`

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
2021-10-17 17:28:09 +03:00
Roman Lebedev 4b76a74b42
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
2021-10-17 17:28:09 +03:00
Roman Lebedev 887acf6842
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them;
they all require a lot of manual assembly surgery.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So could pick cost of `212`

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
2021-10-17 17:28:09 +03:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`
and tells LV to try to vectorize the addresses that lead to loads.
But X86 generally cannot deal with vectors of addresses;
the only instructions that support that are GATHER/SCATTER,
and even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.
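
A sketch of what such an override can look like (condition spelled out per the
description above; the exact in-tree form may differ):

  bool X86TTIImpl::prefersVectorizedAddressing() const {
    return ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather());
  }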

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While I've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of the last ~70 patches (wow),
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus -
it is usually almost an order of magnitude overestimation -
so I would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then I'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Nikita Popov 5f05ff081f [BasicAA] Improve scalable vector handling
Currently, DecomposeGEP() bails out on the whole decomposition if
it encounters a scalable GEP type anywhere. However, it is fine to
still analyze other GEPs that we look through before hitting the
scalable GEP. This does mean that the decomposed GEP base is no
longer required to be the same as the underlying object. However,
I don't believe this property is necessary for correctness anymore.

This allows us to compute slightly more precise aliasing results
for GEP chains containing scalable vectors, though my primary
interest here is simplifying the code.

Differential Revision: https://reviews.llvm.org/D110511
2021-10-14 20:23:50 +02:00
Simon Pilgrim 77dcdc2f50 [CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32
Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))
2021-10-14 12:17:40 +01:00
Roman Lebedev cb41efb5f4
[NFC][Costmodel][X86] Fix broken `CHECK-NOT`'s in interleave costmodel tests 2021-10-13 22:44:57 +03:00
Roman Lebedev 18eef13dad
[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()`
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of a sequence of `insertelement`/`extractelement`.
  If we are operating not on XMM (but on YMM/ZMM),
  this widely overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
Florian Hahn 4cd6cc64ed [SCEV] Add test for propagating poison through select condition.
Precommit a test for D111643.
2021-10-13 17:14:35 +01:00
Arthur Eubanks 259390de9a [LCG] Don't skip invalidation of LazyCallGraph if CFG analyses are preserved
The CFG being changed and the overall call graph are not related; we can introduce/remove calls without changing the CFG.

Resolves one of the issues in PR51946.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D111275
2021-10-11 13:30:47 -07:00
Clement Courbet 342d7b654c [BasicAA][NFC] Improve comment. 2021-10-11 10:42:59 +02:00
Clement Courbet 83ded5d323 re-land "[AA] Teach BasicAA to recognize basic GEP range information."
Now that PR52104 is fixed.
2021-10-11 10:04:22 +02:00
David Green adec922361 [AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.

The idea is that in-order CPUs require the most help in instruction
scheduling, whereas out-of-order CPUs can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.

On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.

When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.

A lot of existing tests have been updated. This is a summary of the important
differences:
 - Most changes are the same instructions in a different order.
 - Sometimes this leads to very minor inefficiencies, such as requiring
   an extra mov to move variables into r0/v0 for the return value of a test
   function.
 - misched-fusion.ll was no longer fusing the pairs of instructions it
   should, as per D110561. I've changed the schedule used in the test
   for now.
 - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
   the different latencies. This seems fine to me.
 - Some SVE tests do not always remove movprfx where they did before due
   to different register allocation giving different destructive forms.
 - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
   produce two LDR where they previously produced an LDP due to
   store-pair-suppress kicking in.
 - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
 - Some tests such as arm64-neon-mul-div.ll and
   ragreedy-local-interval-cost.ll have more, less or just different
   spilling.
 - In aarch64_generated_funcs.ll.generated.expected one part of the
   function is no longer outlined. Interestingly, if I switch this to use
   any other schedule, even less is outlined.

Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.

Differential Revision: https://reviews.llvm.org/D110830
2021-10-09 15:58:31 +01:00
Simon Pilgrim b6426d5211 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT/UGT for generic abs/min/max cost expansion
Split off ABS cost handling from MIN/MAX and use explicit predicates for each

Our generic expansion of ABS doesn't use NEG+CMP+SELECT any more (it's now ASHR+ADD+XOR), so this needs to be updated.
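
The ASHR+ADD+XOR expansion referenced above, written out for a 32-bit int:

  int abs_expanded(int x) {
    int m = x >> 31;      // ashr: all ones if x is negative, zero otherwise
    return (x + m) ^ m;   // add then xor negates x when m is all ones
  }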
2021-10-08 12:41:58 +01:00
Simon Pilgrim 716883736b [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT for generic sadd/ssub sat cost expansion
The comparison always checks for negative values, so we know the icmp predicate will be ICMP_SGT
2021-10-07 15:42:45 +01:00
Philip Reames 1183d65b4d [SCEV] Search operand tree for scope bound when inferring flags from IR
When checking to see if we can apply IR flags to a SCEV, we need to identify a bound on the defining scope of the SCEV to be produced.  We'd previously added support for a couple SCEVExpr types which trivially imply bounds, but hadn't handled types such as umax where the bounds come from the bounds of the operands.  This does the obvious thing, and recurses through operands searching for a tighter bound on the defining scope.

I'm honestly surprised by how little this seems to matter on existing tests, but it's worth doing for completeness' sake alone.

Differential Revision: https://reviews.llvm.org/D111191
2021-10-06 15:10:02 -07:00
Philip Reames 2b3d913cc5 [tests] precommit test changes for D111191 2021-10-06 12:12:49 -07:00
Philip Reames 67896f494e Returning poison from a function w/ noundef return attribute is UB
This does for returns within said function what we already do on the caller side when reasoning about what might be poison.

Differential Revision: https://reviews.llvm.org/D111180
2021-10-06 11:52:18 -07:00
Philip Reames 0658bab870 [SCEV] Infer flags from add/gep in any block
This patch removes a compile time restriction from isSCEVExprNeverPoison. We've strengthened our ability to reason about flags on scopes other than addrecs, and this bailout prevents us from using it. The comment is also suspect as well in that we're in the middle of constructing a SCEV for I. As such, we're going to visit all operands *anyways*.

Differential Revision: https://reviews.llvm.org/D111186
2021-10-06 11:11:54 -07:00
Simon Pilgrim 2ced9a42be [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion
Match the predicate used in TargetLowering::expandMULO to detect overflow
2021-10-06 19:11:33 +01:00
Simon Pilgrim 7bd097fd1e [CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion
Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detect signed values.

The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply.

More closely matches the default expansion from TargetLowering::expandMULO
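
The shape of the generic unsigned-multiply-overflow check this models, as a
sketch (assuming 32-bit unsigned): widen, multiply, shift the high half down,
and compare it against zero - the ICMP_NE mentioned above.

  bool umul32_overflows(unsigned a, unsigned b, unsigned *res) {
    unsigned long long wide = (unsigned long long)a * b; // multiply in the extended type
    *res = (unsigned)wide;
    return (wide >> 32) != 0;                            // lshr of the hi bits, then compare != 0
  }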
2021-10-06 19:11:32 +01:00
Simon Pilgrim 81b5da8c97 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion
Match the predicates used in TargetLowering::expandUADDSUBO
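
A minimal sketch of those predicates (plain C++ on unsigned 32-bit values
as an assumption; function names invented): the add overflowed iff the sum
is unsigned-less-than an operand, and the sub overflowed iff the RHS is
unsigned-greater-than the LHS.

  // Unsigned add/sub overflow checks in the ICMP_ULT / ICMP_UGT shape.
  #include <cassert>
  #include <cstdint>

  bool uaddOverflows(uint32_t A, uint32_t B) {
    return A + B < A; // sum ULT lhs => the addition wrapped
  }

  bool usubOverflows(uint32_t A, uint32_t B) {
    return B > A;     // rhs UGT lhs => the subtraction would wrap
  }

  int main() {
    assert(uaddOverflows(0xFFFFFFFFu, 1u));
    assert(!uaddOverflows(40u, 2u));
    assert(usubOverflows(1u, 2u));
    assert(!usubOverflows(2u, 1u));
    return 0;
  }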
2021-10-06 19:11:32 +01:00
Nikita Popov 1301a8b473 [BasicAA] Don't unnecessarily extend pointer size
BasicAA GEP decomposition currently performs all calculations on the
maximum pointer size, but at least 64-bit, with an option to double
the size. The code comment claims that this improves analysis power
when working with uint64_t indices on 32-bit systems. However, I don't
see how this can be, at least while maintaining correctness:

When working on canonical code, the GEP indices will have GEP index
size. If the original code worked on uint64_t with a 32-bit size_t,
then there will be truncs inserted before use as a GEP index. Linear
expression decomposition does not look through truncs, so this will
be an opaque value as far as GEP decomposition is concerned. Working
on a wider pointer size does not help here (or have any effect at all).
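
As a hypothetical source-level illustration (not taken from the patch or
its tests; names invented): with a uint64_t index on a 32-bit target,
canonicalization inserts a trunc to the 32-bit index width before the GEP,
and the decomposition stops at that trunc.

  // Hypothetical example: on an ILP32 target the 64-bit index below is
  // narrowed to the 32-bit GEP index width before being used, and BasicAA's
  // linear expression decomposition does not look through that trunc.
  #include <cstdint>

  char Buf[64];

  char *elementAt(uint64_t I) {
    return &Buf[I]; // index effectively truncated to the pointer index size
  }

  int main() {
    return *elementAt(3); // Buf is zero-initialized, so this returns 0
  }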

When working on non-canonical code (before first InstCombine), the
GEP indices are implicitly truncated to GEP index size. The BasicAA
code currently just ignores this fact completely, and pretends that
this truncation doesn't happen. This is incorrect and will be
addressed by D110977.

I believe that for correctness reasons, it is important to work on
the actual GEP index size to properly model potential overflow.
BasicAA tries to patch over the fact that it uses the wrong size
(see adjustToPointerSize), but it only does that in limited cases
(only for constant values, and not all of them either). I'd like to
move this code towards always working on the correct size, and
dropping these artificial pointer size adjustments is the first step
towards that.

Differential Revision: https://reviews.llvm.org/D110657
2021-10-06 18:40:21 +02:00
Simon Pilgrim 3dda247e18 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_EQ for generic funnel shift cost expansion
The comparison always checks for a zero value, so we know the icmp predicate will be ICMP_EQ.
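
A minimal sketch of the generic expansion's shape (plain C++, assuming a
32-bit fshl; the function name is invented): the shift amount is reduced
modulo the bit width, and the compare against zero that guards the
otherwise out-of-range complementary shift is the ICMP_EQ in question.

  // Generic fshl on i32: concatenate Hi:Lo, shift left by Amt (mod 32), and
  // keep the top 32 bits. Amt == 0 must be special-cased (the ICMP_EQ plus
  // select) because Lo >> (32 - 0) would be an out-of-range shift.
  #include <cassert>
  #include <cstdint>

  uint32_t fshl32(uint32_t Hi, uint32_t Lo, uint32_t Shamt) {
    uint32_t Amt = Shamt & 31u;              // shift amount modulo bit width
    if (Amt == 0)                            // the compare against zero
      return Hi;
    return (Hi << Amt) | (Lo >> (32 - Amt)); // SHL + LSHR + OR
  }

  int main() {
    assert(fshl32(0x000000FFu, 0x80000000u, 8) == 0x0000FF80u);
    assert(fshl32(0x12345678u, 0x9ABCDEF0u, 0) == 0x12345678u);
    assert(fshl32(0xDEADBEEFu, 0xDEADBEEFu, 4) == 0xEADBEEFDu); // rotl by 4
    return 0;
  }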
2021-10-06 16:39:16 +01:00
Clement Courbet ff41fc07b1 Revert "[AA] Teach BasicAA to recognize basic GEP range information."
We have found a miscompile with this change, reverting while working on a
reproducer.

This reverts commit 455b60ccfb.
2021-10-06 16:49:10 +02:00
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Philip Reames e64ed3c8df [test] autogen a couple of additional tests 2021-10-05 18:58:08 -07:00
Philip Reames c59c32caa0 [test] factor out reliance on noundef return value 2021-10-05 14:45:48 -07:00
Philip Reames 5020e104a1 [test] rework recently added SCEV tests
These are meant to check a future patch which recurses through the operands of SCEVs, but because all SCEVs are trivially bounded by function entry, we need to arrange for the trivial scope not to be valid (i.e. we specifically need a lower defining scope).
2021-10-05 14:42:53 -07:00
Philip Reames 94c1c56cc5 [tests] Cover cases we could infer SCEV flags, but don't 2021-10-05 13:16:16 -07:00
Roman Lebedev f92961d238
[NFC] Fixup newly-added costmodel tests to actually test what they should 2021-10-05 21:35:47 +03:00
Roman Lebedev 200edc152b
[NFC][X86][LV] Add basic costmodel test coverage for not-fully-interleaved i32 loads
The coverage could explode combinatorially here,
so I'm adding only the most basic cases,
and hoping it's enough, though more can be added if needed.
2021-10-05 19:39:50 +03:00
Roman Lebedev 3f9b235482
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `36`.

For store we have:
https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `30`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111094
2021-10-05 16:58:58 +03:00
Roman Lebedev e2784c5d8c
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `18`.

For store we have:
https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `15`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111093
2021-10-05 16:58:58 +03:00
Roman Lebedev 3960693048
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111092
2021-10-05 16:58:58 +03:00
Roman Lebedev 79d6d12d95
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs
This one required quite a bit of assembly surgery, but I think it's in the right ballpark.

The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

For store we have:
https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5`
So we could pick cost of `66`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111091
2021-10-05 16:58:58 +03:00
Roman Lebedev 2996a2b50f
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0`
So we could pick cost of `31`.

For store we have:
https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8`
So we could pick cost of `33`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111089
2021-10-05 16:58:57 +03:00
Roman Lebedev d51532d8aa
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8`
So we could pick cost of `15`.

For store we have:
https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0`
So we could pick cost of `12`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111087
2021-10-05 16:58:57 +03:00
Roman Lebedev 764fd5f463
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0`
So we could pick cost of `9`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111083
2021-10-05 16:58:57 +03:00
Roman Lebedev c800119c46
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick cost of `20`.

For store we have:
https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick cost of `20`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111076
2021-10-05 16:58:57 +03:00
Roman Lebedev 000ce0bfd5
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `8`.

For store we have:
https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111075
2021-10-05 16:58:57 +03:00
Roman Lebedev dcc2b0d933
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs
The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So we could pick cost of `6`.

For store we have:
https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `6`.

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111073
2021-10-05 16:58:57 +03:00