llvm-project

Commit Graph

Author	SHA1	Message	Date
Simon Pilgrim	c850d5c5c8	[X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs Differential Revision: https://reviews.llvm.org/D111941	2021-10-18 13:15:14 +01:00
Simon Pilgrim	dc3382dc2c	[CostModel][X86] Add mul by positive/negative power-of-2 constants tests We have backend optimizations for these, but currently the costmodel doesn't match them	2021-10-17 20:34:17 +01:00
Simon Pilgrim	dbf5dc8930	[CostModel][X86] Add div/rem by negative power-of-2 constants We have backend optimizations for these (like we do for power-of-2 divisions), but currently the costmodel doesn't match them	2021-10-17 18:51:15 +01:00
Roman Lebedev	91373bf12e	[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So could pick cost of `40` For store we have: https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0` So we could pick cost of `40`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111945	2021-10-17 17:28:10 +03:00
Roman Lebedev	3274ce3a28	[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So could pick cost of `32` For store we have: https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111944	2021-10-17 17:28:10 +03:00
Roman Lebedev	3a6a9f74d3	[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0` So could pick cost of `68` For store we have: https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111943	2021-10-17 17:28:09 +03:00
Roman Lebedev	4b76a74b42	[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32` For store we have: https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0` So we could pick cost of `48`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111942	2021-10-17 17:28:09 +03:00
Roman Lebedev	887acf6842	[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs A few more tuples are being queried after D111546. Might be good to model them, They all require a lot of manual assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0` So could pick cost of `212` For store we have: https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0` So we could pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111940	2021-10-17 17:28:09 +03:00
Simon Pilgrim	85b87179f4	[TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938	2021-10-16 17:28:07 +01:00
Simon Pilgrim	6ec644e215	[TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64. It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win. Fixes PR47437 Differential Revision: https://reviews.llvm.org/D111938	2021-10-16 16:21:45 +01:00
Roman Lebedev	d137f1288e	[X86][LV] X86 does not prefer vectorized addressing And another attempt to start untangling this ball of threads around gather. There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`, which tells LV to try to vectorize the addresses that lead to loads, but X86 generally can not deal with vectors of addresses, the only instructions that support that are GATHER/SCATTER, but even those aren't available until AVX2, and aren't really usable until AVX512. This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111546	2021-10-16 12:32:18 +03:00
Roman Lebedev	3d7bf6625a	[X86][Costmodel] Improve cost modelling for not-fully-interleaved load While i've modelled most of the relevant tuples for AVX2, that only covered fully-interleaved groups. By definition, interleaving load of stride N means: load N*VF elements, and shuffle them into N VF-sized vectors, with 0'th vector containing elements `[0, VF)*stride + 0`, and 1'th vector containing elements `[0, VF)*stride + 1`. Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6) Now, not fully interleaved load, is when not all of these vectors are demanded. So at worst, we could just pretend that everything is demanded, and discard the non-demanded vectors. What this means is that the cost for not-fully-interleaved group should be not greater than the cost for the same fully-interleaved group, but perhaps somewhat less. Examples: https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4) https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2) https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1) As we have established over the course of last ~70 patches, (wow) `BaseT::getInterleavedMemoryOpCost()` is absolutely bogus, it is usually almost an order of magnitude overestimation, so i would claim that we should at least use the hardcoded costs of fully interleaved load groups. We could go further and adjust them e.g. by the number of demanded indices, but then i'm somewhat fearful of underestimating the cost. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111174	2021-10-14 23:14:36 +03:00
Nikita Popov	5f05ff081f	[BasicAA] Improve scalable vector handling Currently, DecomposeGEP() bails out on the whole decomposition if it encounters a scalable GEP type anywhere. However, it is fine to still analyze other GEPs that we look through before hitting the scalable GEP. This does mean that the decomposed GEP base is no longer required to be the same as the underlying object. However, I don't believe this property is necessary for correctness anymore. This allows us to compute slightly more precise aliasing results for GEP chains containing scalable vectors, though my primary interest here is simplifying the code. Differential Revision: https://reviews.llvm.org/D110511	2021-10-14 20:23:50 +02:00
Simon Pilgrim	77dcdc2f50	[CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32 Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))	2021-10-14 12:17:40 +01:00
Roman Lebedev	cb41efb5f4	[NFC][Costmodel][X86] Fix broken `CHECK-NOT`'s in interleave costmodel tests	2021-10-13 22:44:57 +03:00
Roman Lebedev	18eef13dad	[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()` `X86TTIImpl::getGSScalarCost()` has (at least) two issues: * it naively computes the cost of sequence of `insertelement`/`extractelement`. If we are operating not on the XMM (but YMM/ZMM), this widely overestimates the cost of subvector insertions/extractions. * Gather/scatter takes a vector of pointers, and scalarization results in us performing scalar memory operation for each of these pointers, but we never account for the cost of extracting these pointers out of the vector of pointers. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111222	2021-10-13 22:35:39 +03:00
Florian Hahn	4cd6cc64ed	[SCEV] Add test for propagating poison through select condition. Precommit a test for D111643.	2021-10-13 17:14:35 +01:00
Arthur Eubanks	259390de9a	[LCG] Don't skip invalidation of LazyCallGraph if CFG analyses are preserved The CFG being changed and the overall call graph are not related, we can introduce/remove calls without changing the CFG. Resolves one of the issues in PR51946. Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D111275	2021-10-11 13:30:47 -07:00
Clement Courbet	342d7b654c	[BasicAA][NFC] Improve comment.	2021-10-11 10:42:59 +02:00
Clement Courbet	83ded5d323	re-land "[AA] Teach BasicAA to recognize basic GEP range information." Now that PR52104 is fixed.	2021-10-11 10:04:22 +02:00
David Green	adec922361	[AArch64] Make -mcpu=generic schedule for an in-order core We would like to start pushing -mcpu=generic towards enabling the set of features that improves performance for some CPUs, without hurting any others. A blend of the performance options hopefully beneficial to all CPUs. The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change from `eecb353d0e` which made -mcpu=generic perform in-order scheduling using the cortex-a8 schedule model. The idea is that in-order cpu's require the most help in instruction scheduling, whereas out-of-order cpus can for the most part out-of-order schedule around different codegen. Our benchmarking suggests that hypothesis holds. When running on an in-order core this improved performance by 3.8% geomean on a set of DSP workloads, 2% geomean on some other embedded benchmark and between 1% and 1.8% on a set of singlecore and multicore workloads, all running on a Cortex-A55 cluster. On an out-of-order cpu the results are a lot more noisy but show flat performance or an improvement. On the set of DSP and embedded benchmarks, run on a Cortex-A78 there was a very noisy 1% speed improvement. Using the most detailed results I could find, SPEC2006 runs on a Neoverse N1 show a small increase in instruction count (+0.127%), but a decrease in cycle counts (-0.155%, on average). The instruction count is very low noise, the cycle count is more noisy with a 0.15% decrease not being significant. SPEC2k17 shows a small decrease (-0.2%) in instruction count leading to a -0.296% decrease in cycle count. These results are within noise margins but tend to show a small improvement in general. When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so should not be affected by this change when running from clang. This also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does, only changing the schedule used. 
A lot of existing tests have been updated. This is a summary of the important differences: - Most changes are the same instructions in a different order. - Sometimes this leads to very minor inefficiencies, such as requiring an extra mov to move variables into r0/v0 for the return value of a test function. - misched-fusion.ll was no longer fusing the pairs of instructions it should, as per D110561. I've changed the schedule used in the test for now. - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to the different latencies. This seems fine to me. - Some SVE tests do not always remove movprfx where they did before due to different register allocation giving different destructive forms. - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll produce two LDR where they previously produced an LDP due to store-pair-suppress kicking in. - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP. - Some tests such as arm64-neon-mul-div.ll and ragreedy-local-interval-cost.ll have more, less or just different spilling. - In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly, if I switch this to use any other schedule, even less is outlined. Some of these are expected to happen, such as differences in outlining or register spilling. There will be places where these result in worse codegen, places where they are better, with the SPEC instruction counts suggesting it is not a decrease overall, on average. Differential Revision: https://reviews.llvm.org/D110830	2021-10-09 15:58:31 +01:00
Simon Pilgrim	b6426d5211	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT/UGT for generic abs/min/max cost expansion Split off ABS cost handling from MIN/MAX and use explicit predicates for each. Our generic expansion of ABS doesn't use NEG+CMP+SELECT any more (it's now ASHR+ADD+XOR) so this needs to be updated.	2021-10-08 12:41:58 +01:00
Simon Pilgrim	716883736b	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_SGT for generic sadd/ssub sat cost expansion The comparison always checks for negative values so we know the icmp predicate will be ICMP_SGT	2021-10-07 15:42:45 +01:00
Philip Reames	1183d65b4d	[SCEV] Search operand tree for scope bound when inferring flags from IR When checking to see if we can apply IR flags to a SCEV, we need to identify a bound on the defining scope of the SCEV to be produced. We'd previously added support for a couple SCEVExpr types which trivially imply bounds, but hadn't handled types such as umax where the bounds come from the bounds of the operands. This does the obvious thing, and recurses through operands searching for a tighter bound on the defining scope. I'm honestly surprised by how little this seems to matter on existing tests, but it's worth doing for completeness' sake alone. Differential Revision: https://reviews.llvm.org/D111191	2021-10-06 15:10:02 -07:00
Philip Reames	2b3d913cc5	[tests] precommit test changes for D111191	2021-10-06 12:12:49 -07:00
Philip Reames	67896f494e	Returning poison from a function w/ noundef return attribute is UB This does for readability of returns within said function as what we do for the caller side when reasoning about what might be poison. Differential Revision: https://reviews.llvm.org/D111180	2021-10-06 11:52:18 -07:00
Philip Reames	0658bab870	[SCEV] Infer flags from add/gep in any block This patch removes a compile time restriction from isSCEVExprNeverPoison. We've strengthened our ability to reason about flags on scopes other than addrecs, and this bailout prevents us from using it. The comment is also suspect as well in that we're in the middle of constructing a SCEV for I. As such, we're going to visit all operands anyways. Differential Revision: https://reviews.llvm.org/D111186	2021-10-06 11:11:54 -07:00
Simon Pilgrim	2ced9a42be	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion Match the predicate used in TargetLowering::expandMULO to detect overflow	2021-10-06 19:11:33 +01:00
Simon Pilgrim	7bd097fd1e	[CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion Fix copy+pasta that was checking for smul_fix instead of smul_with_overflow to detect signed values. The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply. More closely matches the default expansion from TargetLowering::expandMULO	2021-10-06 19:11:32 +01:00
Simon Pilgrim	81b5da8c97	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion Match the predicates used in TargetLowering::expandUADDSUBO	2021-10-06 19:11:32 +01:00
Nikita Popov	1301a8b473	[BasicAA] Don't unnecessarily extend pointer size BasicAA GEP decomposition currently performs all calculation on the maximum pointer size, but at least 64-bit, with an option to double the size. The code comment claims that this improves analysis power when working with uint64_t indices on 32-bit systems. However, I don't see how this can be, at least while maintaining correctness: When working on canonical code, the GEP indices will have GEP index size. If the original code worked on uint64_t with a 32-bit size_t, then there will be truncs inserted before use as a GEP index. Linear expression decomposition does not look through truncs, so this will be an opaque value as far as GEP decomposition is concerned. Working on a wider pointer size does not help here (or have any effect at all). When working on non-canonical code (before first InstCombine), the GEP indices are implicitly truncated to GEP index size. The BasicAA code currently just ignores this fact completely, and pretends that this truncation doesn't happen. This is incorrect and will be addressed by D110977. I believe that for correctness reasons, it is important to work on the actual GEP index size to properly model potential overflow. BasicAA tries to patch over the fact that it uses the wrong size (see adjustToPointerSize), but it only does that in limited cases (only for constant values, and not all of them either). I'd like to move this code towards always working on the correct size, and dropping these artificial pointer size adjustments is the first step towards that. Differential Revision: https://reviews.llvm.org/D110657	2021-10-06 18:40:21 +02:00
Simon Pilgrim	3dda247e18	[CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_EQ for generic funnel shift cost expansion The comparison always checks for zero value so we know the icmp predicate will be ICMP_EQ	2021-10-06 16:39:16 +01:00
Clement Courbet	ff41fc07b1	Revert "[AA] Teach BasicAA to recognize basic GEP range information." We have found a miscompile with this change, reverting while working on a reproducer. This reverts commit `455b60ccfb`.	2021-10-06 16:49:10 +02:00
Simon Pilgrim	0776924a17	[CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost. These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.	2021-10-06 10:14:56 +01:00
Philip Reames	e64ed3c8df	[test] autogen a couple of additional tests	2021-10-05 18:58:08 -07:00
Philip Reames	c59c32caa0	[test] factor out reliance on noundef return value	2021-10-05 14:45:48 -07:00
Philip Reames	5020e104a1	[test] rework recently added SCEV tests These are meant to check a future patch which recurses through operands of SCEVs, but because all SCEVs are trivially bounded by function entry, we need to arrange the trivial scope not to be valid. (i.e. we specifically need a lower defining scope)	2021-10-05 14:42:53 -07:00
Philip Reames	94c1c56cc5	[tests] Cover cases we could infer SCEV flags, but don't	2021-10-05 13:16:16 -07:00
Roman Lebedev	f92961d238	[NFC] Fixup newly-added costmodel tests to actually test what they should	2021-10-05 21:35:47 +03:00
Roman Lebedev	200edc152b	[NFC][X86][LV] Add basic costmodel test coverage for not-fully-interleaved i32 loads The coverage could have cumulative explosion here, so i'm adding only the most basic cases, and hoping it's enough, though more can be added if needed.	2021-10-05 19:39:50 +03:00
Roman Lebedev	3f9b235482	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0` So could pick cost of `36` For store we have: https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0` So we could pick cost of `30`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111094	2021-10-05 16:58:58 +03:00
Roman Lebedev	e2784c5d8c	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0` So could pick cost of `18`. For store we have: https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0` So we could pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111093	2021-10-05 16:58:58 +03:00
Roman Lebedev	3960693048	[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111092	2021-10-05 16:58:58 +03:00
Roman Lebedev	79d6d12d95	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs This one required quite a bit of an assembly surgery, but i think it's in the right ballpark.. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0` So could pick cost of `64`. For store we have: https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5` So we could pick cost of `66`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111091	2021-10-05 16:58:58 +03:00
Roman Lebedev	2996a2b50f	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0` So could pick cost of `31`. For store we have: https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8` So we could pick cost of `33`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111089	2021-10-05 16:58:57 +03:00
Roman Lebedev	d51532d8aa	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8` So could pick cost of `15`. For store we have: https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0` So we could pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111087	2021-10-05 16:58:57 +03:00
Roman Lebedev	764fd5f463	[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3` So could pick cost of `6`. For store we have: https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0` So we could pick cost of `9`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111083	2021-10-05 16:58:57 +03:00
Roman Lebedev	c800119c46	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So could pick cost of `20`. For store we have: https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0` So we could pick cost of `20`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111076	2021-10-05 16:58:57 +03:00
Roman Lebedev	000ce0bfd5	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111075	2021-10-05 16:58:57 +03:00
Roman Lebedev	dcc2b0d933	[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So could pick cost of `6`. For store we have: https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111073	2021-10-05 16:58:57 +03:00
Roman Lebedev	7d91037fd2	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=16 interleaving costs This one required quite a bit of assembly surgery, but the trend continues, so i think this is right. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/EKWdj8cKT - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0` So could pick cost of `32`. For store we have: https://godbolt.org/z/zj4bb9P75 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0` So we could pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111064	2021-10-05 16:58:57 +03:00
Roman Lebedev	4aee1e5b93	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/a6rxMG6ec - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=12.0` So could pick cost of `16`. For store we have: https://godbolt.org/z/ced1bdqc9 - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0` So we could pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111063	2021-10-05 16:58:57 +03:00
Roman Lebedev	3c2e22b795	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/avq1oz98W - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0` So could pick cost of `8`. For store we have: https://godbolt.org/z/89PGMc1qs - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=6.0` So we could pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111061	2021-10-05 16:58:57 +03:00
Roman Lebedev	b6234c1edf	[X86][Costmodel] Load/store i32/f32 Stride=4 VF=2 interleaving costs Finally, we are getting to the heavy-hitter stuff! The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/7crGWoar6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So could pick cost of `4`. For store we have: https://godbolt.org/z/T8aq3MszM - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.0` So we could pick cost of `5`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111060	2021-10-05 16:58:56 +03:00
Nikita Popov	30001af84e	[BasicAA] Ignore CanBeFreed in minimal extent reasoning When determining NoAlias based on object size and dereferenceability information, we can ignore frees for the same reason we can ignore possible null pointers (if null is not a valid pointer): Actually accessing the null pointer / freed pointer would be immediate UB, and AA results are only valid under the assumption of an access. This addresses a minor regression from D110745. Differential Revision: https://reviews.llvm.org/D111028	2021-10-04 22:08:57 +02:00
Roman Lebedev	dee4d699b2	[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=6	2021-10-04 20:57:35 +03:00
Roman Lebedev	c4dd0fe4b3	[NFC][X86][LV] Add costmodel test coverage for interleaved i32/f32 load/store stride=6	2021-10-04 20:57:35 +03:00
Roman Lebedev	b8c7d5229c	[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=4	2021-10-04 17:31:57 +03:00
Roman Lebedev	f38cbd7859	[NFC][X86][LV] Add costmodel test coverage for interleaved i32/f32 load/store stride=4	2021-10-04 17:31:57 +03:00
Roman Lebedev	cef0a693b6	[X86][Costmodel] Load/store i64/f64 Stride=3 VF=16 interleaving costs This required a huge amount of assembly surgery, but I think this is about right. The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/z11crMEcj - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: <=18.0` So could pick cost of `20`. For store we have: https://godbolt.org/z/eqT4ze3j4 - for intels `Block RThroughput: =24.0`; for ryzens, `Block RThroughput: <=16.0` So we could pick cost of `24`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111031	2021-10-04 14:35:17 +03:00
Roman Lebedev	ede0611e79	[X86][Costmodel] Load/store i64/f64 Stride=3 VF=8 interleaving costs This one required quite a bit of assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/oYWv4cTnK - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `10`. For store we have: https://godbolt.org/z/33GMhrsG9 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111027	2021-10-04 14:35:01 +03:00
Roman Lebedev	eb9a694c17	[X86][Costmodel] Load/store i64/f64 Stride=3 VF=4 interleaving costs This one required quite a bit of assembly surgery. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Tce3osvcz - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `5`. For store we have: https://godbolt.org/z/oc3arEcnE - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111026	2021-10-04 14:34:47 +03:00
Roman Lebedev	d3bbe781ea	[X86][Costmodel] Load/store i64/f64 Stride=3 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/sz5qdKnr4 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `1`. For store we have: https://godbolt.org/z/Kzdjff63v - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111025	2021-10-04 14:34:33 +03:00
Roman Lebedev	4ca5bc07af	[X86][Costmodel] Load/store i32/f32 Stride=3 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/5fqrh4qqo - for intels `Block RThroughput: =14.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `14`. For store we have: https://godbolt.org/z/5fqrh4qqo - for intels `Block RThroughput: =22.0`; for ryzens, `Block RThroughput: <=16.0` So pick cost of `22`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111022	2021-10-04 14:34:19 +03:00
Roman Lebedev	198aa84973	[X86][Costmodel] Load/store i32/f32 Stride=3 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/zdz5Ga6fs - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `7`. For store we have: https://godbolt.org/z/qn71513ac - for intels `Block RThroughput: =11.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `11`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111021	2021-10-04 14:34:05 +03:00
Roman Lebedev	a93411c3af	[X86][Costmodel] Load/store i32/f32 Stride=3 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/d8PdhEszo - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `3`. For store we have: https://godbolt.org/z/WojonfG5n - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `5`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111020	2021-10-04 14:34:03 +03:00
Roman Lebedev	3e93fcdfc8	[X86][Costmodel] Load/store i32/f32 Stride=3 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/z8qa14bs3 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: =1.5` So pick cost of `3`. For store we have: https://godbolt.org/z/GYGajoc4K - for intels `Block RThroughput: <=4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111019	2021-10-04 14:31:50 +03:00
Philip Reames	35ab211c37	[SCEV] Use trivial bound on defining scope of all SCEVs when computing flags This addresses a comment from review on D109845. Even for SCEVs for which we can't find true bounds without recursing through operands, entry to the function forms a trivial upper bound. In some cases, this trivial bound is enough to prove safety of flag inference.	2021-10-03 16:01:30 -07:00
Philip Reames	d02db32644	[SCEV] Use full logic when inferring flags on add and gep This is a follow-on to D109845. With that landed, we will have fixed all known instances of pr51817, and can thus start inferring flags more aggressively with greatly reduced risk of miscompiles. This patch simply applies the same inference logic used in that patch to our other major flag inference path. We can still do much better here (on both paths), but this is our first step. Differential Revision: https://reviews.llvm.org/D111003	2021-10-03 15:32:15 -07:00
Philip Reames	f39978b84f	[SCEV] Correctly propagate nowrap flags across scopes when folding invariant add through addrec This fixes a violation of the wrap flag rules introduced in `c4048d8f`. This is an alternate fix to D106852. The basic problem being fixed is that we infer a set of flags which is valid at some inner scope S1 (usually by correctly propagating them from IR), and then (incorrectly) extend them to a SCEV in scope S2 where S1 != S2. This is not in general safe per the wrap flags semantics recently defined. In this patch, I include a simple inference step to handle the case where we can prove that S2 is the preheader of the loop S1, and that entry into S2 implies execution of S1. See the code for a more detailed explanation. One worry I have with this patch is that I might be over-fitting what shows up in tests - and thus hiding negative impact we'd see in the real world. My best defense is that the rule used here very closely follows the one used to propagate the flags from IR to the inner add to start with, and thus if one is reasonable, so probably is the other. Curious what others think about that piece. The test diffs are roughly as expected. Mostly analysis only, with two transform changes. Oddly, the result looks better in the loop-idiom test, and I don't understand the PPC output enough to tell. Nothing terrible looking though. (For context, without the scope inference peephole, the test delta includes a couple of vectorization tests. Again, not super concerning, but slightly more so.) Differential Revision: https://reviews.llvm.org/D109845	2021-10-03 15:19:33 -07:00
Roman Lebedev	67f1ee2e38	[X86][Costmodel] Load/store i16 Stride=3 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rMaYr67hz - for intels `Block RThroughput: =56.0`; for ryzens, `Block RThroughput: <=17.8` So pick cost of `56`. For store we have: https://godbolt.org/z/eMsbKqnvv - for intels `Block RThroughput: <=54.0`; for ryzens, `Block RThroughput: <=15.0` So pick cost of `54`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111018	2021-10-03 23:40:35 +03:00
Roman Lebedev	3cbc0a07f9	[X86][Costmodel] Load/store i16 Stride=3 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1T6MMzeh3 - for intels `Block RThroughput: =28.0`; for ryzens, `Block RThroughput: <=8.5` So pick cost of `28`. For store we have: https://godbolt.org/z/1T6MMzeh3 - for intels `Block RThroughput: <=27.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `27`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111017	2021-10-03 23:40:21 +03:00
Roman Lebedev	72f8a9244a	[X86][Costmodel] Load/store i16 Stride=3 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Mh9MnnT8W - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=2.3` So pick cost of `9`. For store we have: https://godbolt.org/z/Mh9MnnT8W - for intels `Block RThroughput: <=12.0`; for ryzens, `Block RThroughput: <=3.3` So pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111016	2021-10-03 23:40:05 +03:00
Roman Lebedev	04f1469cb4	[X86][Costmodel] Load/store i16 Stride=3 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/sP4j1173f - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `7`. For store we have: https://godbolt.org/z/sP4j1173f - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111015	2021-10-03 23:39:51 +03:00
Roman Lebedev	8e8fb77aa4	[X86][Costmodel] Load/store i16 Stride=3 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/xnE988aej - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.5` So pick cost of `5`. For store we have: https://godbolt.org/z/rMGT31Tnh - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111014	2021-10-03 23:39:36 +03:00
Roman Lebedev	a5e5883ef5	[X86][Costmodel] Load/store i8 Stride=6 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/c1jjKqP7b - for intels `Block RThroughput: <=82.0`; for ryzens, `Block RThroughput: <=26.0` So pick cost of `82`. For store we have: https://godbolt.org/z/YM4ErY8x7 - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=25.5` So pick cost of `90`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111013	2021-10-03 23:39:22 +03:00
Roman Lebedev	bd5ba437fd	[X86][Costmodel] Load/store i8 Stride=6 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Gz8hhqfTM - for intels `Block RThroughput: <=43.0`; for ryzens, `Block RThroughput: <=14.0` So pick cost of `43`. For store we have: https://godbolt.org/z/9vrdssYa8 - for intels `Block RThroughput: <=27.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `27`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111012	2021-10-03 23:39:08 +03:00
Roman Lebedev	0b27f9c088	[X86][Costmodel] Load/store i8 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/v98qPTTf6 - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0` So pick cost of `18`. For store we have: https://godbolt.org/z/rn5T9E8q6 - for intels `Block RThroughput: <=16.0`; for ryzens, `Block RThroughput: <=4.5` So pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111011	2021-10-03 23:38:54 +03:00
Roman Lebedev	6fe4cce558	[X86][Costmodel] Load/store i8 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/4sWhs396o - for intels `Block RThroughput: =14.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `14`. For store we have: https://godbolt.org/z/4sWhs396o - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `9`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111010	2021-10-03 23:38:40 +03:00
Roman Lebedev	396b95e5c9	[X86][Costmodel] Load/store i8 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/jvj6jzns5 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `6`. For store we have: https://godbolt.org/z/ros7eebMP - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `7`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111008	2021-10-03 23:38:10 +03:00
Roman Lebedev	025ce15435	[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=3	2021-10-03 17:52:11 +03:00
Roman Lebedev	f3c6c76cfd	[NFC][X86][LV] Add costmodel test coverage for interleaved i32/f32 load/store stride=3	2021-10-03 16:49:51 +03:00
Roman Lebedev	e311cdd18d	[NFC][X86][LV] Add costmodel test coverage for interleaved i8 load/store stride=6	2021-10-03 14:33:59 +03:00
Roman Lebedev	acb459574a	[X86][Costmodel] Load/store i8 Stride=4 VF=32 interleaving costs While we already model this tuple, the load cost is divergent from reality, so fix it. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/zWMhhnPYa - for intels `Block RThroughput: =56.0`; for ryzens, `Block RThroughput: <=24.0` So pick cost of `56`. For store we have: https://godbolt.org/z/vnqqjWx51 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `12`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110971	2021-10-02 13:40:21 +03:00
Roman Lebedev	0e71ae6da8	[X86][Costmodel] Load/store i8 Stride=4 VF=16 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/TrGW7cKsE - for intels `Block RThroughput: =24.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `24`. For store we have: https://godbolt.org/z/Mh7qaqEfe - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110970	2021-10-02 13:40:21 +03:00
Roman Lebedev	74e4a0e327	[X86][Costmodel] Load/store i8 Stride=4 VF=8 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/v7746Wcf7 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `12`. For store we have: https://godbolt.org/z/aEeEohEbP - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110969	2021-10-02 13:40:20 +03:00
Roman Lebedev	ae08362cb8	[X86][Costmodel] Load/store i8 Stride=4 VF=4 interleaving costs While we already model this tuple, the store cost is divergent from reality, so fix it. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1n4bPh7Tn - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/r8K9sveqo - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110968	2021-10-02 13:40:20 +03:00
Roman Lebedev	935b9693ae	[X86][Costmodel] Load/store i8 Stride=4 VF=2 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/KP6nn36zs - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/ov95zhrq6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110966	2021-10-02 13:40:20 +03:00
Roman Lebedev	448c939839	[X86][Costmodel] Load/store i8 Stride=3 VF=32 interleaving costs For VF=16, costs are correct. For VF=32, load cost is divergent. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/qKjevqf4W - for intels `Block RThroughput: <=14.0`; for ryzens, `Block RThroughput: <=4.5` So pick cost of `14`. For store we have: https://godbolt.org/z/xTssTq319 - for intels `Block RThroughput: =13.0`; for ryzens, `Block RThroughput: <=5.5` So pick cost of `13`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110961	2021-10-02 13:39:15 +03:00
Roman Lebedev	d1460c88a6	[X86][Costmodel] Load/store i8 Stride=3 VF=8 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1jeocxj55 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `6`. For store we have: https://godbolt.org/z/fr7xfa3K5 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110960	2021-10-02 13:39:15 +03:00
Roman Lebedev	f1df2d8eaf	[X86][Costmodel] Load/store i8 Stride=3 VF=4 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/obWz3PrfK - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=1.5` So pick cost of `3`. For store we have: https://godbolt.org/z/orjPshn3h - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110958	2021-10-02 13:39:10 +03:00
Roman Lebedev	8a3c64c3a2	[X86][Costmodel] Load/store i8 Stride=3 VF=2 interleaving costs While we already model this tuple, the values are divergent from reality, so fix them. The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/WYscYMcW4 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=1.5` So pick cost of `3`. For store we have: https://godbolt.org/z/e9qvYdbbs - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110956	2021-10-02 13:39:05 +03:00
Philip Reames	91dfc0840d	[test] add coverage for a SCEVUnknown scoped value in isSCEVExprNeverPoison Note that a couple of the "negative" tests also end up showing miscompiles due to D109845 which is not yet fixed.	2021-10-01 16:39:23 -07:00
Philip Reames	2ca8a3f213	[SCEV] Stop blindly propagating flags from inbound geps to SCEV nodes This fixes a violation of the wrap flag rules introduced in `c4048d8f`. This was also noted in the (very old) PR23527. The issue being fixed is that we assume the inbounds flag on any GEP implies that all users of any gep (or add) which happens to map to that SCEV would also be UB if the (other) gep overflowed. That's simply not true. In terms of the test diffs, I don't see anything seriously problematic. The lost flags are expected (given the semantic restriction on when it's legal to tag the SCEV), and there are several cases where the previously inferred flags are unsound per the new semantics. The only common trend I noticed when looking at the deltas is that by not considering branch on poison as immediate UB in ValueTracking, we do miss a few cases we could reclaim. We may be able to claw some of these back with the follow-up ideas mentioned in PR51817. It's worth noting that most of the changes are analysis result only changes. The two transform changes are pretty minimal. In one case, we miss the opportunity to infer a nuw (correctly). In the other, we fail to fold an exit and produce a loop invariant form instead. This one is probably over-reduced as the program appears to be undefined in practice, and neither before nor after exploits that. Differential Revision: https://reviews.llvm.org/D109789	2021-10-01 16:30:44 -07:00
Philip Reames	24cde2f602	[SCEV] Remove invariant requirement from isSCEVExprNeverPoison This code is attempting to prove that I must execute if we enter the defining scope of the SCEV which will be created from I. In the case where it found a defining addrec scope, it had a rather odd restriction that all of the other operands must be loop invariant in that addrec's loop. As near as I can tell here, we really only need an upper bound on the defining scope. If we can prove the stronger property, then we must also have proven the property on the exact defining scope as well. In practice, the actual effect of this change is narrow. The compile time restriction at the top of the routine basically limits us to I being an arithmetic instruction in some loop L with both an addrec operand in L and an unknown operand in L. Possible to demonstrate, but the main value of the change is removing unneeded code. Differential Revision: https://reviews.llvm.org/D110892	2021-10-01 15:57:37 -07:00
Philip Reames	d0bca006bb	[test] split flags-from-poison.ll to allow ease of autogen update	2021-10-01 15:35:09 -07:00
Nikita Popov	b084b98abe	[BasicAA] Make test more robust (NFC) When taking into account the fact that GEP indices are truncated to 32-bits in this test, the "path dependence" goes away, so inferring MustAlias for all pointers would be correct. As this goes against the spirit of the test, change it to extend from i16 instead.	2021-10-01 22:57:01 +02:00
Nikita Popov	b7ff048915	[BasicAA] Add additional truncation tests (NFC) These show that the known bits and non-zero heuristics are incorrect when truncation is involved.	2021-10-01 22:57:01 +02:00
Roman Lebedev	53d7bdbfbf	[NFC][X86][LV] Improve costmodel test coverage for interleaved i8 load/store stride=4	2021-10-01 22:49:06 +03:00
Nikita Popov	04a6f80e9b	[BasicAA] Add additional 32-bit truncation test (NFC) This is a variant with a variable index, in which case the pointer size adjustment is not performed.	2021-10-01 21:20:59 +02:00
Roman Lebedev	727a359979	[NFC][X86][LV] Improve costmodel test coverage for interleaved i8 load/store stride=3	2021-10-01 18:47:25 +03:00
Roman Lebedev	3e260efdfc	[X86][Costmodel] Load/store i64/f64 Stride=2 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1WMTojvfW - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `16`. For store we have: https://godbolt.org/z/1WMTojvfW - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=16.0` So pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110840	2021-10-01 17:48:14 +03:00
Roman Lebedev	abd37de63e	[X86][Costmodel] Load/store i64/f64 Stride=2 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/PGYbYKPq8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `8`. For store we have: https://godbolt.org/z/PGYbYKPq8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110838	2021-10-01 17:48:14 +03:00
Roman Lebedev	71bc31b907	[X86][Costmodel] Load/store i64/f64 Stride=2 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/j5co1qWEW - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/j5co1qWEW - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110837	2021-10-01 17:48:14 +03:00
Roman Lebedev	612e5b05a2	[X86][Costmodel] Load/store i64/f64 Stride=2 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/8a1cfGeMn - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/jMdcM47bx - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `2`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110835	2021-10-01 17:48:14 +03:00
Roman Lebedev	ea76cb87ee	[X86][Costmodel] Load/store i32/f32 Stride=2 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 Here for `store` pattern we are starting to have spilling, so accurate modelling may be problematic, although if i drop the spilling, the measurements don't change. For load we have: https://godbolt.org/z/1oTTnncbx - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0` So pick cost of `16`. For store we have: https://godbolt.org/z/1oTTnncbx - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: =8.0` So pick cost of `16`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110761	2021-10-01 17:48:14 +03:00
Roman Lebedev	80cd8da78d	[X86][Costmodel] Load/store i32/f32 Stride=2 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/M9eev3xe8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `8`. For store we have: https://godbolt.org/z/M9eev3xe8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0` So pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110756	2021-10-01 17:48:14 +03:00
Roman Lebedev	3a0643e9c2	[X86][Costmodel] Load/store i32/f32 Stride=2 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110755	2021-10-01 17:48:13 +03:00
Roman Lebedev	b12aeaec9a	[X86][Costmodel] Load/store i32/f32 Stride=2 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `2`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110754	2021-10-01 17:48:13 +03:00
Roman Lebedev	f44d9009c2	[X86][Costmodel] Load/store i32/f32 Stride=2 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/4rY96hnGT - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/vbo37Y3r9 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: =0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110753	2021-10-01 17:48:13 +03:00
Florian Hahn	413b7ac6b5	[BasicAA] Add test showing 32 bit overflow issue for GEPs. This patch adds additional tests with i64 GEP indices for 32 bit pointers. @mustalias_overflow_in_32_bit_add_mul_gep highlights a case where BasicAA currently incorrectly determines noalias. Modeled in Alive2 for 32 bit pointers: https://alive2.llvm.org/ce/z/HHjQgb Modeled in Alive2 for 64 bit pointers: https://alive2.llvm.org/ce/z/DoWK2c	2021-10-01 11:37:56 +01:00
Philip Reames	bdb5aa65b1	[test] Add tests covering a missing opt in SCEV's isSCEVExprNeverPoison	2021-09-30 16:15:06 -07:00
Florian Hahn	1fbdbb5595	Revert "Recommit "[SCEV] Look through single value PHIs." (take 2)" This reverts commit `764d9aa979`. This patch exposed a few additional cases where SCEV expressions are not properly invalidated. See PR52024, PR52023.	2021-09-30 20:53:51 +01:00
Craig Topper	765348298c	[CostModel] Update default cost model for sadd/ssub overflow to match TargetLowering The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted. I believe the cost model was also incorrect for the old expansion. The expansion prior to D47927 used 3 icmps using LHS, RHS, and Result to calculate their signs, then 2 icmps to compare the signs, followed by an And. The previous cost model was using 3 icmps and 2 selects. Digging back through git blame, those 2 selects in the cost model used to be 2 icmps, but were changed in https://reviews.llvm.org/D90681 Differential Revision: https://reviews.llvm.org/D110739	2021-09-30 09:41:14 -07:00
Daniil Fukalov	cf362ff4ca	[NFC][AMDGPU] Improve cost model tests coverage.	2021-09-30 18:13:17 +03:00
Roman Lebedev	6be397eb35	[NFC][X86][LV] Add costmodel test coverage for interleaved i64/f64 load/store stride=2	2021-09-30 17:31:18 +03:00
Roman Lebedev	6776bcfeb6	[NFC][Costmodel][LV][X86] Add test coverage for f32 interleaved load/store stride=2	2021-09-30 14:29:35 +03:00
Clement Courbet	455b60ccfb	[AA] Teach BasicAA to recognize basic GEP range information. The information can be implicit (from `ValueTracking`) or explicit. This implements the backend part of the following RFC https://groups.google.com/g/llvm-dev/c/T9o51zB1JY. We still need to settle on how to best represent the information in the IR, but this is a separate discussion. Differential Revision: https://reviews.llvm.org/D109746	2021-09-30 08:29:32 +02:00
Roman Lebedev	52912fe7ae	[NFC][X86][LV] Add costmodel test coverage for interleaved i32 load/store stride=2	2021-09-29 22:16:59 +03:00
Daniil Fukalov	6a187f9a57	[NFC][AMDGPU] Add missing gfx90a test cases to fsub.ll.	2021-09-29 21:55:54 +03:00
Roman Lebedev	2d42a192e0	[X86][Costmodel] Load/store i8 Stride=2 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.5` So pick cost of `6`. For store we have: https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110709	2021-09-29 21:52:45 +03:00
Roman Lebedev	bac60c55e0	[X86][Costmodel] Load/store i8 Stride=2 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/a9hv4z47v - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/6GfPn1b79 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `3`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110708	2021-09-29 21:52:45 +03:00
Roman Lebedev	1962185671	[X86][Costmodel] Load/store i8 Stride=2 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 Identical to VF=2. For load we have: https://godbolt.org/z/4TEbdzbMM - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/MYfzGPf3Y - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110705	2021-09-29 21:52:45 +03:00
Roman Lebedev	08face1f9a	[X86][Costmodel] Load/store i8 Stride=2 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 Identical to VF=2. For load we have: https://godbolt.org/z/sGE41GYo7 - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/ba5r3s9xa - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110704	2021-09-29 21:52:45 +03:00
Roman Lebedev	7d52628eb0	[X86][Costmodel] Load/store i8 Stride=2 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/caKqjr9hb - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/6TTn3eKj8 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110702	2021-09-29 21:52:44 +03:00
Simon Pilgrim	17f1fc1e54	[TTI] BasicTTI::getInterleavedMemoryOpCost(): use getScalarizationOverhead() getScalarizationOverhead() results in a somewhat better cost estimation than counting the insertion/extraction costs directly. Notably, this is still overestimating the costs. Original Patch by: @lebedev.ri (Roman Lebedev) Differential Revision: https://reviews.llvm.org/D110713	2021-09-29 16:41:53 +01:00
Roman Lebedev	c13b4b6b0d	[NFC][X86][LV] Add costmodel test coverage for interleaved i8 load/store stride=2	2021-09-29 15:28:05 +03:00
Roman Lebedev	ff05e25a84	[NFC][X86][LV] Add some test coverage for [un]masked gather/scatter While we did have test coverage for the intrinsics, I don't believe there was LV-based test coverage.	2021-09-29 14:28:49 +03:00
Simon Pilgrim	bddc04bc4c	[CostModel][X86] Add SSE2/AVX1/AVX512BW test coverage for i16 interleaved load/store	2021-09-28 18:00:56 +01:00
Roman Lebedev	b6b7860954	[X86][Costmodel] Load/store i16 Stride=6 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For this tuple, measuring becomes problematic since there's a lot of spilling going on, but apparently all these memory ops do not affect worst-case estimate at all here. For load we have: https://godbolt.org/z/5qGb9odP6 - for intels `Block RThroughput: <=106.0`; for ryzens, `Block RThroughput: <=34.8` So pick cost of `106`. For store we have: https://godbolt.org/z/KrWcv4Ph7 - for intels `Block RThroughput: =58.0`; for ryzens, `Block RThroughput: <=20.5` So pick cost of `58`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110593	2021-09-28 19:15:08 +03:00
Roman Lebedev	24e42f7d28	[X86][Costmodel] Load/store i16 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/3Tc5s897j - for intels `Block RThroughput: =39.0`; for ryzens, `Block RThroughput: <=13.5` So pick cost of `39`. For store we have: https://godbolt.org/z/fo1h9E67e - for intels `Block RThroughput: =21.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `21`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110592	2021-09-28 19:15:07 +03:00
Roman Lebedev	b3011bcc78	[X86][Costmodel] Load/store i16 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=4.5` So pick cost of `9`. For store we have: https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110591	2021-09-28 19:15:01 +03:00
Roman Lebedev	aa93c55889	[X86][Costmodel] Load/store i16 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/bhscej4WM - for intels `Block RThroughput: =13.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `13`. For store we have: https://godbolt.org/z/Yf4Pfnxbq - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=3.5` So pick cost of `10`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110590	2021-09-28 19:14:56 +03:00
Max Kazantsev	00be84f910	Recommit "[Test] Add more tests with cycled phis"	2021-09-28 19:36:47 +07:00
Max Kazantsev	a91145f75a	Revert "[Test] Add more tests with cycled phis" This reverts commit `7128a545b3`. Need to regenerate tests after rebase.	2021-09-28 19:32:26 +07:00
Max Kazantsev	7128a545b3	[Test] Add more tests with cycled phis	2021-09-28 19:04:12 +07:00
Florian Hahn	764d9aa979	Recommit "[SCEV] Look through single value PHIs." (take 2) This reverts commit `8fdac7cb7a`. The issue causing the revert was fixed a while ago in `60b852092c`. Original message: Now that SCEVExpander can preserve LCSSA form, we do not have to worry about LCSSA form when trying to look through PHIs. SCEVExpander will take care of inserting LCSSA PHI nodes as required. This increases precision of the analysis in some cases. Reviewed By: mkazantsev, bmahjour Differential Revision: https://reviews.llvm.org/D71539	2021-09-28 10:32:17 +01:00
Roman Lebedev	2a7a768dad	[X86][Costmodel] Load/store i16 Stride=4 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For this tuple, measuring becomes problematic since there's a lot of spilling going on, but apparently all these memory ops do not affect worst-case estimate at all here. For load we have: https://godbolt.org/z/zP4hd8MT6 - for intels `Block RThroughput: =150.0`; for ryzens, `Block RThroughput: <=59` So pick cost of `150`. For store we have: https://godbolt.org/z/vKb8zTK8E - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=24.0` So pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110548	2021-09-27 22:20:01 +03:00
Roman Lebedev	ee5a050e2e	[X86][Costmodel] Load/store i16 Stride=4 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =75.0`; for ryzens, `Block RThroughput: <=29.5` So pick cost of `75`. (note that `# 32-byte Reload` does not affect throughput there.) For store we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110543	2021-09-27 22:20:01 +03:00
Roman Lebedev	5615d6a6dd	[X86][Costmodel] Load/store i16 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/dd8T5P471 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=14.5` So pick cost of `33`. For store we have: https://godbolt.org/z/zPxcKWhn4 - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `10`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110541	2021-09-27 22:20:01 +03:00
Roman Lebedev	df2b42d12e	[X86][Costmodel] Load/store i16 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rnsf639Wh - for intels `Block RThroughput: =17.0`; for ryzens, `Block RThroughput: <=7.5` So pick cost of `17`. For store we have: https://godbolt.org/z/565KKrcY6 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110537	2021-09-27 22:20:01 +03:00
Roman Lebedev	45caac91c4	[X86][Costmodel] Load/store i16 Stride=4 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/5EYc6r9nh - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `6`. For store we have: https://godbolt.org/z/z61e5d6GE - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110536	2021-09-27 22:20:01 +03:00
Daniil Fukalov	1f73f0c19d	[NFC][AMDGPU] Update cost model tests: 1. Convert to generated tests. 2. Added code-size case in few places.	2021-09-27 19:26:02 +03:00
Roman Lebedev	7424deb743	[X86][Costmodel] Load/store i16 Stride=2 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/q6GbK89br - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `18`. For store we have: https://godbolt.org/z/Yzfoo5TnW - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110507	2021-09-27 14:21:12 +03:00
Roman Lebedev	a5113e9445	[X86][Costmodel] Load/store i16 Stride=2 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.5` So pick cost of `9`. For store we have: https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110506	2021-09-27 14:20:11 +03:00
Roman Lebedev	70c90cc5bd	[X86][Costmodel] Load/store i16 Stride=2 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/e5YE99a4P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `6`. For store we have: https://godbolt.org/z/3vM4KsE1n - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `3`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110505	2021-09-27 14:18:29 +03:00
Roman Lebedev	49e532aa52	[X86][Costmodel] Load/store i16 Stride=2 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1j3nf3dro - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/4n1zvP37j - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110504	2021-09-27 14:15:25 +03:00
Max Kazantsev	4992220ea7	[Test] Regenerate test checks with autogen script	2021-09-27 16:55:59 +07:00
Max Kazantsev	0bd9162fd7	[Test] Add test showing that SCEV cannot properly infer ranges of cycled phis	2021-09-27 15:01:43 +07:00
Roman Lebedev	d9413f46b3	[X86][Costmodel] Load/store i16 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/M8vEKs5jY - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/Kx1nKz7je - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D103144	2021-09-26 19:13:23 +03:00
Simon Pilgrim	3538ee763d	[CostModel][X86] Improve AVX1/AVX2 v16i32->v16i16/v16i8 truncation costs (PR51972) Based off worst case btver2 (AVX1) and haswell (AVX2) llvm-mca reports	2021-09-26 13:43:46 +01:00
Simon Pilgrim	8c83bd3bd4	[CostModel][X86] Adjust vXi32 multiply costs if it can be performed using PMADDWD Update the costs to match the codegen from combineMulToPMADDWD - not only can we use PMADDWD if it's zero-extended, but also if it's a constant or sign-extended from a vXi16 (which can be replaced with a zero-extension).	2021-09-25 16:28:48 +01:00
Daniil Fukalov	4f28a2eb03	[NFC] Refactor tests to improve readability.	2021-09-24 01:57:30 +03:00
Simon Pilgrim	c931d35216	[CostModel][X86] Increase i64 mul cost from 1 to 2 Only the most recent cpus really support 1cy 64-bit multiplies, and the X64 cost table represents a realistic worst case. The 1cy value was also discouraging vectorization when most vXi64 PMULDQ expansions aren't actually slower than scalarization. Noticed while investigating PR51436.	2021-09-23 14:48:21 +01:00
Florian Mayer	36daf074d9	[hwasan] also omit safe mem[cpy|mov|set]. Reviewed By: eugenis Differential Revision: https://reviews.llvm.org/D109816	2021-09-22 11:08:27 +01:00
Antonio Frighetto	43d6991c2a	[IR] Look through bitcast in hasFnAttribute() A logic incompleteness may lead MemorySSA to be too conservative in its results. Specifically, when dealing with a call of kind `call i32 bitcast (i1 (i1)* @test to i32 (i32)*)(i32 %1)`, where the function `test` is declared with readonly attribute, the bitcast is not looked through, obscuring function attributes. Hence, some methods of CallBase (e.g., doesNotReadMemory) could provide suboptimal results. Differential Revision: https://reviews.llvm.org/D109888	2021-09-21 21:57:02 +02:00
David Spickett	92c9b28347	Revert "[AArch64][SVE] Teach cost model that masked loads/stores are cheap" This reverts commit `734708e04f` due to build failures on the 2 stage SVE VLS bot. https://lab.llvm.org/buildbot/#/builders/176/builds/908/steps/11/logs/stdio	2021-09-20 08:45:18 +00:00
Nikita Popov	80110aafa0	[Tests] Fix incorrect noalias metadata Mostly this fixes cases where !noalias or !alias.scope were passed a scope rather than a scope list. In some cases I opted to drop the metadata entirely instead, because it is not really relevant to the test.	2021-09-18 20:51:00 +02:00
Philip Reames	df7c2bcf4e	precommit tests for D109457	2021-09-16 12:43:22 -07:00
Philip Reames	f79ce5875f	autogen a SCEV test for ease of update	2021-09-16 12:19:30 -07:00
Max Kazantsev	e4da0f9657	[Test] Add test showing missing opportunity in range inference for SCEV	2021-09-16 15:40:56 +07:00
Philip Reames	248e430f37	precommit test for D109845/D106852	2021-09-15 12:53:55 -07:00
Philip Reames	9bdb19cca2	[SCEV] (udiv X, Y) * Y is always NUW Motivated by the removal done in D109782. This implements the correct flag part generically. Differential Revision: https://reviews.llvm.org/D109786	2021-09-15 11:34:50 -07:00
Philip Reames	a92f11b682	switch a couple of SCEV tests to autogen for ease of update	2021-09-15 11:11:07 -07:00
Simon Pilgrim	0767e43d87	[CostModel][X86] Adjust bitreverse/ctpop/ctlz/cttz AVX2+ costs based on llvm-mca reports Based off the worst case numbers generated by D103695, the AVX2/512 bit reversing/counting costs were higher than necessary (based off instruction counts instead of actual throughput).	2021-09-15 13:04:40 +01:00
Philip Reames	baff4b4105	[test] precommit another test for D109786	2021-09-14 15:31:44 -07:00
Philip Reames	162aed4824	[test] precommit test for D109786	2021-09-14 15:28:26 -07:00
Philip Reames	336291e777	autogen a test for ease of update in later patch	2021-09-14 14:48:47 -07:00
Florian Hahn	e248d69036	Recommit "[LAA] Support pointer phis in loop by analyzing each incoming pointer." SCEV does not look through non-header PHIs inside the loop. Such phis can be analyzed by adding separate accesses for each incoming pointer value. This results in 2 more loops vectorized in SPEC2000/186.crafty and avoids regressions when sinking instructions before vectorizing. Fixes PR50296, PR50288. Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D102266	2021-09-14 11:19:12 +01:00
Florian Mayer	5b5d774f5d	[hwasan] Respect returns attribute when tracking values. Reviewed By: vitalybuka Differential Revision: https://reviews.llvm.org/D109233	2021-09-13 20:52:24 +01:00
Florian Hahn	4c84a0f24c	[LAA] Add additional pointer phi tests.	2021-09-13 10:05:31 +01:00
Florian Mayer	57335b6e2e	[stack-safety] Allow to determine safe accesses. Reviewed By: vitalybuka Differential Revision: https://reviews.llvm.org/D109503	2021-09-10 19:23:54 +01:00
Philip Reames	eede4846a9	[SCEV] Allow negative steps for LT exit count computation for unsigned comparisons This bit of code is incredibly suspicious. It allows fully unknown (but potentially negative) steps, but not steps known to be negative. The comment about scev flag inference is worrying, but also not correct to my knowledge. At best, this might be covering up some related miscompile. However, there's no test in tree for it, the review history doesn't include obvious motivation, and the C++ example doesn't appear to give wrong results when hand translated to IR. I think it's time to remove this and see what falls out. During review, there were concerns raised about the correctness of the corresponding signed case. This change was deliberately narrowed to the unsigned case which has been audited and appears correct for negative values. We need to get back to the known-negative signed case, but that'll be a future patch if nothing falls out from this one. Differential Revision: https://reviews.llvm.org/D104140	2021-09-09 14:09:29 -07:00
Eli Friedman	8f792707c4	[ScalarEvolution] Fix pointer/int confusion in howManyLessThans. In general, howManyLessThans doesn't really want to work with pointers at all; the result is an integer, and the operands of the icmp are effectively integers. However, isLoopEntryGuardedByCond doesn't like extra ptrtoint casts, so the arguments to isLoopEntryGuardedByCond need to be computed without those casts. Somehow, the values got mixed up with the recent howManyLessThans improvements; fix the confused values, and add a better comment to explain what's happening. Differential Revision: https://reviews.llvm.org/D109465	2021-09-09 12:38:33 -07:00
Eli Friedman	0375734439	[NFC] Add extra test for D106331	2021-09-08 14:18:47 -07:00
Michael Kruse	088577a38e	[Delinearization] Require byte offset to be zero. Users of delinearization assume that the offset into the array element is zero. In most cases it will indeed be zero, but if it is not, the delinearization has to fail since it violates that assumption without the API even allowing to signal to the caller that the byte offset is non-zero. This bug caused Polly to miscompile blender (526.blender_r from SPEC CPU 2017) in -polly-process-unprofitable mode. The SCEV expression incorrectly delinearized has been reduced in the test case byte_offset.ll. The dropped offset into the array element of size 4 (a float) is ((sext i32 %mul7.i4534 to i64) + {(sext i32 %i1 to i64),+,((sext i32 (1 + ((1 + %shl.i.i) * (1 + %shl.i.i)) + %shl.i.i) to i64) * (sext i32 %i1 to i64))}<%for.body703>). This significant component was just dropped, and the wrong pointer was computed when regenerating code from the remaining delinearized subscripts. This occurred during blender's subsurface scattering implementation. As a result, blender's rendering diverged from the reference image. Patch D108885 would also fix the API. Reviewed By: bmahjour Differential Revision: https://reviews.llvm.org/D109133	2021-09-08 16:02:37 -05:00
Arthur Eubanks	b493124ae2	[MemorySSA] Support invariant.group metadata The implementation is mostly copied from MemDepAnalysis. We want to look at all loads and stores to the same pointer operand. Bitcasts and zero GEPs of a pointer are considered the same pointer value. We choose the most dominating instruction. Since updating MemorySSA with invariant.group is non-trivial, for now handling of invariant.group is not cached in any way, so it's part of the walker. The number of loads/stores with invariant.group is small for now anyway. We can revisit if this actually noticeably affects compile times. To avoid invariant.group affecting optimized uses, we need to have optimizeUsesInBlock() not use invariant.group in any way. Co-authored-by: Piotr Padlewski <prazek@google.com> Reviewed By: asbirlea, nikic, Prazek Differential Revision: https://reviews.llvm.org/D109134	2021-09-08 13:06:12 -07:00
Philip Reames	6cdca906c7	[SCEV] Use no-self-wrap flags inferred from exit structure to compute trip count The basic problem being solved is that we largely give up when encountering a trip count involving an IV which is not an addrec. We will fall back to the brute force constant eval, but that doesn't have the information about the fact that we can't cycle back through the same set of values. There's a high level design question of whether this is the right place to handle this, and if not, where that place is. The major alternative here would be to return a conservative upper bound, and then rely on two invocations of indvars to add the facts to the narrow IV, and then reconstruct SCEV. (I have not implemented the alternative and am not 100% sure this would work out.) That's arguably more in line with existing code, but I find this substantially easier to reason about. During review, no one expressed a strong opinion, so we went with this one. Differential Revision: D108651	2021-09-07 17:00:02 -07:00
David Sherwood	5dcf4b4fe0	[SVE][NFC] Add SVE cost model tests for gathers/scatters We previously didn't have any tests to defend the cost model for gathers and scatters using SVE without a vscale_range attribute. I've added tests to existing files: Analysis/CostModel/AArch64/sve-gather.ll Analysis/CostModel/AArch64/sve-scatter.ll Differential Revision: https://reviews.llvm.org/D109055	2021-09-07 14:13:37 +01:00
Nikita Popov	8d54c8a0c3	[SCEV] Fix applyLoopGuards() with range check idiom (PR51760) Due to a typo, this replaced %x with umax(C1, umin(C2, %x + C3)) rather than umax(C1, umin(C2, %x)). This didn't make a difference for the existing tests, because the result is only used for range calculation, and %x will usually have an unknown starting range, and the additional offset keeps it unknown. However, if %x already has a known range, we may compute a result range that is too small.	2021-09-06 22:22:41 +02:00
Andrew Litteken	bd4b1b5f6d	[IRSim] Adding support for recognizing branch similarity The current IRSimilarityIdentifier does not try to find similarity across blocks, this patch provides a mechanism to compare two branches against one another, to find similarity across basic blocks, rather than just within them. This adds a step in the similarity identification process that labels all of the basic blocks so that we can identify the relative branching locations. Within an IRSimilarityCandidate we use these relative locations to determine whether the branching to other relative locations in the same region is the same between branches. If they are, we consider them similar. We do not consider the relative location of the branch if the target branch is outside of the region. In this case, both branches must exit to a location outside the region, but the exact relative location does not matter. Reviewers: paquette, yroux Differential Revision: https://reviews.llvm.org/D106989	2021-09-06 11:55:38 -07:00
Simon Pilgrim	f114ef3731	[CostModel][X86] Add generic costs for vXi32 MUL -> v2Xi16 PMADDWD folds Based off the improved fold in D108522 This should eventually allow us to replace the SLM only cost patterns with generic versions.	2021-09-05 16:08:11 +01:00
Simon Pilgrim	9962ebaee5	[CostModel][X86] Add vXi32 multiply pattern tests Add tests for vXi32 multiplies where the operands have been extended from vXi8/vXi16	2021-09-05 16:08:11 +01:00
Arthur Eubanks	bd020bbbd2	[test] Cleanup tests with -enable-new-pm in llvm/test/Analysis	2021-09-04 16:06:10 -07:00
Arthur Eubanks	d896f22fda	[test] Cleanup legacy PM tests in llvm/test/Analysis/ScalarEvolution	2021-09-04 15:57:30 -07:00
Arthur Eubanks	813a7f1ad7	[MemorySSA] Properly handle liveOnEntry in the walker printer Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D109177	2021-09-02 12:51:27 -07:00
Arthur Eubanks	a270de359f	[test] Remove missed RUN line after D109040	2021-09-02 11:44:45 -07:00
Arthur Eubanks	50153213c8	[test][NewPM] Remove RUN lines using -analyze Only tests in llvm/test/Analysis. -analyze is legacy PM-specific. This only touches files with `-passes`. I looked through everything and made sure that everything had a new PM equivalent. Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D109040	2021-09-02 11:38:14 -07:00
Nikita Popov	c86e1ce73b	[SCEVExpander] Simplify pointer overflow check This is a followup to D104662 to generate slightly nicer code for pointer overflow checks. Bypass expandAddToGEP and instead explicitly generate i8 GEPs. This saves some bitcasts and negates the value in a more obvious way. In particular, this prevents SCEV from looking through the umul.with.overflow, same as in the integer case. The wrapping-pointer-ni.ll test deserves a comment: Previously, this generated a typed GEP which used the umulo argument rather than the multiplication result. This results in more compact IR in that case, but effectively does the multiplication twice, the second one is just hidden in the GEP. Reusing the umulo result seems pretty reasonable to me. Differential Revision: https://reviews.llvm.org/D109093	2021-09-02 20:15:59 +02:00
Roman Lebedev	3f1f08f0ed	Revert @llvm.isnan intrinsic patchset. Please refer to https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html (and that whole thread.) TLDR: the original patch had no prior RFC, yet it had some changes that really need a proper RFC discussion. It won't be productive to discuss such an RFC, once it's actually posted, while said patch is already committed, because that introduces bias towards already-committed stuff, and the tree is potentially in broken state meanwhile. While the end result of discussion may lead back to the current design, it may also not lead to the current design. Therefore I take it upon myself to revert the tree back to last known good state. This reverts commit `4c4093e6e3`. This reverts commit `0a2b1ba33a`. This reverts commit `d9873711cb`. This reverts commit `791006fb8c`. This reverts commit `c22b64ef66`. This reverts commit `72ebcd3198`. This reverts commit `5fa6039a5f`. This reverts commit `9efda541bf`. This reverts commit `94d3ff09cf`.	2021-09-02 13:53:56 +03:00
David Sherwood	d581d94385	[SVE] Fix the FP arithmetic instruction costs for SVE Several FP instructions (fadd, fsub, etc.) were incorrectly assigned a higher cost for SVE because they have custom lowering, however we know they are legal. This patch explicitly assigns a cost of 2 to these opcodes. Tests added here: Analysis/CostModel/AArch64/arith-fp-sve.ll Differential Revision: https://reviews.llvm.org/D108993	2021-09-02 09:55:13 +01:00
Arthur Eubanks	1c503e923a	[test] Precommit/fix up existing test for MemorySSA/invariant.group	2021-09-01 22:58:17 -07:00
Arthur Eubanks	7b08d9da55	Reland [MemorySSA] Add pass to print results of MemorySSA walker Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D109028	2021-09-01 18:58:57 -07:00
Arthur Eubanks	0f63496ea4	Revert "[MemorySSA] Add pass to print results of MemorySSA walker" This reverts commit `8f98477c2d`. Breaks bots	2021-09-01 18:45:19 -07:00
Arthur Eubanks	8f98477c2d	[MemorySSA] Add pass to print results of MemorySSA walker Reviewed By: asbirlea Differential Revision: https://reviews.llvm.org/D109028	2021-09-01 18:29:15 -07:00
Philip Reames	29fa37ec9f	[SCEV] If max BTC is zero, then so is the exact BTC [2 of 2] This extends D108921 into a generic rule applied to constructing ExitLimits along all paths. The remaining paths (primarily howFarToZero) don't have the same reasoning about UB sensitivity as the howManyLessThan ones did. Instead, the remaining cause for max counts being more precise than exact counts is that we apply context sensitive loop guards on the max path, and not on the exact path. That choice is mildly suspect, but out of scope of this patch. The MVETailPredication.cpp change deserves a bit of explanation. We were previously figuring out that two SCEVs happened to be equal because they happened to be identical. When we optimized one with context sensitive information, but not the other, we lost the ability to prove them equal. So, cover this case by subtracting and then applying loop guards again. Without this, we see changes in test/CodeGen/Thumb2/mve-blockplacement.ll Differential Revision: https://reviews.llvm.org/D109015	2021-09-01 11:51:48 -07:00
David Sherwood	f024a4818d	[NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll	2021-09-01 12:09:58 +01:00
David Sherwood	930d5077f4	Revert "[NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll" This reverts commit `aeb2bd68dc`.	2021-09-01 11:52:29 +01:00
David Sherwood	aeb2bd68dc	[NFC] Re-run update_analyze_test_checks on Analysis/CostModel/AArch64/sve-intrinsics.ll	2021-09-01 11:44:02 +01:00
Philip Reames	c49503a76d	[SCEV] Add a testcase for zero max btc with non-constant exact btc Reduced from the ArchiveCommandLine.ll case seen in D108848.	2021-08-31 11:00:41 -07:00
Philip Reames	6600e1759b	[SCEV] If max BTC is zero, then so is the exact BTC [1 of N] This patch is specifically the howManyLessThan case. There will be a couple of follow-on patches for other codepaths. The subtle bit is explaining why the two codepaths have a difference while both are correct. The test case with modifications is a good example, so let's discuss in terms of it. * The previous exact bounds for this example of (-126 + (126 smax %n))<nsw> can evaluate to either 0 or 1. Both are "correct" results, but only one of them results in a well defined loop. If %n were 127 (the only possible value producing a trip count of 1), then the loop must execute undefined behavior. As a result, we can ignore the TC computed when %n is 127. All other values produce 0. * The max taken count computation uses the limit (i.e. the maximum value END can be without resulting in UB) to restrict the bound computation. As a result, it returns 0 which is also correct. WARNING: The logic above only holds for a single exit loop. The current logic for max trip count would be incorrect for multiple exit loops, except that we never call computeMaxBECountForLT except when we can prove either a) no overflow occurs in this IV before exit, or b) this is the sole exit. An alternate approach here would be to add the limit logic to the symbolic path. I haven't played with this extensively, but I'm hesitant because a) the term is optional and b) I'm not sure it'll reliably simplify away. As such, the resulting code quality from expansion might actually get worse. This was noticed while trying to figure out why D108848 wasn't NFC, but is otherwise standalone. Differential Revision: https://reviews.llvm.org/D108921	2021-08-31 08:50:11 -07:00
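The case analysis in `6600e1759b` can be checked by brute force. This is a minimal sketch, assuming `%n` is an 8-bit signed value (the bound of 127 in the message suggests this, but the message does not state the type). `exact_btc` is a hypothetical helper name, and the code is plain Python arithmetic rather than anything SCEV itself runs.

```python
def exact_btc(n):
    """Evaluate (-126 + (126 smax n)) for a signed 8-bit n (assumed width)."""
    # smax is a signed maximum; the result stays in 0..1, so no wrap occurs.
    return max(126, n) - 126


# Enumerate every signed 8-bit value and record which ones give a nonzero count.
I8_VALUES = range(-128, 128)
counts = {n: exact_btc(n) for n in I8_VALUES}
nonzero_inputs = [n for n, count in counts.items() if count != 0]
```

Under that assumption, `nonzero_inputs` contains only 127, which matches the message's claim that 127 is the only value producing a trip count of 1.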
Philip Reames	301fbf9b81	[SCEV] Clarify the overflow precondition of computeMaxBECountForLT [NFC] And add a test case to illustrate that we do in fact produce the right result for the multiple exit case. I have gotten myself confused at least three times when reading this code, so clarify to prevent future confusion.	2021-08-30 09:49:17 -07:00
Daniil Fukalov	5b3fad4966	[AMDGPU][CostModel] Update shuffle instruction tests. NFC. New tests ported over from test/Analysis/CostModel/AArch64/shuffle-other.ll.	2021-08-30 19:17:27 +03:00
Matthew Devereau	9b830c798e	[AArch64][SVE] Teach cost model masked gathers/scatters are cheap Tell the cost model to use the scalable calculation for non-NEON fixed vectors. This results in a cheaper cost for fixed-length SVE masked gathers/scatters, allowing the vectorizer to emit them more frequently.	2021-08-26 11:17:47 +01:00
Philip Reames	4d235bf75d	[tests] Add a couple tests for intersection of `ec8d87e` and D108651	2021-08-24 14:29:36 -07:00
Philip Reames	ec8d87e9f5	[SCEV] Infer nuw from nw for addrecs This was previously committed in `914836b`, and reverted due to confusion on the status of the review. Differential Revision: https://reviews.llvm.org/D108601	2021-08-24 14:24:05 -07:00
Philip Reames	35b0b1a64a	[test] Precommit tests for D108651	2021-08-24 14:18:58 -07:00
Philip Reames	58582bae63	Revert "[SCEV] Infer nsw/nuw from nw for addrecs" This reverts commit `914836b1c8`. Further comments on review came up after initial approval. Reverting while addressing.	2021-08-24 09:28:37 -07:00
Philip Reames	914836b1c8	[SCEV] Infer nsw/nuw from nw for addrecs If we know an addrec doesn't self-wrap, the increment is strictly positive, and the start value is the smallest representable value, then we know that the corresponding wrap type cannot occur. Differential Revision: https://reviews.llvm.org/D108601	2021-08-24 08:53:21 -07:00
Simon Pilgrim	9efda541bf	[CostModel][X86] Add costs for f32/f64 scalar and vector types. The f16 half types are still pretty useless as we don't have it as a legal type (we treat them as i16 most of the time)	2021-08-20 14:31:12 +01:00
Bjorn Pettersson	d52f506192	[NewPM] Use parameterized syntax for a couple of more passes A couple of passes that are parameterized in new-PM used different pass names (in cmd line interface) while using the same pass class name. This patch updates the PassRegistry to model pass parameters more properly using PASS_WITH_PARAMS. Reason for the change is to ensure that we have a 1-1 mapping between class name and pass name (when disregarding the params). With a 1-1 mapping it is more obvious which pass name to use in options such as -debug-only, -print-after etc. The opt -passes syntax is changed for the following passes: early-cse-memssa => early-cse<memssa> post-inline-ee-instrument => ee-instrument<post-inline> loop-extract-single => loop-extract<single> lower-matrix-intrinsics-minimal => lower-matrix-intrinsics<minimal> This patch is not updating pass names in docs/Passes.rst. Not quite sure what the status is for that document (e.g. when it comes to listing pass parameters). It is only loop-extract-single that is mentioned in Passes.rst today, out of the passes mentioned above. Differential Revision: https://reviews.llvm.org/D108362	2021-08-20 14:59:21 +02:00
Simon Pilgrim	72ebcd3198	[CostModel][X86] Add isnan half/float/double costs tests	2021-08-19 18:07:06 +01:00
Simon Pilgrim	9419729b6a	[CostModel][X86] Add VPOPCNTDQ/BITALG ctpop costs VPOPCNTDQ + BITALG add ctpop instructions for vXi64/vXi32 + vXi16/vXi8 vector types respectively	2021-08-19 15:40:09 +01:00
Simon Pilgrim	2d60fdd7aa	[CostModel][X86] Add VPOPCNT/BITALG test coverage for ctpop/cttz costs	2021-08-19 14:05:58 +01:00
Matthew Devereau	734708e04f	[AArch64][SVE] Teach cost model that masked loads/stores are cheap Reduce the cost of VLS masked loads/stores to make the vectorizer emit them more frequently.	2021-08-19 13:01:33 +01:00
Peter Collingbourne	6f85225ef3	StackLifetime: Remove asserts for multiple lifetime intrinsics. According to the langref, it is valid to have multiple consecutive lifetime start or end intrinsics on the same object. For llvm.lifetime.start: "If ptr [...] is a stack object that is already alive, it simply fills all bytes of the object with poison." For llvm.lifetime.end: "Calling llvm.lifetime.end on an already dead alloca is no-op." However, we currently fail an assertion in such cases. I've observed the assertion failure when the loop vectorization pass duplicates the intrinsic. We can conservatively handle these intrinsics by ignoring all but the first one, which can be implemented by removing the assertions. Differential Revision: https://reviews.llvm.org/D108337	2021-08-18 18:45:28 -07:00
Nikita Popov	3dd8c9176b	[LICM] Remove AST-based implementation MSSA-based LICM has been enabled by default for a few years now. This drops the old AST-based implementation. Using loop(licm) will result in a fatal error, the use of loop-mssa(licm) is required (or just licm, which defaults to loop-mssa). Note that the core canSinkOrHoistInst() logic has to retain AST support for now, because it is shared with LoopSink. Differential Revision: https://reviews.llvm.org/D108244	2021-08-18 20:21:53 +02:00
David Sherwood	219d4518fc	[Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive For tight loops like this: float r = 0; for (int i = 0; i < n; i++) { r += a[i]; } it's better not to vectorise at -O3 using fixed-width ordered reductions on AArch64 targets. Although the resulting number of instructions in the generated code ends up being comparable to not vectorising at all, there may be additional costs on some CPUs, for example perhaps the scheduling is worse. It makes sense to deter vectorisation in tight loops. Differential Revision: https://reviews.llvm.org/D108292	2021-08-18 17:01:56 +01:00
Dylan Fleming	ef198cd99e	[SVE] Remove usage of getMaxVScale for AArch64, in favour of IR Attribute Removed AArch64 usage of the getMaxVScale interface, replacing it with the vscale_range(min, max) IR Attribute. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D106277	2021-08-17 14:42:47 +01:00
Nikita Popov	735a590471	[MemorySSA] Remove -enable-mssa-loop-dependency option This option has been enabled by default for quite a while now. The practical impact of removing the option is that MSSA use cannot be disabled in default pipelines (both LPM and NPM) and in manual LPM invocations. NPM can still choose to enable/disable MSSA using loop vs loop-mssa. The next step will be to require MSSA for LICM and drop the AST-based implementation entirely. Differential Revision: https://reviews.llvm.org/D108075	2021-08-16 20:59:37 +02:00
Nikita Popov	e11354c0a4	[Tests] Remove explicit -enable-mssa-loop-dependency options (NFC) This is enabled by default. Drop explicit uses in preparation for removing the option. Also drop RUN lines that are now the same (typically modulo a -verify-memoryssa option).	2021-08-14 21:21:07 +02:00
Florian Hahn	f999312872	Recommit "[Matrix] Overload stride arg in matrix.columnwise.load/store." This reverts the revert `28c04794df`. The failing MLIR test that caused the revert should be fixed in this version. Also includes a PPC test fix previously in `1f87c7c478`.	2021-08-12 18:31:57 +01:00
Florian Hahn	a72cd6353c	Revert "[Matrix] Update column.major.load call in PPC test." Dependent commit `a1ef81de35` has been reverted in `a1ef81de35`.	2021-08-12 13:13:52 +01:00
Florian Hahn	1f87c7c478	[Matrix] Update column.major.load call in PPC test. `a1ef81de35` adjusted the definition of the intrinsic, but did not update a PowerPC test. Fix the test by updating the call & declaration of @llvm.matrix.column.major.load.	2021-08-12 11:26:33 +01:00
Archibald Elliott	b764b1ef2f	[NFC][X86] New Test Requires Asserts D105263 introduced this new test. It fails when asserts are disabled, due to using a debug option on opt. Reviewed By: pengfei Differential Revision: https://reviews.llvm.org/D107805	2021-08-10 10:22:04 +01:00
Wang, Pengfei	6f7f5b54c8	[X86] AVX512FP16 instructions enabling 1/6 1. Enable FP16 type support and basic declarations used by following patches. 2. Enable new instructions VMOVW and VMOVSH. Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html Reviewed By: LuoYuanke Differential Revision: https://reviews.llvm.org/D105263	2021-08-10 12:46:01 +08:00
Nikita Popov	88003cea1c	[MemCpyOpt] Remove MemDepAnalysis-based implementation The MemorySSA-based implementation has been enabled for a few months (since D94376). This patch drops the old MDA-based implementation entirely. I've kept this to only the basic cleanup of dropping various conditions -- the code could be further cleaned up now that there is only one implementation. Differential Revision: https://reviews.llvm.org/D102113	2021-08-07 22:35:44 +02:00
Zheng Chen	30b0c455b1	[LoopCacheAnalysis]: handle mismatch type for Numerator and CacheLineSize Fix an assertion failure caused by mismatched types for Numerator and CacheLineSize in the loop cache analysis pass. Reviewed By: bmahjour Differential Revision: https://reviews.llvm.org/D107618	2021-08-06 16:51:09 +00:00
David Green	649cf4514d	[AArch64] Expand the SVE min/max reduction costs to NEON This takes the existing SVE costing for the various min/max reduction intrinsics and expands it to NEON, where I believe it applies equally well. In the process it changes the lowering to use min/max cost, as opposed to summing up the cost of ICmp+Select. Differential Revision: https://reviews.llvm.org/D106239	2021-08-05 23:23:24 +01:00
Bardia Mahjour	0e08891ec1	[DA] control compile-time spent by MIV tests Function exploreDirections() in DependenceAnalysis implements a recursive algorithm for refining direction vectors. This algorithm has worst-case complexity of O(3^(n+1)) where n is the number of common loop levels. In this patch I'm adding a threshold to control the amount of time we spend in doing MIV tests (which most of the time end up resulting in over pessimistic direction vectors anyway). Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D107159	2021-08-05 09:50:11 -04:00
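The O(3^(n+1)) bound quoted in `0e08891ec1` follows from trying three directions (`<`, `=`, `>`) at each of n loop levels. The sketch below only counts the nodes of that unpruned search tree and shows how a visit limit caps the work. It is an illustration, not the DependenceAnalysis code: `explore_directions` and its `limit` parameter are hypothetical names, the real algorithm prunes branches using dependence tests, and the message does not say what the patch reports once its threshold is reached.

```python
DIRECTIONS = ("<", "=", ">")


def explore_directions(num_levels, limit=None):
    """Walk the full direction-vector tree, stopping after `limit` nodes.

    Returns (nodes_visited, complete_vectors, finished).
    """
    visited = 0
    complete = []

    def recurse(prefix):
        nonlocal visited
        # Give up once the (hypothetical) budget is spent.
        if limit is not None and visited >= limit:
            return False
        visited += 1
        if len(prefix) == num_levels:
            complete.append(prefix)
            return True
        # Refine the next loop level with each of the three directions.
        for direction in DIRECTIONS:
            if not recurse(prefix + (direction,)):
                return False
        return True

    finished = recurse(())
    return visited, complete, finished


full_visits, full_vectors, full_done = explore_directions(3)
capped_visits, capped_vectors, capped_done = explore_directions(3, limit=10)
```

Without a limit, the walk over n levels visits 1 + 3 + ... + 3^n = (3^(n+1) - 1) / 2 nodes, which is where the exponent n+1 comes from.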
Irina Dobrescu	b01417d3c5	[AArch64] Optimise min/max lowering in ISel Differential Revision: https://reviews.llvm.org/D106561	2021-08-02 13:40:21 +01:00
Sjoerd Meijer	46a861af3d	[CostModel][AArch64] Add some shuffle concat tests. NFC. Test ported over from test/Analysis/CostModel/ARM/shuffle.ll.	2021-08-02 12:11:00 +01:00
Simon Pilgrim	872a950033	[CostModel] Treat 'widen subvector' patterns as zero cost As discussed on D107228, widening a subvector by inserting the whole subvector into the bottom of a larger undef vector should always be cheap enough that we can treat it as zero cost. NOTE: If this proves to cause issues we have the option of introducing a "SK_WidenSubvector" shuffle kind enum that targets could override the zero cost, but that doesn't seem necessary atm. Differential Revision: https://reviews.llvm.org/D107228	2021-08-02 11:43:10 +01:00
Simon Pilgrim	7397dcb403	[TTI] Add basic SK_InsertSubvector shuffle mask recognition This patch adds an initial ShuffleVectorInst::isInsertSubvectorMask helper to recognize 2-op shuffles where the lowest elements of one of the sources are being inserted into the "in-place" other operand, this includes "concat_vectors" patterns as can be seen in the Arm shuffle cost changes. This also helped fix an x86 issue with irregular/length-changing SK_InsertSubvector costs - I'm hoping this will help with D107188 This doesn't currently attempt to work with 1-op shuffles that could either be a "widening" shuffle or a self-insertion. The self-insertion case is tricky, but we currently always match this with the existing SK_PermuteSingleSrc logic. The widening case will be addressed in a follow up patch that treats the cost as 0. Masks with a high number of undef elts will still struggle to match optimal subvector widths - it's currently bounded by minimum-width possible insertion, whilst some cases would benefit from wider (pow2?) subvectors. Differential Revision: https://reviews.llvm.org/D107228	2021-08-02 11:23:44 +01:00
David Green	098984a80c	[AArch64] Update and expand min-max cost model test. NFC This expands the cost model test for min/max to many more types, including floating point minnum/maxnum and minimum/maximum, and FP16 with and without fullfp16. The old llc run lines are removed, as those are better tested by CodeGen tests.	2021-07-27 18:48:58 +01:00
Simon Pilgrim	77c5e6ba90	[Analysis] Fix getOrderedReductionCost to call target's getArithmeticInstrCost implementation The getOrderedReductionCost implementation introduced in D105432 calls the CRTP base version getArithmeticInstrCost instead of redirecting to the target version. Differential Revision: https://reviews.llvm.org/D106795	2021-07-26 17:15:43 +01:00
David Sherwood	0aff1798b5	[Analysis] Add simple cost model for strict (in-order) reductions I have added a new FastMathFlags parameter to getArithmeticReductionCost to indicate what type of reduction we are performing: 1. Tree-wise. This is the typical fast-math reduction that involves continually splitting a vector up into halves and adding each half together until we get a scalar result. This is the default behaviour for integers, whereas for floating point we only do this if reassociation is allowed. 2. Ordered. This now allows us to estimate the cost of performing a strict vector reduction by treating it as a series of scalar operations in lane order. This is the case when FP reassociation is not permitted. For scalable vectors this is more difficult because at compile time we do not know how many lanes there are, and so we use the worst case maximum vscale value. I have also fixed getTypeBasedIntrinsicInstrCost to pass in the FastMathFlags, which meant fixing up some X86 tests where we always assumed the vector.reduce.fadd/mul intrinsics were 'fast'. New tests have been added here: Analysis/CostModel/AArch64/reduce-fadd.ll Analysis/CostModel/AArch64/sve-intrinsics.ll Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll Differential Revision: https://reviews.llvm.org/D105432	2021-07-26 10:26:06 +01:00
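The two reduction shapes described in `0aff1798b5` can be sketched directly. This is an illustration in Python floats (IEEE double), not LLVM code: `ordered_reduce` and `tree_reduce` are hypothetical names, and the tree version assumes a power-of-two lane count. It shows why the ordered form is required when FP reassociation is not permitted: the two orders can round differently.

```python
def ordered_reduce(lanes, start=0.0):
    """Strict reduction: one scalar add per lane, in lane order."""
    acc = start
    for value in lanes:
        acc = acc + value
    return acc


def tree_reduce(lanes):
    """Tree-wise reduction: repeatedly add the upper half onto the lower half."""
    lanes = list(lanes)
    # Sketch-only restriction: a non-empty, power-of-two number of lanes.
    assert lanes and len(lanes) & (len(lanes) - 1) == 0
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0]


# Lanes chosen so that rounding depends on the association order.
lanes = [1e16, 1.0, -1e16, 1.0]
ordered_result = ordered_reduce(lanes)
tree_result = tree_reduce(lanes)
```

For these lanes the ordered sum loses the first `1.0` to rounding and yields 1.0, while the tree-wise sum cancels the large terms first and yields 2.0. The ordered form also needs one dependent add per lane, which is consistent with the message costing it as a series of scalar operations.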
Sander de Smalen	c3277a8828	[BasicTTI] Set scalarization cost of scalable vector casts to Invalid. When BasicTTIImpl::getCastInstrCost can't determine the cost of a vector cast operation when the types need legalization, it falls back to calculating scalarization costs. Instead of crashing on `cast<FixedVectorType>(DstVTy)` when the type is a scalable vector, return an Invalid cost. Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D106655	2021-07-24 14:13:21 +01:00
Philip Reames	e9d4bb43f8	[tests] SCEV trip count w/ neg step and varying rhs	2021-07-23 17:19:46 -07:00
Philip Reames	4a3dc7dc9a	[SCEV] Fix bug involving zero step and non-invariant RHS in trip count logic Eli pointed out the issue when reviewing D104140. The max trip count logic makes an assumption that the value of IV changes. When the step is zero, the nowrap fact becomes trivial, and thus there's nothing preventing the loop from being nearly infinite. (The "nearly" part is because mustprogress may disallow an infinite loop while still allowing 999999999 iterations before RHS happens to allow an exit.) This is very difficult to see in practice. You need a means to produce a loop varying RHS in a mustprogress loop which doesn't allow the loop to be infinite. In most cases, LICM or SCEV are smart enough to remove the loop varying expressions. Differential Revision: https://reviews.llvm.org/D106327	2021-07-23 15:19:23 -07:00
David Green	38986c6782	[AArch64] Add worst case shuffle costs This adds some missing single source shuffle costs for AArch64, of i16 and i8 vectors. v4i16 are the same as v4i32 with a worst case cost of 3 coming from the perfect shuffle tables. The larger vector sizes expand into a constant pool, plus a load (and adrp) and a tbl. I arbitrarily chose 8 for the cost to be expensive but not too expensive. Differential Revision: https://reviews.llvm.org/D106241	2021-07-23 09:01:58 +01:00
Simon Pilgrim	4185c5502c	[CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports. Update shl/lshr/ashr costs based on the worst case costs from the script in D103695 - many of the 128-bit shifts (usually where integer multiplies aren't used) have similar behaviour to AVX1 so we can merge them.	2021-07-22 20:07:32 +01:00
Simon Pilgrim	2657fe1721	[CostModel][X86] Fix funnel shift check prefixes We'd lost AVX1 test coverage due to bulldozer (XOP) trying to use the same check prefixes - we really need to fix the update script to avoid this!	2021-07-22 20:07:31 +01:00
David Green	c9cebda772	[AArch64] Adjust the cost of integer sum reductions This changes the cost to (LT.first-1) * cost(add) + 2, where the cost of an add is assumed to be 1. This brings it in line with the other reductions. Differential Revision: https://reviews.llvm.org/D106240	2021-07-22 18:19:54 +01:00
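The formula in `c9cebda772` can be worked through with a few values. This is a sketch under an assumption the message does not spell out: `LT.first` is read here as the number of legal-width parts the vector type is split into during legalization. `int_add_reduction_cost` is a hypothetical name.

```python
def int_add_reduction_cost(lt_first, add_cost=1):
    """(LT.first - 1) * cost(add) + 2, with cost(add) defaulting to 1."""
    # Assumed reading: lt_first - 1 vector adds combine the split parts,
    # and the constant 2 covers the final in-register reduction.
    return (lt_first - 1) * add_cost + 2


# One legal vector, then types that split into 2 and 4 legal parts.
costs = [int_add_reduction_cost(parts) for parts in (1, 2, 4)]
```

With `cost(add)` at 1, a type that is already legal costs 2, and each additional split part adds 1.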
Simon Pilgrim	e1bdb57958	[CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports. Update shl/lshr/ashr costs based on the worst case costs from the script in D103695.	2021-07-22 18:12:49 +01:00
David Green	a92974bfdf	[AArch64] Add and update reduction and shuffle costs. NFC	2021-07-22 10:22:42 +01:00
Philip Reames	4c40cfc20b	[tests] Add a couple of tests for zero stride trip counts w/loop varying exit values	2021-07-19 16:33:10 -07:00
Eli Friedman	de3ea51be4	[ScalarEvolution] Refine computeMaxBECountForLT to be accurate in more cases. Allow arbitrary strides, and make sure we return the correct result when the backedge-taken count is zero. Differential Revision: https://reviews.llvm.org/D106197	2021-07-19 15:43:30 -07:00
Simon Pilgrim	5939c642ae	[CostModel][X86] Add fast math tests for float reductions As noticed on D105432 we didn't have any coverage to distinguish between fast/exact float reductions	2021-07-19 13:01:28 +01:00
Eli Friedman	cbba71bfb5	[ScalarEvolution] Fix overflow in computeBECount. The current implementation of computeBECount doesn't account for the possibility that adding "Stride - 1" to Delta might overflow. For almost all loops, it doesn't, but it's not actually proven anywhere. To deal with this, use a variety of tricks to try to prove that the addition doesn't overflow. If the proof is impossible, use an alternate sequence which never overflows. Differential Revision: https://reviews.llvm.org/D105216	2021-07-16 16:15:18 -07:00
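The overflow described in `cbba71bfb5` is the familiar hazard of computing a rounded-up division as `(Delta + Stride - 1) / Stride` in fixed-width arithmetic. The sketch below models 8-bit unsigned values in Python to show the wrap, alongside one rewrite that cannot overflow. The 8-bit width and the function names are assumptions for illustration, and the rewrite shown is a standard identity; the message does not say which alternate sequence the patch emits.

```python
BITS = 8                     # assumed width, small enough to show the wrap
MASK = (1 << BITS) - 1


def ceil_div_naive(delta, stride):
    """ceil(delta / stride) via (delta + stride - 1) / stride, wrapping at BITS."""
    return ((delta + stride - 1) & MASK) // stride


def ceil_div_safe(delta, stride):
    """Same quotient without the overflowing add.

    Take 1 out of a nonzero delta first, divide rounding down, then add it back.
    Every intermediate value is at most delta, so nothing wraps.
    """
    nonzero = min(delta, 1)
    return nonzero + (delta - nonzero) // stride


naive = ceil_div_naive(250, 8)   # 250 + 7 wraps past 255
safe = ceil_div_safe(250, 8)
```

Here the naive form wraps to a quotient of 0, while the safe form returns 32. Both agree whenever `delta + stride - 1` fits in the width, which is the case the message says holds for almost all loops.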


3272 Commits