Commit Graph

20031 Commits

David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.
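
For illustration, a minimal IR sketch (function name, CPU name and feature string are made up here) of how a cost-model test can pin the tuning CPU via the "tune-cpu" function attribute, which getVScaleForTuning consults per the rules above:

  define void @tune_sketch(<vscale x 4 x i32> %v, <vscale x 4 x i32>* %dst) #0 {
    store <vscale x 4 x i32> %v, <vscale x 4 x i32>* %dst
    ret void
  }
  attributes #0 = { "target-features"="+sve" "tune-cpu"="neoverse-v1" }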

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
eopXD 76db6d8080 [NFC][LoopIdiom] Add more test cases for runtime-determined memset size
This patch supplements missing test cases for D107353.
- Fix wrong descriptions in the 64-bit mode test case
- Add a test case for 32-bit mode

Reviewed By: bmahjour

Differential Revision: https://reviews.llvm.org/D108507
2021-10-21 00:05:18 -07:00
Nikita Popov 8e4ae603d6 [Tests] Add tests for non-speculatable ephemeral values
The loads in these examples are currently not considered ephemeral
because they are not speculatable.
2021-10-20 23:33:36 +02:00
Stanislav Mekhanoshin b92412fb28 [InstCombine] Fold `(a & ~b) & ~c` to `a & ~(b | c)`
  %not1 = xor i32 %b, -1
  %not2 = xor i32 %c, -1
  %and1 = and i32 %a, %not1
  %and2 = and i32 %and1, %not2
=>
  %i1 = or i32 %b, %c
  %i2 = xor i32 %i1, -1
  %and2 = and i32 %i2, %a

Differential Revision: https://reviews.llvm.org/D112108
2021-10-20 13:05:46 -07:00
Stanislav Mekhanoshin 3c59cdee5c Precommit updated InstCombine/and-xor-or.ll test. NFC. 2021-10-20 12:50:23 -07:00
Florian Hahn 8977bd5806
[IndVars] Invalidate SCEV when IR is changed in rewriteLoopExitValue.
At the moment, rewriteLoopExitValue forgets the current phi node in the
loop that collects phis to rewrite. A few lines after the value is
forgotten, SCEV is used again to analyze incoming values and
potentially expand SCEV expression. This means that another SCEV is
created for PN, before the IR is actually updated in the next loop.

This leads to accessing invalid cached expression in combination with
D71539.

PN should only be changed once the actual incoming exit value is set in
the next loop. Moving invalidation there should ensure that PN is
invalidated in all relevant cases.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D111495
2021-10-20 20:48:33 +01:00
Sanjay Patel 80ab06c599 [InstCombine] fold fake vector insert to bit-logic
bitcast (inselt (bitcast X), Y, 0) --> or (and X, MaskC), (zext Y)

https://alive2.llvm.org/ce/z/Ux-662

Similar to D111082 / db231ebdb0 :
We want to avoid relatively opaque vector ops on types that are
likely supported by the backend as scalar integers. The bitwise
logic ops are more likely to allow further combining.

We probably want to generalize this to allow a shift too, but
that would oppose instcombine's general rule of not creating
extra instructions, so that's left as a potential follow-up.
Alternatively, we could do that transform in VectorCombine
with the help of the TTI cost model.

This is part of solving:
https://llvm.org/PR52057
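
As a concrete little-endian sketch of the pattern (types, names and mask value here are illustrative):

  define i32 @src(i32 %x, i8 %y) {
    %v = bitcast i32 %x to <4 x i8>
    %ins = insertelement <4 x i8> %v, i8 %y, i64 0
    %r = bitcast <4 x i8> %ins to i32
    ret i32 %r
  }
  define i32 @tgt(i32 %x, i8 %y) {
    %masked = and i32 %x, -256   ; MaskC clears the replaced lane (low byte on little-endian)
    %z = zext i8 %y to i32
    %r = or i32 %masked, %z
    ret i32 %r
  }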
2021-10-20 14:21:40 -04:00
Stanislav Mekhanoshin 503d061dc7 Precommit InstCombine/and-xor-or.ll test. NFC. 2021-10-20 11:13:12 -07:00
Sanjay Patel ea9a0556b4 [InstCombine] add tests for casted insertelement; NFC 2021-10-20 12:17:58 -04:00
Fraser Cormack eabf11f9ea [CodeGenPrepare] Avoid a scalable-vector crash in ctlz/cttz
This patch fixes a crash when despeculating ctlz/cttz intrinsics with
scalable-vector types. It is not safe to speculatively query the size of
the vector type in bits when the vector type is not fixed-length. As it
happens, this isn't required, since vector types are skipped anyway.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112141
2021-10-20 16:45:55 +01:00
Bjorn Pettersson 3d152bc49d [NewPM][test] Strictly use -passes in some more lit tests
Removed/replaced RUN lines using legacy PM syntax in favor of using
-passes in lit tests for Float2Int, MetaRenamer, StripDeadPrototypes
and StripSymbols.
2021-10-20 17:06:47 +02:00
Bjorn Pettersson a3ca7dd0ab [NewPM][test] Use -passes syntax in Mem2Reg lit tests
The legacy PM is deprecated, so use the new PM syntax in lit tests
verifying the mem2reg pass.
2021-10-20 17:06:47 +02:00
Bjorn Pettersson e9320b1a95 [NewPM][test] Only use -passes syntax in Scalarizer lit tests
With legacy PM being deprecated it should be enough to verify the
scalarizer pass using the new-PM syntax when invoking opt.
2021-10-20 15:16:18 +02:00
Bjorn Pettersson 5e4dbd7a2f [NewPM][test] Use -passes syntax in VectorCombine lit tests
The legacy PM is deprecated, so use the new PM syntax in lit tests
running the vector-combine pass.
2021-10-20 15:16:17 +02:00
Bjorn Pettersson 57bd67abfc [NewPM][test] Use -passes syntax in SpeculativeExecution lit tests
The legacy PM is deprecated, so use the new PM syntax in lit tests
running the speculative-execution pass.
2021-10-20 15:16:17 +02:00
Bjorn Pettersson a413663d8f [NewPM][test] Avoid using -enable-new-pm=1 since -passes implies new PM 2021-10-20 15:16:17 +02:00
Simon Pilgrim a3c05982ac [SLP][X86] Improve SLP tests for division/multiplication by +/- pow2
Add PR51436 test as well as some basic multiply tests, and include SSE2 division coverage
2021-10-20 13:30:27 +01:00
Evgeniy Brevnov 269f563a2b [NARY-REASSOCIATE] Fix infinite recursion optimizing min/max
To guarantee convergence of the algorithm, each optimization step should decrease the number of instructions when the IR is modified. This property does not hold in this test case. The problem is that the SCEV expander may do "unexpected" reassociation, which results in the creation of new min/max chains and the introduction of extra instructions. As a result, on each step we indefinitely optimize back and forth.

The solution is to restrict the SCEV expander from performing uncontrolled reassociations by means of "Unknown" expressions.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D112060
2021-10-20 14:23:03 +07:00
Philip Reames 0836a1059d Extend transform introduced in D111896 to multiple exits
This is trivial.  It was left out of the original review only because we had multiple copies of the same code in review at the same time, and keeping them in sync was easiest if the structure was kept in sync.
2021-10-19 12:12:19 -07:00
Philip Reames fca0218875 [indvars] Canonicalize exit conditions to unsigned using range info
This patch duplicates a bit of logic we apply to comparisons encountered during the IV users walk to conditions which feed exit conditions. Why? simplifyAndExtend has a very limited list of users it walks. In particular, in the examples it stops at the zext and never visits the icmp. (Because we can't fold the zext to an addrec yet in SCEV.) Being willing to visit when we haven't simplified regresses multiple tests (seemingly because of less optimal results when computing trip counts).

Note that this can be trivially extended to multiple exiting blocks. I'm leaving that to a future patch (solely to cut down on the number of versions of the same code in review at once.)
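
A hedged sketch of the kind of exit condition this targets (names and constants are illustrative); because the zext result is known non-negative from range info, the signed exit compare can be canonicalized to an unsigned one:

  define void @sketch(i16* %p, i16 %start) {
  entry:
    br label %loop
  loop:
    %iv = phi i16 [ %start, %entry ], [ %iv.next, %loop ]
    %iv.next = add nuw nsw i16 %iv, 1
    store i16 %iv, i16* %p
    %wide = zext i16 %iv.next to i32
    %cmp = icmp slt i32 %wide, 100     ; known non-negative, so slt can become ult
    br i1 %cmp, label %loop, label %exit
  exit:
    ret void
  }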

Differential Revision: https://reviews.llvm.org/D111896
2021-10-19 11:49:12 -07:00
Anna Thomas 9403514e76 [LoopPredication] Calculate profitability without BPI
Using BPI within loop predication is non-trivial because BPI is only
preserved lossily in the loop pass manager (one fix for an issue exposed
by lossy preservation is up for review at D111448). However, since loop
predication is only used in downstream pipelines, it is hard to keep BPI
from breaking in an incomplete state as upstream changes to BPI land.
Also, correctly preserving BPI for all loop passes is a non-trivial
undertaking (D110438 does this lossily), while the benefit of using it
in loop predication isn't clear.

In this patch, we rely on profile metadata to get almost the same benefit
as BPI, without actually using the complete heuristics provided by BPI.
This avoids the compile time explosion we tried to fix with D110438 and
also avoids fragile bugs because BPI can be lossy in loop passes
(D111448).
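
For reference, a hedged sketch of the kind of profile metadata consulted instead of BPI (names and weights are illustrative):

  define void @sketch(i32 %n) {
  entry:
    br label %loop
  loop:
    %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
    %iv.next = add nuw nsw i32 %iv, 1
    %cont = icmp ult i32 %iv.next, %n
    br i1 %cont, label %loop, label %exit, !prof !0
  exit:
    ret void
  }
  !0 = !{!"branch_weights", i32 1000, i32 1}   ; backedge overwhelmingly taken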

Reviewed-By: asbirlea, apilipenko
Differential Revision: https://reviews.llvm.org/D111668
2021-10-19 14:24:04 -04:00
Arthur Eubanks 15fefcb9eb [opt] Directly translate -O# to -passes='default<O#>'
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.

Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>'` but disallow anything more than just -O#.

Tests need to be updated to not use `opt -O# -pass-name`.
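
Illustrative RUN lines (pass names and inputs here are placeholders) showing the expected usage after this change:

  ; Explicit new-PM pipeline:
  ; RUN: opt -passes='default<O2>' -S %s
  ; Still-accepted shorthand for the same pipeline:
  ; RUN: opt -O2 -S %s
  ; No longer accepted (mixing -O# with a legacy pass name):
  ;      opt -O2 -instcombine -S %s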

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D112036
2021-10-18 16:48:10 -07:00
modimo 41f814589f [InlineAdvisor][NFC] Fix tests added in D110658 V2
On Windows there's an *.exe suffix to opt that isn't present on Linux.
Remove the check for "opt" in the string.
2021-10-18 15:27:33 -07:00
modimo 2786dc1096 [InlineAdvisor][NFC] Fix tests added in D110658 on Windows
Windows outputs "is a directory" rather than "Is a directory" on error, compared to Linux.
2021-10-18 14:21:01 -07:00
Arthur Eubanks ecd25edfc5 [InlineCost] Add empty line between call sites when printing inline costs 2021-10-18 13:56:48 -07:00
Alexey Bataev b9cfa016da [SLP]Fix emission of the shrink shuffles.
Need to follow the order of the reused scalars from the
ReuseShuffleIndices mask rather than rely on the natural order.

Differential Revision: https://reviews.llvm.org/D111898
2021-10-18 13:13:12 -07:00
modimo 313c657fce [InlineAdvisor] Add -inline-replay-scope=<Function|Module> to control replay scope
The goal is to allow grafting an inline tree from Clang or GCC into a new compilation without affecting other functions. For GCC, we're doing this by extracting the inline tree from DWARF information and generating the equivalent remarks.

This allows easier side-by-side asm analysis and gives a quick way to see whether a particular inlining setup provides benefits by itself.

Testing:
ninja check-all

Reviewed By: wenlei, mtrofin

Differential Revision: https://reviews.llvm.org/D110658
2021-10-18 13:08:39 -07:00
Florian Hahn 74c4d44d47
[LV] Update test that was missed in e844f05397. 2021-10-18 18:23:00 +01:00
Florian Hahn e844f05397
[LoopUtils] Simplify addRuntimeCheck to return a single value.
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.

The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is no reason to track FirstInst and return
it.

Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.

Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
2021-10-18 18:03:09 +01:00
Peter Waller ee7ca88a3e [InstCombine][DebugInfo] Remove superfluous assertion, add test [2/2]
Accidentally committed a prior version of this patch. This is the
correct version.
2021-10-18 13:48:23 +00:00
Peter Waller c4603a8a43 [InstCombine][DebugInfo] Remove superfluous assertion, add test
When this code was added, an unnecessary assertion slipped in which we
now hit in real code.

Add a test to defend against it firing again.
2021-10-18 11:00:01 +00:00
David Green 052b77e49f [InstCombine] Add some extra tests for truncated saturates. NFC 2021-10-17 13:21:28 +01:00
Ben Shi d0dbc991c0 Revert "[AArch64] Optimize add/sub with immediate"
This reverts commit 9bf6bef995.
2021-10-16 22:17:18 +00:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Simon Pilgrim d5f5121ea6 [LV][X86] Add PR47437 test case 2021-10-16 13:40:54 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's the `TTI::prefersVectorizedAddressing()` hook, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally cannot deal with vectors of addresses;
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Ben Shi 9bf6bef995 [AArch64] Optimize add/sub with immediate
Optimize ([add|sub] r, imm) -> ([ADD|SUB] ([ADD|SUB] r, #imm0, lsl #12), #imm1),
if imm == (imm0<<12)+imm1, and both imm0 and imm1 are non-zero 12-bit unsigned
integers.

Optimize ([add|sub] r, imm) -> ([SUB|ADD] ([SUB|ADD] r, #imm0, lsl #12), #imm1),
if imm == -(imm0<<12)-imm1, and both imm0 and imm1 are non-zero 12-bit unsigned
integers.

Reviewed By: jaykang10, dmgreen

Differential Revision: https://reviews.llvm.org/D111034
2021-10-16 08:50:39 +00:00
Sanjay Patel a49f5386ce [InstCombine] generalize fold for mask-with-signbit-splat, part 2
This removes an over-specified fold. The more general transform
was added with:
727e642e97

There's a difference on an existing test that shows a potentially
unnecessary use limit on an icmp fold.

That fold is in InstCombinerImpl::foldICmpSubConstant(), and IIRC
there was some back-and-forth on it and similar folds because they
could cause analysis/passes (SCEV, LSR?) to miss optimizations.

Differential Revision: https://reviews.llvm.org/D111410
2021-10-15 17:11:29 -04:00
Sanjay Patel 727e642e97 [InstCombine] generalize fold for mask-with-signbit-splat
(iN X s>> (N-1)) & Y --> (X < 0) ? Y : 0

https://alive2.llvm.org/ce/z/qeYhdz
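
In concrete IR, a sketch of the pattern for i32 (using the src/tgt naming convention seen elsewhere in this log):

  define i32 @src(i32 %x, i32 %y) {
    %splat = ashr i32 %x, 31       ; all-ones if %x is negative, else zero
    %r = and i32 %splat, %y
    ret i32 %r
  }
  define i32 @tgt(i32 %x, i32 %y) {
    %isneg = icmp slt i32 %x, 0
    %r = select i1 %isneg, i32 %y, i32 0
    ret i32 %r
  }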

I was looking at a missing abs() transform and found my way to this
generalization of an existing fold that was added with D67799.
As discussed in that review, we want to make sure codegen handles
this difference well, and for all of the targets/types that I
spot-checked, it looks good.

I am leaving the existing fold in place in this commit because
it covers a potentially missing icmp fold, but I plan to remove
that as a follow-up commit as suggested during review.

Differential Revision: https://reviews.llvm.org/D111410
2021-10-15 16:25:48 -04:00
Florian Hahn 4a1d63d7d0
[VectorCombine] Add option to only run scalarization transforms.
This patch adds a pass option to only run transforms that scalarize
vector operations and do not create new vector instructions.

When running VectorCombine early in the pipeline, introducing new vector
operations can have negative effects, like blocking loop or SLP
vectorization. To avoid regressions, restrict the early VectorCombine
run (when using -enable-matrix) to only perform scalarization and not
introduce new vector operations.

This is done as option to the pass directly, which is then set when
adding the pass to the pipeline. This is done for the new pass manager
only.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D111800
2021-10-15 20:35:58 +01:00
Alexey Bataev 1312aff768 [SLP]Add a test for shrink shuffle after reorder, NFC. 2021-10-15 09:42:43 -07:00
Sanjay Patel 8cd9c351a1 [VectorCombine] add tests for shuffle of binops; NFC 2021-10-15 11:22:06 -04:00
Anton Afanasyev 7b07c01351 [InstCombine] Support arbitrary const shift amount for `lshr (sext i1 ...)`
Add the lshr (sext i1 X to iN), C --> select (X, -1 >> C, 0) case. This expands
the C == N-1 case to arbitrary C.
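
A hypothetical i8 instance of the new fold (with C = 3):

  define i8 @src(i1 %x) {
    %s = sext i1 %x to i8            ; 0xFF when %x is true, 0 otherwise
    %r = lshr i8 %s, 3
    ret i8 %r
  }
  define i8 @tgt(i1 %x) {
    %r = select i1 %x, i8 31, i8 0   ; 31 == 0xFF lshr 3
    ret i8 %r
  }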

Fixes PR52078.

Reviewed By: spatel, RKSimon, lebedev.ri

Differential Revision: https://reviews.llvm.org/D111330
2021-10-15 13:39:13 +03:00
Anton Afanasyev e23351cdc9 [Test][InstCombine] Precommit tests for PR52078 2021-10-15 13:38:40 +03:00
Max Kazantsev 90ae538cab [SCEV] Prove implication of predicates to their sign-flipped counterparts
This patch teaches SCEV two implication rules:

  x <u y && y >=s 0 --> x <s y,
  x <s y && y <s 0 --> x <u y.

And all equivalents with signs/parts swapped.

Differential Revision: https://reviews.llvm.org/D110517
Reviewed By: nikic
2021-10-15 11:49:18 +07:00
Artur Pilipenko 3f96f7b30c Fix getInlineCost with ComputeFullInlineCost enabled
Fix a bug when getInlineCost incorrectly returns a
cost/threshold pair instead of an explicit never inline.

Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D111687
2021-10-14 17:41:41 -07:00
Alexey Bataev 414abff1fe [SLP]Fix PR52090: clang crashes: Assertion `Index < Length && "Invalid index!"' failed.
Need to check that either Idx is UndefMaskElem and value is UndefValue
or Idx is valid and value is the same as the scalar value in the node.

Differential Revision: https://reviews.llvm.org/D111802
2021-10-14 14:26:29 -07:00
Philip Reames 8b31f07cdf [tests] Add indvars tests showing missing transforms with small IVs
This shows the transform side of D109457, but also lets us try other approaches to the same problem.  The common trend to all is that we need to explicitly reason about UB to rule out the possibility of infinite loops.
2021-10-14 13:28:18 -07:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, a not-fully-interleaved load is when not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for a not-fully-interleaved group should be no greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)
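
In IR terms, the "indices 0uuu, vf 2" case above corresponds roughly to a sketch like this (names are illustrative):

  define <2 x i64> @sketch(<8 x i64>* %p) {
    %wide = load <8 x i64>, <8 x i64>* %p      ; stride 4 x VF 2 = 8 elements
    %v0 = shufflevector <8 x i64> %wide, <8 x i64> poison, <2 x i32> <i32 0, i32 4>
    ret <2 x i64> %v0                          ; only the 0'th vector is demanded
  }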

As we have established over the course of the last ~70 patches (wow),
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus;
it is usually almost an order of magnitude overestimation,
so I would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then I'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Philip Reames 7f3861cfdb autogen tests for ease of update 2021-10-14 13:04:22 -07:00
Kevin P. Neal 727a891ec8 [FPEnv][InstSimplify] Fold fadd X, 0 ==> X, when we know X is not -0
Currently the fadd optimizations in InstSimplify don't know how to do this
NoSignedZeros "X + 0.0 ==> X" fold when using the constrained intrinsics.
This adds the support.
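
A hedged sketch of the kind of call this now covers (the nsz flag carries the NoSignedZeros assumption; the exact rounding/exception metadata handled may differ):

  define double @fold_to_x(double %x) #0 {
    %r = call nsz double @llvm.experimental.constrained.fadd.f64(double %x, double 0.0, metadata !"round.tonearest", metadata !"fpexcept.ignore") #0
    ret double %r        ; with nsz this can simplify to: ret double %x
  }
  declare double @llvm.experimental.constrained.fadd.f64(double, double, metadata, metadata)
  attributes #0 = { strictfp }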

This review is derived from D106362 with some improvements from D107285
and is a follow-on to D111085.

Differential Revision: https://reviews.llvm.org/D111450
2021-10-14 12:32:45 -04:00
Florian Hahn 094faa5fca
[VectorCombine] Add test showing issue when running VectorCombine early.
Running -vector-combine early can introduce new vector operations,
blocking loop/SLP vectorization. The added test case could be better
optimized by the SLPVectorizer if no new vector operations are added
early.
2021-10-14 14:03:02 +01:00
Andrew Savonichev dc8a41de34 [ARM] Simplify address calculation for NEON load/store
The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:

    %0 = getelementptr float, float* %base, i32 4
    %1 = bitcast float* %0 to <4 x float>*
    %2 = load <4 x float>, <4 x float>* %1
    ...
    %n1 = getelementptr float, float* %base, i32 N
    %n2 = bitcast float* %n1 to <4 x float>*
    %n3 = load <4 x float>, <4 x float>* %n2

For AArch64 the compiler generates a sequence of LDR Qt, [Xn, #16].
However, 32-bit NEON VLD1/VST1 lack the [Wn, #imm] addressing mode, so
the address is computed before every ld/st instruction:

    add r2, r0, #32
    add r0, r0, #16
    vld1.32 {d18, d19}, [r2]
    vld1.32 {d22, d23}, [r0]

This can be improved by computing address for the first load, and then
using a post-indexed form of VLD1/VST1 to load the rest:

    add r0, r0, #16
    vld1.32 {d18, d19}, [r0]!
    vld1.32 {d22, d23}, [r0]

In order to do that, the patch adds more patterns to DAGCombine:

  - (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1
    and inc2 are constants.

  - (or ptr inc) is now recognized as a pointer increment if ptr is
    sufficiently aligned.

In addition to that, we now search for all possible base updates and
then pick the best one.

Differential Revision: https://reviews.llvm.org/D108988
2021-10-14 15:23:10 +03:00
Shoaib Meenai 6404f4b5af [InstCombine] Remove attributes after hoisting free above null check
If the parameter had been annotated as nonnull because of the null
check, we want to remove the attribute, since it may no longer apply and
could result in miscompiles if left. Similarly, we also want to remove
undef-implying attributes, since they may not apply anymore either.
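
A hedged IR sketch of the situation (names are made up): once the `free` is hoisted above the null check, the call-site `nonnull` no longer holds and must be dropped:

  define void @sketch(i8* %p) {
    %is.null = icmp eq i8* %p, null
    br i1 %is.null, label %exit, label %do.free
  do.free:
    call void @free(i8* nonnull %p)    ; nonnull is justified only under the null check
    br label %exit
  exit:
    ret void
  }
  declare void @free(i8*)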

Fixes PR52110.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D111515
2021-10-13 15:34:56 -07:00
Mircea Trofin 6c76d01011 [mlgo][aot] require the model is autogenerated for test determinism
The tests that exercise the 'release' mode, where the model is AOT-ed,
check the output has certain properties, to validate that, indeed, a
different policy from the default one was exercised. For determinism, we
can't reliably check that output for an arbitrary learned policy, since
it could be that policy happens to mimic the default one in that
particular case.

This patch adds a requirement that those tests run only when the model
is autogenerated (e.g. on build bots).

Differential Revision: https://reviews.llvm.org/D111747
2021-10-13 14:02:41 -07:00
Philip Reames 47d10b25f8 [instcombine] PRE freeze to only potentially poison/undef operand of phi
This extends the foldOpIntoPhi code used when visiting a freeze user of a phi to allow any non-undef/poison operand as opposed to only non-undef/poison constants.  This lets us hoist a freeze in the increment of an IV into the preheader in many cases.

Differential Revision: https://reviews.llvm.org/D111744
2021-10-13 13:55:54 -07:00
Roman Lebedev a8a64eaafc
[NFC][X86][LV] Autogenerate checklines in cost-model.ll to simplify further updates 2021-10-13 22:47:43 +03:00
Roman Lebedev 18eef13dad
[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()`
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of a sequence of `insertelement`/`extractelement`.
  If we are operating not on the XMM (but YMM/ZMM),
  this widely overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  a scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
Sjoerd Meijer 67a58fa3a6 [FuncSpec] Don't run the solver if there's nothing to do
Even if there are no interesting functions, the SCCP solver would still run
before bailing. Now bail out earlier and avoid running the solver for nothing.

Differential Revision: https://reviews.llvm.org/D111645
2021-10-13 19:05:19 +01:00
Arthur Eubanks 3628bb7436 Make various assume bundle data structures use uint64_t
Following D110451, we need to make sure to support 64 bit values.
2021-10-13 10:38:41 -07:00
Philip Reames 24c9016574 [instcombine] propagate single use freeze(gep inbounds X)
This is a follow-on for D111675 which implements the gep case. I'd originally left it out because I was hoping to actually implement the inrange TODO, but after a bit of staring at the code, decided to leave it as is since it doesn't affect this use case (i.e. instcombine requires the op to freeze to be an instruction).

Differential Revision: https://reviews.llvm.org/D111691
2021-10-13 09:25:00 -07:00
Sanjay Patel 905d170803 [InstCombine] allow matching vector splat constants in foldLogOpOfMaskedICmps()
This is NFC-intended for scalar code. There are still unnecessary
m_ConstantInt restrictions in surrounding code, so this is not a
complete fix.

This prevents regressions seen with a planned follow-on to D111410.
2021-10-13 10:15:26 -04:00
Sanjay Patel d67022fba9 [InstCombine] add vector splat tests for foldLogOpOfMaskedICmps(); NFC
There's a substantial pile of scalar tests for transforms that
depend on this code, but zero vector coverage. This patch adds
a vector test next to the first scalar test in each file that
is affected by foldLogOpOfMaskedICmps.

The code that handles these transforms is artificially limited
from working with vector splat constants.
2021-10-13 09:53:59 -04:00
Dávid Bolvanský 93fd30a163 [NFC] Added test for PR50339 2021-10-13 12:15:57 +02:00
Dávid Bolvanský 005b715b54 [NFC] Added test for PR49927 2021-10-13 12:15:57 +02:00
Philip Reames 84fae3bce8 [tests] Add coverage for follow ons to D111675 2021-10-12 20:37:30 -07:00
Philip Reames 4c5702cb12 Fix bug introduced with 6f34839 (poison flags on floating point ops)
The newly introduced API for checking whether poison comes solely from flags which can be dropped was out of sync.  This was noticed by a reviewer post commit.

For the moment, disable the floating point flags.  In a follow-up change, I plan to add support in dropPoisonGeneratingFlags, but that deserves to be a change of its own.
2021-10-12 20:25:00 -07:00
Philip Reames 6f34839407 [instcombine] propagate freeze through single use poison producing flag instruction
If we have an instruction which produces poison only when flags are specified on the instruction, then we know that freezing the operands and dropping flags is equivalent to freezing the result. If we know those flags don't result in any undefined behavior being executed, then there's no point in preserving the flags as we gain no knowledge by having them.

This patch extends the existing propagation logic which sinks freeze to single potential non-poison operands to allow dropping of flags when we know the freeze is the sole use of the instruction with poison flags.

The main value is that we tend to sink freezes towards the phi in IV cycles where the incoming value to the phi is the freeze of an IV increment. This will in turn (in a future patch) let us fold the freeze through the phi into the loop preheader. Motivated by eliminating the need for CanonicalizeFreezeInLoops for the clearly profitable cases from the onephi.ll test case in the test directory.
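
A minimal sketch of the single-use case described above (IR names are illustrative):

  define i32 @before(i32 %x) {
    %inc = add nuw nsw i32 %x, 1
    %f = freeze i32 %inc            ; %inc's only use
    ret i32 %f
  }
  define i32 @after(i32 %x) {
    %fx = freeze i32 %x
    %inc = add i32 %fx, 1           ; flags dropped, freeze sunk to the operand
    ret i32 %inc
  }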

Differential Revision: https://reviews.llvm.org/D111675
2021-10-12 13:52:41 -07:00
Philip Reames c24b2ad0e2 Add extra tests for D111675 2021-10-12 13:37:13 -07:00
Philip Reames 357b8d7ddb [tests] Add coverage for cases we can drop flags to propagate freeze without cost 2021-10-12 12:30:46 -07:00
Ayal Zaks 15692fd6b5 [LV] Fix 2nd crash for reverse interleaved groups under mask/fold-tail.
This patch fixes another crash revealed by PR51614:
when *deciding* to vectorize with masked interleave groups, check if the access
is reverse (which is currently not supported).

Differential Revision: https://reviews.llvm.org/D108900
2021-10-12 21:44:42 +03:00
Kevin P. Neal bdf6ba2d30 [FPEnv][InstSimplify] Precommit tests: Enable more folds for constrained fsub
Precommit tests for D107285 as requested. TODO notes left at individual
functions also as requested.
2021-10-12 13:47:28 -04:00
Mircea Trofin ea4a6c8426 [Inline] Make sure the InlineAdvisor is correctly cleared.
If another inlining session came after a ModuleInlinerWrapperPass, the
advisor analysis would still be cached, but its Result would be cleared.
We need to clear both.

This addresses PR52118

Differential Revision: https://reviews.llvm.org/D111586
2021-10-12 10:42:41 -07:00
Sanjay Patel 7a2949647a [InstCombine] propagate no-wrap flag through select-of-mul fold
This may not be obvious, but Alive2 agrees:
https://alive2.llvm.org/ce/z/Ld9qNT

If the mul has "nsw", then -1 * INT_MIN is poison, so the
negate can also have "nsw" because 0 - INT_MIN is poison.

If the mul has "nuw", then that means the "OtherOp" can only
be 0 or 1 (anything else multiplied by 0xfff... would wrap).
So the replacement negate must be "nsw" because it is either
"0-0" or "0-1".

This is another regression noticed with a planned follow-up
to D111410.
2021-10-12 12:57:20 -04:00
Sanjay Patel fae7d6886e [InstCombine] add tests with nsw/nuw for mul-of-select; NFC 2021-10-12 12:57:20 -04:00
Hongtao Yu 098a0d8fbc [CSSPGO] Unblock optimizations with pseudo probe instrumentation part 3.
This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.

Unlike DbgIntrinsics, the PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:
- Flipped the default param of the getFirstNonPHIOrDbg API to skip pseudo probes
- Unblocked CSE by avoiding pseudo probes clobbering memory SSA
- Unblocked induction variable simplification
- Allow empty loop deletion by treating the probe intrinsic as droppable (isDroppable)
- Some refactoring.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D110847
2021-10-12 09:44:12 -07:00
Kerry McLaughlin 1439ef1a3f [LoopVectorize] Classify pointer induction updates as scalar only if they have one use
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there is another user of the instruction
other than the Phi node which needs to be widened.

This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.

Reviewed By: david-arm, fhahn

Differential Revision: https://reviews.llvm.org/D111294
2021-10-12 13:24:49 +01:00
Sjoerd Meijer fc0fa85171 [FuncSpec] Allow ConstExprs that are function pointers
This is a follow-up to D110529, which disallowed constexprs. That change
introduced a regression as it also disallowed constexprs that are function
pointers, which is actually one of the motivating use cases that we do want to
support.

Differential Revision: https://reviews.llvm.org/D111567
2021-10-12 11:44:26 +01:00
Florian Hahn cd0ba9dc58
[LoopPeel] Peel if it turns invariant loads dereferenceable.
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that

    1. is feeding an exit condition,
    2. dominates the latch,
    3. is not already known to be dereferenceable,
    4. and has a loop invariant address.

If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.

This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
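
A hedged IR sketch of such a loop (names invented): the load from %len.ptr has a loop-invariant address, feeds an exit condition, dominates the latch, and the only non-latch exit is terminated with unreachable, so peeling one iteration makes it known-dereferenceable:

  define i32 @sketch(i32* %len.ptr, i32 %n) {
  entry:
    br label %loop
  loop:
    %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
    %len = load i32, i32* %len.ptr       ; loop-invariant address
    %in.bounds = icmp ult i32 %iv, %len
    br i1 %in.bounds, label %latch, label %oob
  oob:
    call void @llvm.trap()
    unreachable
  latch:
    %iv.next = add nuw nsw i32 %iv, 1
    %done = icmp eq i32 %iv.next, %n
    br i1 %done, label %exit, label %loop
  exit:
    ret i32 %len
  }
  declare void @llvm.trap()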

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D108114
2021-10-12 11:42:28 +01:00
Sanjay Patel 59441c7329 [InstCombine] fold signbit check of X | (X -1)
There may be some other patterns like this or a generalization,
but this is an example that I noticed would definitely regress
with a planned follow-up to D111410.

https://alive2.llvm.org/ce/z/GVpQDb
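
One concrete form of this kind of signbit check (a hedged sketch, not necessarily the exact fold implemented): (X | (X - 1)) s< 0 holds exactly when X s< 1:

  define i1 @src(i8 %x) {
    %dec = add i8 %x, -1
    %or = or i8 %x, %dec
    %r = icmp slt i8 %or, 0
    ret i1 %r
  }
  define i1 @tgt(i8 %x) {
    %r = icmp slt i8 %x, 1
    ret i1 %r
  }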
2021-10-11 16:14:13 -04:00
Sanjay Patel 518ec39de7 [InstCombine] add signbit check for or'd operands; NFC 2021-10-11 16:14:13 -04:00
Arthur Eubanks fbddf22ef7 [SCCP] Properly report changes when changing a pointer argument
Fixes one of the issues in PR51946.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D111277
2021-10-11 13:12:08 -07:00
Florian Hahn ab33427c86
[VPlan] Print live-in backedge taken count as part of plan.
At the moment, a VPValue is created for the backedge-taken count, which
is used by some recipes. To make it easier to identify the operands of
recipes using the backedge-taken count, print it at the beginning of the
VPlan if it is used.

Reviewed By: a.elovikov

Differential Revision: https://reviews.llvm.org/D111298
2021-10-11 20:13:01 +01:00
hyeongyu kim 0aeb37324d [SimpleLoopUnswitch] Re-fix introduction of UB when hoisted condition may be undef or poison
https://bugs.llvm.org/show_bug.cgi?id=27506
https://bugs.llvm.org/show_bug.cgi?id=31652
https://bugs.llvm.org/show_bug.cgi?id=51043

Problems with SimpleLoopUnswitch cause the bug reports above.

```
while (...) {
  if (C) { A }
  else   { B }
}
Into:

C' = freeze(C)
if (C') {
  while (...) { A }
} else {
  while (...) { B }
}
```
This problem can be solved by adding a freeze on hoisted branches (the transform above) and was solved by D29015.
However, D29015 has now been reverted due to a performance regression (2b5a897651).

It is not the first time that an added freeze has caused a performance regression.
SimplifyCFG also had a problem with UB caused by branching-on-undef, which was solved by adding a freeze to the branching condition (D104569).
A performance regression occurred with D104569, and patches such as D105344 and D105392 were written to minimize it.

This patch corrects SimpleLoopUnswitch the way D104569 handled SimplifyCFG, while minimizing performance loss by introducing patches like D105344 and D105392. (This patch was rebased with the author's permission.)

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D106041
2021-10-12 01:02:09 +09:00
David Sherwood 26b7d9d622 [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-11 09:41:38 +01:00
Clement Courbet 6aaf1e7ea9 [LoopIdiom] Fix store size SCEV type.
We were using the type of the loop back edge count to represent the
store size. This failed for small loop counts (e.g. in the added test,
the loop count was an i2).

Use the index type instead.

Fixes PR52104.

Differential Revision: https://reviews.llvm.org/D111401
2021-10-11 09:39:06 +02:00
Dawid Jurczak 9e65929a8e [DSE] Re-enable calloc transformation with extra care (PR25892)
Transformation from malloc+memset to calloc is always correct and in many situations
it brings significant observable benefits in terms of execution speed and memory consumption [1][2].
Unfortunately there are cases when producing calloc causes performance drops [3].
As discussed here: https://reviews.llvm.org/D103009 it's possible to differentiate between those 2 scenarios.
If the optimizer is able to prove that after the malloc call it's _very_ likely to reach the memset branch, then after
calloc emission we shouldn't observe any performance hits. Therefore finding a "null pointer check" pattern
before the memset basic block sounds like good justification for performing the transformation.
That method was also already suggested by GCC folks [4]. The main reason for the change is that, for now,
to be safe we check for a post-dominance relation, which is a way too conservative approach, making the transformation
"almost" disabled in practice. This patch enables the transformation again, but with extra care.

[1] https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc
[2] https://vorpus.org/blog/why-does-calloc-exist/
[3] http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83022

Differential Revision: https://reviews.llvm.org/D110021
2021-10-10 21:47:14 +02:00
Sanjay Patel cbd8041b0b [InstCombine] add tests for (X - Y) == 0; NFC 2021-10-10 11:13:46 -04:00
Sanjay Patel da210f5d34 [InstCombine] canonicalize "(C2 - Y) > C" as (Y + ~C2) < ~C
The test diffs show that we have better analysis/folds for 'add'
(although we should at least have the simplifications
independently, so we don't have the one-use restriction).

This is related to solving regressions that would appear in
transforms related to D111410, and that is part of a series
of enhancements that may eventually help solve PR34047.

https://alive2.llvm.org/ce/z/3tB9KG

  define i1 @src(i8 %x, i8 %C, i8 %C2) {
    %sub = sub nuw i8 %C2, %x
    %r = icmp slt i8 %sub, %C
    ret i1 %r
  }

  define i1 @tgt(i8 %x, i8 %C, i8 %C2) {
    %Cnot = xor i8 %C, -1
    %C2not = xor i8 %C2, -1
    %add = add nuw i8 %x, %C2not
    %r = icmp sgt i8 %add, %Cnot
    ret i1 %r
  }
2021-10-10 11:06:49 -04:00
Sanjay Patel c00cab878a [InstCombine] add test for or-of-icmps; NFC 2021-10-10 11:06:49 -04:00
Sanjay Patel acafde09a3 [InstCombine] enhance icmp with sub folds
There were 2 related but over-specified folds for:
C1 - X == C

One allowed multi-use but was limited to equal constants.
The other allowed different constants but disallowed multi-use.

This combines the 2 folds into a more general match.
The test diffs show the multi-use cases that were falling
through the cracks.

https://alive2.llvm.org/ce/z/4_hEt2

  define i1 @src(i8 %x, i8 %subC, i8 %C) {
    %s = sub i8 %subC, %x
    %r = icmp eq i8 %s, %C
    ret i1 %r
  }

  define i1 @tgt(i8 %x, i8 %subC, i8 %C) {
    %newC = sub i8 %subC, %C
    %isneg = icmp eq i8 %x, %newC
    ret i1 %isneg
  }
2021-10-09 11:39:49 -04:00
Sanjay Patel cd76fa79b0 [InstCombine] add tests for icmp of negated op; NFC 2021-10-09 11:39:49 -04:00
Sanjay Patel 38e3b30bd6 [InstCombine] add tests for (iN X s>> N-1) | Y; NFC
These are for a sibling fold suggested in D111410.
The tests correspond to the 'and' tests added with:
a35673f4cf
2021-10-09 11:39:49 -04:00
David Green adec922361 [AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.

The idea is that in-order CPUs require the most help in instruction
scheduling, whereas out-of-order CPUs can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.

On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.

When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.

A lot of existing tests have been updated. This is a summary of the important
differences:
 - Most changes are the same instructions in a different order.
 - Sometimes this leads to very minor inefficiencies, such as requiring
   an extra mov to move variables into r0/v0 for the return value of a test
   function.
 - misched-fusion.ll was no longer fusing the pairs of instructions it
   should, as per D110561. I've changed the schedule used in the test
   for now.
 - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
   the different latencies. This seems fine to me.
 - Some SVE tests do not always remove movprfx where they did before due
   to different register allocation giving different destructive forms.
 - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
   produce two LDR where they previously produced an LDP due to
   store-pair-suppress kicking in.
 - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
 - Some tests such as arm64-neon-mul-div.ll and
   ragreedy-local-interval-cost.ll have more, less or just different
   spilling.
 - In aarch64_generated_funcs.ll.generated.expected one part of the
   function is no longer outlined. Interestingly, if I switch this to use
   any other schedule even less is outlined.

Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.

Differential Revision: https://reviews.llvm.org/D110830
2021-10-09 15:58:31 +01:00
Max Kazantsev 4c0da23663 [LoopDeletion] Support selects when symbolically evaluating 1st iteration
Adds support for selects for which we know the value on the 1st iteration.

Differential Revision: https://reviews.llvm.org/D104111
Reviewed By: nikic
2021-10-09 14:47:44 +07:00
Max Kazantsev 49ca01047f [Test] Add test justifying revert of D110922
Test by Arthur Eubanks!
2021-10-09 14:32:46 +07:00
Qiu Chaofan f45d5e71d3 [APFloat] Set size of PPCDoubleDouble to 128
566690b0 uses size information in float semantics, but PPCDoubleDouble
left them empty.

As a follow-up, we can consider removing PPCDoubleDoubleLegacy and filling in
the other fields in the future.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D111398
2021-10-09 10:12:10 +08:00
Qiu Chaofan 573531fb1f Fix typo of colon to semicolon in lit tests 2021-10-09 10:03:50 +08:00
Nick Desaulniers 9697f93587 [InlineCost] model calls to llvm.is.constant* more carefully
llvm.is.constant* intrinsics are evaluated to 0 or 1 integral values.

A common use case for llvm.is.constant comes from the higher level
__builtin_constant_p. A common usage pattern of __builtin_constant_p in
the Linux kernel is:

    void foo (int bar) {
      if (__builtin_constant_p(bar)) {
        // lots of code that will fold away to a constant.
      } else {
        // a little bit of code, usually a libcall.
      }
    }

A minor issue in InlineCost calculations is that when `bar` is _not_ Constant
and still will not be after inlining, we don't discount the true branch,
and the inline cost of `foo` ends up being the cost of both branches
together, rather than just the false branch.

This leads to code like the above where inlining will not help prove bar
Constant, but it still would be beneficial to inline foo, because the
"true" branch is irrelevant from a cost perspective.

For example, IPSCCP can sink a passed constant argument to foo:

    const int x = 42;
    void bar (void) { foo(x); }

This improves our inlining decisions, and fixes a few head-scratching
cases where the disassembly shows a relatively small `foo` not inlined
into a lone caller.

We could further improve this modeling by tracking whether the argument
to llvm.is.constant* is a parameter of the function, and if inlining
would allow that parameter to become Constant. This idea is noted in a
FIXME comment.

Link: https://github.com/ClangBuiltLinux/linux/issues/1302

Reviewed By: kazu

Differential Revision: https://reviews.llvm.org/D111272
2021-10-08 15:27:30 -07:00