This patch adds support for removing stores that write the same value
as earlier memsets.
It uses isOverwrite to check that a memset completely overwrites a later
store. The candidate store must store the same bytewise value as the
byte stored by the memset.
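A minimal sketch of the kind of redundant store this targets (hypothetical IR; the pointer, size, offset, and byte value are chosen only for illustration):
```
call void @llvm.memset.p0i8.i64(i8* %p, i8 0, i64 16, i1 false)
%q = getelementptr inbounds i8, i8* %p, i64 4
store i8 0, i8* %q   ; removable: the earlier memset already wrote 0 at this location
```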
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D112321
Adds the following switches:
1. --sample-profile-inline-replay-fallback/--cgscc-inline-replay-fallback: controls what the replay advisor does for inline sites that are not present in the replay. Options are:
   1. Original: defers to original advisor
   2. AlwaysInline: inline all sites not in replay
   3. NeverInline: inline no sites not in replay
2. --sample-profile-inline-replay-format/--cgscc-inline-replay-format: controls what format should be generated to match against the replay remarks. Options are:
   1. Line
   2. LineColumn
   3. LineDiscriminator
   4. LineColumnDiscriminator
Adds support for negative inlining decisions. These are denoted by "will not be inlined into" as compared to the positive "inlined into" in the remarks.
All of these together with the previous `--sample-profile-inline-replay-scope/--cgscc-inline-replay-scope` allow tweaking how replay is applied. In my testing, I'm using:
1. --sample-profile-inline-replay-scope/--cgscc-inline-replay-scope = Function to only replay on a function
2. --sample-profile-inline-replay-fallback/--cgscc-inline-replay-fallback = NeverInline since I'm feeding in only positive remarks to the replay system
3. --sample-profile-inline-replay-format/--cgscc-inline-replay-format = Line since I'm generating the remarks from DWARF information from GCC which can conflict quite heavily in column number compared to Clang
An alternative configuration could be to use Function scope, AlwaysInline fallback, and Line format with negative remarks, which more closely matches the final call-sites. Note that this can lead to unbounded inlining if a negative remark doesn't match/exist for one reason or another.
Updated various tests to cover the new switches and negative remarks.
Testing:
ninja check-all
Reviewed By: wenlei, mtrofin
Differential Revision: https://reviews.llvm.org/D112040
Replay in sample profiling needs to be queried for candidates that may not have counts or may be below the threshold. If replay is in effect for a function, make sure these are captured and also imported during ThinLTO.
Testing:
ninja check-all
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D112033
createReplacementInstr was a trivial wrapper around
ConstantExpr::getAsInstruction, which also inserted the new instruction
into a basic block. Implement this directly in getAsInstruction by
adding an InsertBefore parameter and change all callers to use it. NFC.
A follow-up patch will remove createReplacementInstr.
Differential Revision: https://reviews.llvm.org/D112791
The sequence of instructions `xor (ashr X, BW-1), C` (or with a truncation,
`xor (trunc (ashr X, BW-1)), C`) takes a value, produces all zeros or all
ones, and uses that to optionally invert a constant depending on whether the
original input was positive or negative. This is the same as checking whether
the value is positive, and selecting between the constant and ~constant.
https://alive2.llvm.org/ce/z/NJ85qY
This is a fairly general version of a fold that helps pull saturating
arithmetic into a canonical form.
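For illustration, with a hypothetical i32 input and the constant 42:
```
; before
%s = ashr i32 %x, 31
%r = xor i32 %s, 42
; after: select between the constant and its bitwise-not based on the sign of %x
%nonneg = icmp sgt i32 %x, -1
%r = select i1 %nonneg, i32 42, i32 -43   ; -43 == ~42
```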
Differential Revision: https://reviews.llvm.org/D109151
Fixes the non-deterministic order of XOR instructions created after
5a7a458306. The order of call argument evaluation is not
defined, so create one Value before the call.
Move the section collecting `AlwaysPreserved` up before any
`maybeInternalize` is called. Otherwise, functions in `AlwaysPreserved` (in this case, `__stack_chk_fail`)
are not preserved.
Reviewed By: MaskRay, tejohnson
Differential Revision: https://reviews.llvm.org/D112684
This patch updates recipe creation to ensure all
VPWidenIntOrFpInductionRecipes are in the header block. At the moment,
new induction recipes can be created in different blocks when trying to
optimize casts and induction variables.
Having all induction recipes in the header makes it easier to
analyze/transform them in VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111300
With D112160 and D112164, on a Chrome Mac build this reduces the total
size of CGProfile sections by 78% (around 25% eliminated entirely) and
total size of object files by 0.14%.
Differential Revision: https://reviews.llvm.org/D112655
This extends the canonicalizeClampLike function to allow cases where the
input is truncated, but still matching on the types of the ICmps. For
example
%t = trunc i32 %X to i8
%a = add i32 %X, 128
%cmp = icmp ult i32 %a, 256
%c = icmp sgt i32 %X, -1
%f = select i1 %c, i8 High, i8 Low
%r = select i1 %cmp, i8 %t, i8 %f
becomes
%c1 = icmp slt i32 %X, -128
%c2 = icmp sge i32 %X, 128
%s1 = select i1 %c1, i32 sext(Low), i32 %X
%s2 = select i1 %c2, i32 sext(High), i32 %s1
%t = trunc i32 %s2 to i8
https://alive2.llvm.org/ce/z/vPzfxH
We limit the transform to constant High and Low values, where we know
the sext are free.
Differential Revision: https://reviews.llvm.org/D108049
This is a follow-up to https://reviews.llvm.org/D90328.
This change eliminates writes to variables where the value being written is already stored in the variable.
It does this by looping through all memory definitions in the current state and getting the defining access for each of them.
When the defining access is a write instruction identical to the original instruction, the redundant write is removed.
For example:
int x;
int foo();
void g();
void h();
void f() {
  x = 1;
  if (foo()) {
    x = 1;
    g();
  } else {
    h();
  }
}
The second `x = 1` will be eliminated since it stores the value that `x` already holds. The pass produces:
void f() {
  x = 1;
  if (foo()) {
    g();
  } else {
    h();
  }
}
Differential Revision: https://reviews.llvm.org/D111727
Gathered loads/extractelements/extractvalue instructions should be
checked to see whether they can represent a vector reordering node too, and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.
Differential Revision: https://reviews.llvm.org/D112454
The motivating test is reduced from:
https://llvm.org/PR52261
Note that the more general problem of folding any binop into a multi-use
select of constants is still there. We need to ease the restriction in
InstCombinerImpl::FoldOpIntoSelect() to catch those. But these examples
never reach that code because Negator exclusively handles negation
patterns within visitSub().
Differential Revision: https://reviews.llvm.org/D112657
Even if we look for `nocapture` we need to bail on escaping pointers.
The crucial thing is that we might not look at a big enough scope when
we derive the memory behavior. Thus, it might be `nocapture` in a larger
context while it is "captured" in a smaller context.
When we strip and accumulate constant offsets we need to pick the right
address space such that the offset APInt has the right bit width.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112544
A constant complaint we get is that the __typeid__ symbols in the CFI
jump tables cause confusing stack traces in applications. Emit the more
readable cfi_jt aliases regardless of function export (LTO vs Thin LTO).
Reviewed By: pcc, tejohnson
Differential Revision: https://reviews.llvm.org/D107934
It's a no-op; no overflow ever happens: https://alive2.llvm.org/ce/z/Zw89rZ
While generally I don't like such hacks,
we have a very good reason to do this: here we are expanding
a run-time correctness check for the vectorization,
and said `umul_with_overflow` will not be optimized out
before we query the cost of the checks we've generated.
Which means, the cost of run-time checks would be artificially inflated,
and after https://reviews.llvm.org/D109368 that will affect
the minimal trip count for which these checks are even evaluated.
And if they aren't even evaluated, then the vectorized code
certainly won't be run.
We could consider doing this in IRBuilder, but then we'd need to
also teach `CreateExtractValue()` to look into chain of `insertvalue`'s,
and I'm not sure there's precedent for that.
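A rough sketch of the idea, assuming the no-op case is a multiply by 1 (hypothetical operands; the real values come from the expanded runtime check):
```
; before: the expander would emit the intrinsic even though it can never overflow
%res = call { i64, i1 } @llvm.umul.with.overflow.i64(i64 %x, i64 1)
; after: build the equivalent { result, false } aggregate directly
%tmp = insertvalue { i64, i1 } undef, i64 %x, 0
%res = insertvalue { i64, i1 } %tmp, i1 false, 1
```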
Refs. https://reviews.llvm.org/D109368#3089809
Gathered loads/extractelements/extractvalue instructions should be
checked to see whether they can represent a vector reordering node too, and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.
Differential Revision: https://reviews.llvm.org/D112454
This patch changes the definition of getStepVector from:
Value *getStepVector(Value *Val, int StartIdx, Value *Step, ...
to
Value *getStepVector(Value *Val, Value *StartIdx, Value *Step, ...
because:
1. it seems inconsistent to pass some values as Value* and some as
integer, and
2. future work will require the StartIdx to be an expression made up
of runtime calculations of the VF.
In widenIntOrFpInduction I've changed the code to pass in the
value returned from getRuntimeVF, but the presence of the assert:
assert(!VF.isScalable() && "scalable vectors not yet supported.");
means that currently this code path is only exercised for fixed-width
VFs and so the patch is still NFC.
Differential revision: https://reviews.llvm.org/D111882
Gathered loads/extractelements/extractvalue instructions should be
checked to see whether they can represent a vector reordering node too, and their
order should be taken into account for better graph reordering analysis.
Also, if the gather node has reused scalars, they must be reordered
instead of the scalars themselves.
Differential Revision: https://reviews.llvm.org/D112454
Need to emit select(cmp) instructions for poison-safe forms of select
ops. Currently Alive2 reports that `Target is more poisonous than source`
for the operations we generate for such instructions.
https://alive2.llvm.org/ce/z/FiNiAA
Differential Revision: https://reviews.llvm.org/D112562
I have removed LoopVectorizationPlanner::setBestPlan, since this
function is quite aggressive: it deletes all other plans except the
one containing the requested <VF,UF> pair. The code is currently
written to assume that all <VF,UF> pairs will live in the same VPlan.
This is overly restrictive, since scalable VFs live in different plans
to fixed-width VFs. When we add support for vectorising epilogue loops
when the main loop uses scalable vectors, the VPlan for the main loop
will be different from the one for the epilogue.
Instead I have added a new function called
LoopVectorizationPlanner::getBestPlanFor
that returns the best vplan for the <VF,UF> pair requested and leaves
all the vplans untouched. We then pass this best vplan to
LoopVectorizationPlanner::executePlan
which now takes an additional VPlanPtr argument.
Differential revision: https://reviews.llvm.org/D111125
Use RdxDesc->getOpcode instead of getUnderlyingInstr()->getOpcode.
Move the code which finds Kind and IsOrdered to be outside the for loop
since neither of these change with the vector part.
Differential Revision: https://reviews.llvm.org/D112547
The final reduction nodes should not be reordered; the order does not
matter for reductions. Also, it might be profitable to vectorize smaller
reduction trees, since the reduction cost may compensate for the small tree cost.
Part of D111574
Differential Revision: https://reviews.llvm.org/D112467
The function simplifyOnce only calls simplifyOnceImpl and does nothing else.
Having this separate helper makes no sense. Removing it.
Patch by Dmitry Bakunevich!
Differential Revision: https://reviews.llvm.org/D112517
Reviewed By: mkazantsev
When peeling a loop, we assume that the latch has a `br` terminator and that
all loop exits are either terminated with an `unreachable` or have a terminating
deoptimize call. So when we peel off the 1st iteration, we change the IDom of
all loop exits to the peeled copy of `NCD(IDom(Exit), Latch)`. This works now,
but if we add logic to support loops with exits that are followed by a block
with an `unreachable` or a terminating deoptimize call, changing the exit's idom
wouldn't be enough and DT would be broken.
For example, let `Exit1` and `Exit2` be loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So neither
of the exits dominates this unreachable block. If we change the IDoms of the
exits to some peeled loop block, we don't update the dominators of the unreachable
block. Currently we just don't get to the peeling logic, saying that we can't peel
such loops.
Previously we stored exits' IDoms in a map before peeling a loop and then, after
peeling off one iteration, we changed their IDoms.
Now we use the same logic not only for exits but for all non-loop blocks dominated
by the loop.
So when we add logic to support peeling loops with exits which branch, for example,
to an unreachable-terminated block, we would update the IDoms not only for exits,
but for their successors.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev, nikic
Always insert values into ExprValueMap, and instead skip using them
in SCEVExpander if poison-generating flags have been lost. This
ensures that all values that are in ValueExprMap are also in
ExprValueMap, so we can use the latter to invalidate the former.
This change is probably not entirely NFC for the case where
originally the SCEV had no nowrap flags but they were inferred
later, in which case that would now allow reusing the existing
value for expansion.
Differential Revision: https://reviews.llvm.org/D112389
The recently added logic to canonicalize exit conditions to unsigned relies on facts which hold about the use (i.e. the exit test). Applying this blindly to the icmp is not legal, as there may be another use which never reaches the exit. Restrict ourselves to the case where we have a single use.
Need to change the order of the reduction/binops args pair vectorization
attempts. We need to try to find the reduction first and postpone
vectorization of binops args. This may help to find more reduction
patterns and vectorize them.
Part of D111574.
Differential Revision: https://reviews.llvm.org/D112224
We observe a hang within iterativelySimplifyCFG due to infinite
loop execution. Currently, there is no limit to this loop, so
in case of a bug it just runs forever. This patch adds an assert
that will break it after 1000 iterations if it hasn't converged.
Currently strip.invariant/launder.invariant are handled by
constructing constant expressions with the intrinsics skipped.
This takes an alternative approach of accumulating the offset
using stripAndAccumulateConstantOffsets(), with a flag to look
through invariant.group intrinsics.
Differential Revision: https://reviews.llvm.org/D112382
At the moment a dummy entry block is created at the beginning of VPlan
construction. This dummy block is later removed again.
This means it is not easy to identify the VPlan header block in a
general fashion, because during recipe creation it is the single
successor of the entry block, while later it is the entry block.
To make getting the header easier, just skip creating the dummy block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111299
Fixes a crash observed by oss-fuzz in 39934. The issue at hand is that the code expects a pattern match on m_Mul to imply the operand is a mul instruction; however, mul constant expressions are also valid here.
As this API is now internally offset-based, we can accept a starting
offset and remove the need to create a temporary bitcast+gep
sequence to perform an offset load. The API now mirrors the
ConstantFoldLoadFromConst() API.
The logic in this patch is that if we find a comparison which would be unsigned except for when the loop is infinite, and we can prove that an infinite loop must be ill defined, we can still make the predicate unsigned.
The eventual goal (combined with a follow on patch) is to use the fact the loop exits to remove the zext (see tests) entirely.
A couple of points worth noting:
* We lose the ability to prove the loop unreachable by committing to the must exit interpretation. If instead, we later proved that rhs was definitely outside the range required for finiteness, we could have killed the loop entirely. (We don't currently implement this transform, but could, in theory, do so.)
* simplifyAndExtend has a very limited list of users it walks. In particular, in the examples it stops at the zext and never visits the icmp. (Because we can't fold the zext to an addrec yet in SCEV.) Being willing to visit when we haven't simplified regresses multiple tests (seemingly because of less optimal results when computing trip counts). D112170 explores fixing that, but - at least so far - appears to be too expensive compile time wise.
Differential Revision: https://reviews.llvm.org/D111836
Make use of the getGEPIndicesForOffset() helper for creating GEPs.
This handles arrays as well, uses correct GEP index types and
reduces code duplication.
Differential Revision: https://reviews.llvm.org/D112263
When I was playing with Coroutines, I found that it is possible to generate
the following IR:
```
%struct = alloca ...
%sub.element = getelementptr %struct, i64 0, i64 %index ; %index is not zero
lifetime.marker.start(%sub.element)
; ... use of %sub.element ...
lifetime.marker.end(%sub.element)
store %struct to xxx ; %struct is escaping!
<suspend points>
```
Then the AllocaUseVisitor would collect the lifetime marker for
sub.element and treat it as a lifetime marker of the alloca! So it
concludes that the alloca could be put on the stack instead of the frame
based on the lifetime markers alone.
The root cause for the bug is that AllocaUseVisitor collects wrong
lifetime markers.
This patch fixes this.
Reviewed By: lxfind
Differential Revision: https://reviews.llvm.org/D112216
Transformations may strip the attribute from the argument, e.g. for unused
arguments, which will result in a shadow offset mismatch between caller and
callee.
Stripping noundef for used arguments can be a problem, as TLS is not going
to be set by the caller. However, this is not the goal of the patch, and I am
not aware whether that's even possible.
Differential Revision: https://reviews.llvm.org/D112197
As discussed in D112016, our current requirement of speculatability
for ephemeral values is overly strict: what we really care about is that
the instruction will be DCEd once the assume is dropped. For that
it is sufficient that the instruction is side-effect free and not
a terminator.
In particular, this allows non-dereferenceable loads to be ephemeral
values.
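A small hypothetical example of the case this enables, where the load and the compare feed only the assume:
```
%v = load i32, i32* %p          ; not known to be dereferenceable
%c = icmp sgt i32 %v, 0
call void @llvm.assume(i1 %c)   ; dropping the assume makes %c and %v trivially dead
```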
Differential Revision: https://reviews.llvm.org/D112179
shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
This is motivated by an example in D111800
(although that patch avoids the problem for that particular example).
The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D
There is no difference on the PhaseOrdering test from D111800
because the aarch64 cost model says that the shuffle cost is 3 while
the fadd cost is 2.
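A sketch of the fold, with hypothetical element count, opcode, and masks:
```
; before
%b1 = fadd <4 x float> %x, %y
%b2 = fadd <4 x float> %x, %w
%r  = shufflevector <4 x float> %b1, <4 x float> %b2, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
; after
%sx = shufflevector <4 x float> %x, <4 x float> %x, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
%sy = shufflevector <4 x float> %y, <4 x float> %w, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
%r  = fadd <4 x float> %sx, %sy
```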
Differential Revision: https://reviews.llvm.org/D111901
Vectorization of PHIs and stores is very similar; it might be beneficial to
try to revectorize stores (like PHIs) if the total number of stores with
the same/alternate opcode is less than the vector size but the number of
stores with the same type is larger than the vector size.
Differential Revision: https://reviews.llvm.org/D109831
In order to explore different variants of reassociation, the current implementation uses a "swap in a loop" approach. Unfortunately, the implementation is more complicated than it could be. This is an attempt to streamline the code. The new approach is to extract the core functionality into a helper function and call it explicitly as many times as required.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D112128
At the moment, rewriteLoopExitValue forgets the current phi node in the
loop that collects phis to rewrite. A few lines after the value is
forgotten, SCEV is used again to analyze incoming values and
potentially expand SCEV expression. This means that another SCEV is
created for PN, before the IR is actually updated in the next loop.
This leads to accessing invalid cached expression in combination with
D71539.
PN should only be changed once the actual incoming exit value is set in
the next loop. Moving invalidation there should ensure that PN is
invalidated in all relevant cases.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D111495
bitcast (inselt (bitcast X), Y, 0) --> or (and X, MaskC), (zext Y)
https://alive2.llvm.org/ce/z/Ux-662
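A hypothetical little-endian example (i32 and <4 x i8> chosen only for illustration):
```
; before
%v = bitcast i32 %x to <4 x i8>
%i = insertelement <4 x i8> %v, i8 %y, i32 0
%r = bitcast <4 x i8> %i to i32
; after: lane 0 is the low byte on little-endian targets
%m = and i32 %x, -256         ; clear the low 8 bits
%z = zext i8 %y to i32
%r = or i32 %m, %z
```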
Similar to D111082 / db231ebdb0 :
We want to avoid relatively opaque vector ops on types that are
likely supported by the backend as scalar integers. The bitwise
logic ops are more likely to allow further combining.
We probably want to generalize this to allow a shift too, but
that would oppose instcombine's general rule of not creating
extra instructions, so that's left as a potential follow-up.
Alternatively, we could do that transform in VectorCombine
with the help of the TTI cost model.
This is part of solving:
https://llvm.org/PR52057
As discussed in:
* https://reviews.llvm.org/D94166
* https://lists.llvm.org/pipermail/llvm-dev/2020-September/145031.html
The GlobalIndirectSymbol class lost most of its meaning in
https://reviews.llvm.org/D109792, which disambiguated getBaseObject
(now getAliaseeObject) between GlobalIFunc and everything else.
In addition, as long as GlobalIFunc is not a GlobalObject and
getAliaseeObject returns GlobalObjects, a GlobalAlias whose aliasee
is a GlobalIFunc cannot currently be modeled properly. Creating
aliases for GlobalIFuncs does happen in the wild (e.g. glibc). In addition,
calling getAliaseeObject on a GlobalIFunc will currently return nullptr,
which is undesirable because it should return the object itself for
non-aliases.
This patch refactors the GlobalIFunc class to inherit directly from
GlobalObject, and removes GlobalIndirectSymbol (while inlining the
relevant parts into GlobalAlias and GlobalIFunc). This allows for
calling getAliaseeObject() on a GlobalIFunc to return the GlobalIFunc
itself, making getAliaseeObject() more consistent and enabling
alias-to-ifunc to be properly modeled in the IR.
I exercised some judgement in the API clients of GlobalIndirectSymbol:
some were 'monomorphized' for GlobalAlias and GlobalIFunc, and
some remained shared (with the type adapted to become GlobalValue).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D108872
To guarantee convergence of the algorithm, each optimization step should decrease the number of instructions when the IR is modified. This property does not hold in this test case. The problem is that the SCEV Expander may do "unexpected" reassociation, which results in the creation of new min/max chains and the introduction of extra instructions. As a result, on each step we indefinitely optimize back and forth.
The solution is to prevent the SCEV Expander from performing uncontrolled reassociations by means of "Unknown" expressions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D112060
This is trivial. It was left out of the original review only because we had multiple copies of the same code in review at the same time, and keeping them in sync was easiest if the structure was kept in sync.
This patch duplicates a bit of logic we apply to comparisons encountered during the IV users walk to conditions which feed exit conditions. Why? simplifyAndExtend has a very limited list of users it walks. In particular, in the examples it stops at the zext and never visits the icmp. (Because we can't fold the zext to an addrec yet in SCEV.) Being willing to visit when we haven't simplified regresses multiple tests (seemingly because of less optimal results when computing trip counts).
Note that this can be trivially extended to multiple exiting blocks. I'm leaving that to a future patch (solely to cut down on the number of versions of the same code in review at once.)
Differential Revision: https://reviews.llvm.org/D111896
Using BPI within loop predication is non-trivial because BPI is only
preserved lossily in loop pass manager (one fix exposed by lossy
preservation is up for review at D111448). However, since loop
predication is only used in downstream pipelines, it is hard to keep BPI
from breaking for incomplete state with upstream changes in BPI.
Also, correctly preserving BPI for all loop passes is a non-trivial
undertaking (D110438 does this lossily), while the benefit of using it
in loop predication isn't clear.
In this patch, we rely on profile metadata to get almost similar benefit as
BPI, without actually using the complete heuristics provided by BPI.
This avoids the compile time explosion we tried to fix with D110438 and
also avoids fragile bugs because BPI can be lossy in loop passes
(D111448).
Reviewed-By: asbirlea, apilipenko
Differential Revision: https://reviews.llvm.org/D111668
Inspired by D111968, provide an isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.
Differential Revision: https://reviews.llvm.org/D111998
Need to follow the order of the reused scalars from the
ReuseShuffleIndices mask rather than rely on the natural order.
Differential Revision: https://reviews.llvm.org/D111898
The goal is to allow grafting an inline tree from Clang or GCC into a new compilation without affecting other functions. For GCC, we're doing this by extracting the inline tree from dwarf information and generating the equivalent remarks.
This allows easier side-by-side asm analysis and a trial way to see if a particular inlining setup provides benefits by itself.
Testing:
ninja check-all
Reviewed By: wenlei, mtrofin
Differential Revision: https://reviews.llvm.org/D110658
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.
The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is no reason to track FirstInst and return
it.
Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.
Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
Record widening decisions for memory operations within the planned recipes and
use the recorded decisions in code-gen rather than querying the cost model.
Differential Revision: https://reviews.llvm.org/D110479
This is NFC-intended for the callers. Posting in case there are
other potential users that I missed.
I would also use this from VectorCombine in a patch for:
https://llvm.org/PR52178 ( D111901 )
Differential Revision: https://reviews.llvm.org/D111891
When peeling a loop, we assume that the latch has a `br` terminator and
that all loop exits are either terminated with an `unreachable` or have
a terminating deoptimize call. So when we peel off the 1st iteration, we
change the IDom of all loop exits to the peeled copy of
`NCD(IDom(Exit), Latch)`. This works now, but if we add logic to support
loops with exits that are followed by a block with an `unreachable` or a
terminating deoptimize call, changing the exit's idom wouldn't be enough
and DT would be broken.
For example, let `Exit1` and `Exit2` be loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So
neither of the exits dominates this unreachable block. If we change the
IDoms of the exits to some peeled loop block, we don't update the
dominators of the unreachable block. Currently we just don't get to the
peeling logic, saying that we can't peel such loops.
With this NFC we just insert edges from cloned exiting blocks to their
exits after peeling each iteration (we accumulate the insertion updates
and then after peeling apply the updates to DT).
This patch was a part of D110922.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev
This removes an over-specified fold. The more general transform
was added with:
727e642e97
There's a difference on an existing test that shows a potentially
unnecessary use limit on an icmp fold.
That fold is in InstCombinerImpl::foldICmpSubConstant(), and IIRC
there was some back-and-forth on it and similar folds because they
could cause analysis/passes (SCEV, LSR?) to miss optimizations.
Differential Revision: https://reviews.llvm.org/D111410
(iN X s>> (N-1)) & Y --> (X < 0) ? Y : 0
https://alive2.llvm.org/ce/z/qeYhdz
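Spelled out as IR, with an arbitrary i32 width:
```
; before
%s = ashr i32 %x, 31
%r = and i32 %s, %y
; after
%isneg = icmp slt i32 %x, 0
%r     = select i1 %isneg, i32 %y, i32 0
```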
I was looking at a missing abs() transform and found my way to this
generalization of an existing fold that was added with D67799.
As discussed in that review, we want to make sure codegen handles
this difference well, and for all of the targets/types that I
spot-checked, it looks good.
I am leaving the existing fold in place in this commit because
it covers a potentially missing icmp fold, but I plan to remove
that as a follow-up commit as suggested during review.
Differential Revision: https://reviews.llvm.org/D111410
This patch adds a pass option to only run transforms that scalarize
vector operations and do not create new vector instructions.
When running VectorCombine early in the pipeline introducing new vector
operations can have negative effects, like blocking loop or SLP
vectorization. To avoid regressions, restrict the early VectorCombine
run (when using -enable-matrix) to only perform scalarization and not
introduce new vector operations.
This is done as option to the pass directly, which is then set when
adding the pass to the pipeline. This is done for the new pass manager
only.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D111800
Fixes: https://bugs.llvm.org/show_bug.cgi?id=51841
This patch places an arbitrary limit on the size of DIExpressions that
we will produce via salvaging, for performance reasons. This helps to
fix a performance issue observed in the bug above, in which debug values
would be salvaged hundreds of times, producing expressions with over
1000 elements and causing the compiler to hang. Limiting the size of
debug values that we will produce to 128 largely fixes this issue.
Reviewed By: dblaikie, jmorse
Differential Revision: https://reviews.llvm.org/D110332
Add lshr (sext i1 X to iN), C --> select (X, -1 >> C, 0) case. This expands
the C == N-1 case to arbitrary C.
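For example, with a hypothetical width and shift amount:
```
; before
%s = sext i1 %x to i32
%r = lshr i32 %s, 7
; after: -1 logically shifted right by 7 is 0x01ffffff
%r = select i1 %x, i32 33554431, i32 0
```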
Fixes PR52078.
Reviewed By: spatel, RKSimon, lebedev.ri
Differential Revision: https://reviews.llvm.org/D111330
Need to check that either Idx is UndefMaskElem and value is UndefValue
or Idx is valid and value is the same as the scalar value in the node.
Differential Revision: https://reviews.llvm.org/D111802
Rather than checking for loop nest preheaders upfront in IVUsers,
move this requirement into isSafeToExpand() from SCEVExpander.
Historically, LSR did not check whether SCEVs are safe to expand
and fully relied on IVUsers to validate this. Later, support for
non-expandable SCEVs was added via rigid formulas.
Checking this in isSafeToExpand() makes it more obvious what
exactly this check is guarding against, and avoids the awkward
loop nest scan.
This is a followup to https://reviews.llvm.org/D111493#3055286.
Differential Revision: https://reviews.llvm.org/D111681
The initial MemoryAccess *Current assignment is never used, and all other uses are initialized/used within the worklist loop (and not across multiple iterations) - so move the variable internal to the loop.
Fixes scan-build unused assignment warning.
If the parameter had been annotated as nonnull because of the null
check, we want to remove the attribute, since it may no longer apply and
could result in miscompiles if left. Similarly, we also want to remove
undef-implying attributes, since they may not apply anymore either.
Fixes PR52110.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D111515
This extends the foldOpIntoPhi code used when visiting a freeze user of a phi to allow any non-undef/poison operand as opposed to only non-undef/poison constants. This lets us hoist a freeze in the increment of an IV into the preheader in many cases.
Differential Revision: https://reviews.llvm.org/D111744
Even if there are no interesting functions, the SCCP solver would still run
before bailing. Now bail earlier and avoid running the solver for nothing.
Differential Revision: https://reviews.llvm.org/D111645
This is NFC-intended for scalar code. There are still unnecessary
m_ConstantInt restrictions in surrounding code, so this is not a
complete fix.
This prevents regressions seen with a planned follow-on to D111410.
If we have an instruction which produces poison only when flags are specified on the instruction, then we know that freezing the operands and dropping flags is equivalent to freezing the result. If we know those flags don't result in any undefined behavior being executed, then there's no point in preserving the flags as we gain no knowledge by having them.
This patch extends the existing propagation logic which sinks freeze to single potential non-poison operands to allow dropping of flags when we know the freeze is the sole use of the instruction with poison flags.
The main value is that we tend to sink freezes towards the phi in IV cycles where the incoming value to the phi is the freeze of an IV increment. This will in turn (in a future patch), let us fold the freeze through the phi into the loop preheader. Motivated by eliminating need for CanonicalizeFreezeInLoops for the clearly profitable cases from onephi.ll test case in the test directory.
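A small sketch of the equivalence being used, with a hypothetical nsw add as the flag-carrying instruction:
```
; before: %a is only poison because of its nsw flag; %fr is its sole use
%a  = add nsw i32 %x, 1
%fr = freeze i32 %a
; after: freeze the operand and drop the flag instead
%fx = freeze i32 %x
%fr = add i32 %fx, 1
```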
Differential Revision: https://reviews.llvm.org/D111675
This patch fixes another crash revealed by PR51614:
when *deciding* to vectorize with masked interleave groups, check if the access
is reverse (which is currently not supported).
Differential Revision: https://reviews.llvm.org/D108900
If another inlining session came after a ModuleInlinerWrapperPass, the
advisor analysis would still be cached, but its Result would be cleared.
We need to clear both.
This addresses PR52118
Differential Revision: https://reviews.llvm.org/D111586
This may not be obvious, but Alive2 agrees:
https://alive2.llvm.org/ce/z/Ld9qNT
If the mul has "nsw", then -1 * INT_MIN is poison, so the
negate can also have "nsw" because 0 - INT_MIN is poison.
If the mul has "nuw", then that means the "OtherOp" can only
be 0 or 1 (anything else multiplied by 0xfff... would wrap).
So the replacement negate must be "nsw" because it is either
"0-0" or "0-1".
This is another regression noticed with a planned follow-up
to D111410.
This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.
Unlike debug intrinsics, the PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:
- Flipped default param of getFirstNonPHIOrDbg API to skip pseudo probes
- Unblocked CSE by avoiding pseudo probe from clobbering memory SSA
- Unblocked induction variable simplification
- Allow empty loop deletion by treating the probe intrinsic as isDroppable
- Some refactoring.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D110847
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there is another user of the instruction
other than the Phi node which needs to be widened.
This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.
Reviewed By: david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D111294
This is a follow up of D110529 that disallowed constexprs. That change
introduced a regression as this also disallowed constexprs that are function
pointers, which is actually one of the motivating use cases that we do want to
support.
Differential Revision: https://reviews.llvm.org/D111567
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that
1. is feeding an exit condition,
2. dominates the latch,
3. is not already known to be dereferenceable,
4. and has a loop invariant address.
If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.
This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D108114
LoopSimplifyCFG does not need MSSA, but should preserve it if it's available.
This is a legacy PM change, aimed to denoise the test changes in D109958.
Differential Revision: https://reviews.llvm.org/D111578
There may be some other patterns like this or a generalization,
but this is an example that I noticed would definitely regress
with a planned follow-up to D111410.
https://alive2.llvm.org/ce/z/GVpQDb
At the moment, a VPValue is created for the backedge-taken count, which
is used by some recipes. To make it easier to identify the operands of
recipes using the backedge-taken count, print it at the beginning of the
VPlan if it is used.
Reviewed By: a.elovikov
Differential Revision: https://reviews.llvm.org/D111298
As a brief reminder, an "exit count" is the number of times the backedge executes before some event. It can be zero if we exit before the backedge is reached. A "trip count" is the number of times the loop header is entered if we branch into the loop. In general, TC = BTC + 1 and thus a zero trip count is ill defined
There is a corner case which we don't handle well. Let's assume i8 for our examples to keep things simple. If BTC = 255, then the correct trip count is 256. However, 256 is not representable in i8.
In theory, code which needs to reason about trip counts is responsible for checking for this cornercase, and either bailing out, or handling it correctly. Historically, we don't have a great track record about actually doing so.
When reviewing D109676, I found myself asking a basic question. Was there any good reason to preserve the current wrap-to-zero behavior when converting from backedge taken counts to trip counts? After reviewing existing code, I could not find a single case which appears to correctly and precisely handle the overflow case.
This patch changes the default behavior to extend instead of wrap. That is, if the result might be 256, we return a value of i9 type to ensure we interpret the count correctly. I did leave the legacy behavior as an option since a) loop-flatten stops triggering if I extend due to weirdly specific pattern matching I didn't understand and b) we could reasonably use the mode if we'd externally established a lack of overflow.
I want to emphasize that this change is *not* NFC. There are two call sites (one in ScalarEvolution.cpp, one in LoopCacheAnalysis.cpp) which are switched to the extend semantics. The former appears imprecise (but correct) for a constant 255 BTC. The latter appears incorrect, though I don't have a test case.
Differential Revision: https://reviews.llvm.org/D110587
https://bugs.llvm.org/show_bug.cgi?id=27506
https://bugs.llvm.org/show_bug.cgi?id=31652
https://bugs.llvm.org/show_bug.cgi?id=51043
Problems with SimpleLoopUnswitch cause the bug reports above.
```
while (...) {
if (C) { A }
else { B }
}
Into:
C' = freeze(C)
if (C') {
while (...) { A }
} else {
while (...) { B }
}
```
This problem can be solved by adding a freeze on the hoisted branch condition (the transform above), and was solved by D29015.
However, D29015 was later reverted due to a performance regression (2b5a897651).
It is not the first time that an added freeze has caused a performance regression.
SimplifyCFG also had a problem with UB caused by branching on undef, which was solved by adding a freeze to the branch condition (D104569).
A performance regression occurred with D104569, and patches such as D105344 and D105392 were written to minimize it.
This patch corrects SimpleLoopUnswitch in the same way D104569 handled SimplifyCFG, while minimizing the performance loss with patches like D105344 and D105392. (This patch was rebased with the author's permission.)
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D106041
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
We were using the type of the loop back edge count to represent the
store size. This failed for small loop counts (e.g. in the added test,
the loop count was an i2).
Use the index type instead.
Fixes PR52104.
Differential Revision: https://reviews.llvm.org/D111401
The transformation from malloc+memset to calloc is always correct and in many situations
it brings significant observable benefits in terms of execution speed and memory consumption [1][2].
Unfortunately there are cases when producing calloc causes performance drops [3].
As discussed here: https://reviews.llvm.org/D103009 it's possible to differentiate between those 2 scenarios.
If the optimizer is able to prove that after the malloc call it's _very_ likely to reach the memset branch, then after
emitting calloc we shouldn't observe any performance hit. Therefore finding the "null pointer check" pattern
before the memset basic block sounds like a good justification for performing the transformation.
That method was also already suggested by GCC folks [4]. The main reason for the change is that, for now,
to be safe, we check for a post-dominance relation, which is a way too conservative approach, making the transformation
"almost" disabled in practice. This patch tends to enable the transformation again, but with extra care.
[1] https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc
[2] https://vorpus.org/blog/why-does-calloc-exist/
[3] http://smalldatum.blogspot.com/2017/11/a-new-optimization-in-gcc-5x-and-mysql.html
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83022
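Roughly, the shape of IR being looked for (a hypothetical, simplified sketch of the control flow):
```
; before: malloc guarded by a null check, with memset on the non-null path
%p = call i8* @malloc(i64 %size)
%isnull = icmp eq i8* %p, null
br i1 %isnull, label %fail, label %init
init:
  call void @llvm.memset.p0i8.i64(i8* %p, i8 0, i64 %size, i1 false)
  br label %cont
; after: the malloc+memset pair becomes a single calloc
%p = call i8* @calloc(i64 1, i64 %size)
```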
Differential Revision: https://reviews.llvm.org/D110021
The test diffs show that we have better analysis/folds for 'add'
(although we should at least have the simplifications
independently, so we don't have the one-use restriction).
This is related to solving regressions that would appear in
transforms related to D111410, and that is part of a series
of enhancements that may eventually help solve PR34047.
https://alive2.llvm.org/ce/z/3tB9KG
define i1 @src(i8 %x, i8 %C, i8 %C2) {
%sub = sub nuw i8 %C2, %x
%r = icmp slt i8 %sub, %C
ret i1 %r
}
define i1 @tgt(i8 %x, i8 %C, i8 %C2) {
%Cnot = xor i8 %C, -1
%C2not = xor i8 %C2, -1
%add = add nuw i8 %x, %C2not
%r = icmp sgt i8 %add, %Cnot
ret i1 %r
}
There were 2 related but over-specified folds for:
C1 - X == C
One allowed multi-use but was limited to equal constants.
The other allowed different constants but disallowed multi-use.
This combines the 2 folds into a more general match.
The test diffs show the multi-use cases that were falling
through the cracks.
https://alive2.llvm.org/ce/z/4_hEt2
define i1 @src(i8 %x, i8 %subC, i8 %C) {
%s = sub i8 %subC, %x
%r = icmp eq i8 %s, %C
ret i1 %r
}
define i1 @tgt(i8 %x, i8 %subC, i8 %C) {
%newC = sub i8 %subC, %C
%isneg = icmp eq i8 %x, %newC
ret i1 %isneg
}
If a loop is flattened, the inner loop is removed and the LPM
should be informed of this fact, so it can invalidate associated
analyses. To support this, we relax an assertion in LPMUpdater to
allow invalidating non-top-level loops when running in LoopNestMode,
as the pass does not know how exactly it will get scheduled.
Differential Revision: https://reviews.llvm.org/D111350
This factors out utilities for scanning a bounded block of instructions since we have this code repeated in a bunch of places. The change to InlineFunction isn't strictly NFC as the limit mechanism there didn't handle debug instructions correctly.
Removed obsolete DT verification that should not be there because the
strategy of DT updates has changed.
Differential Revision: https://reviews.llvm.org/D110922
Added support for peeling loops with "deoptimizing" exits -
exits where the block itself, or any of its children (or any of their
children, etc.), either has a @llvm.experimental.deoptimize call
prior to the terminating return instruction of this basic block
or is terminated with unreachable. All blocks in the
sequence must have a single successor, except possibly the last
one.
Previously we only checked the exit block for being deoptimizing.
Now we check if the last reachable block from the exit is deoptimizing.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D110922
Reviewed By: mkazantsev
LoopFlatten does preserve loop analyses (DT, LI and SCEV), but
currently doesn't mark them as preserved in the NewPM (they are
marked as preserved in the LegacyPM). I think this doesn't really
have an effect in the end because the loop pass adaptor will just
assume they're preserved anyway, but let's be explicit about this
for the sake of clarity.
Differential Revision: https://reviews.llvm.org/D111328
This patch fixes problems reported in PR51981.
When rotating a loop it isn't enough to just forget SCEV for that
loop nest. When rotating we might clone some instructions from the
old header into the preheader, and insert new PHI nodes to merge
values together. There could be users of the original value that are
updated to use the PHI result. And those users were not necessarily
depending on a PHI node earlier, so they weren't cleaned up when just
forgetting all SCEVs for the loop nest. So we need to explicitly
forget those values to avoid invalid cached SCEV expressions.
Reviewed By: fhahn, mkazantsev
Differential Revision: https://reviews.llvm.org/D110813
SCEV-based salvaging will use excessive resources if it encounters
very long SCEV expressions. This patch places a limit on the length of
SCEV expression that salvaging will attempt to translate.
Reviewed by: Orlando
Differential Revision: https://reviews.llvm.org/D110558
To better reflect the meaning of the now-disambiguated {GlobalValue,
GlobalAlias}::getBaseObject after breaking off GlobalIFunc::getResolverFunction
(D109792), the function is renamed to getAliaseeObject.
Currently the max alignment representable is 1GB, see D108661.
Setting the align of an object to 4GB is desirable in some cases to make sure the lower 32 bits are clear which can be used for some optimizations, e.g. https://crbug.com/1016945.
This uses an extra bit in instructions that carry an alignment. We can store 15 bits of "free" information, and with this change some instructions (e.g. AtomicCmpXchgInst) use 14 bits.
We can increase the max alignment representable above 4GB (up to 2^62) since we're only using 33 of the 64 values, but I've just limited it to 4GB for now.
The one place we have to update the bitcode format is for the alloca instruction. It stores its alignment into 5 bits of a 32 bit bitfield. I've added another field which is 8 bits and should be future proof for a while. For backward compatibility, we check if the old field has a value and use that, otherwise use the new field.
Updating clang's max allowed alignment will come in a future patch.
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D110451
Currently the max alignment representable is 1GB, see D108661.
Setting the align of an object to 4GB is desirable in some cases to make sure the lower 32 bits are clear which can be used for some optimizations, e.g. https://crbug.com/1016945.
This uses an extra bit in instructions that carry an alignment. We can store 15 bits of "free" information, and with this change some instructions (e.g. AtomicCmpXchgInst) use 14 bits.
We can increase the max alignment representable above 4GB (up to 2^62) since we're only using 33 of the 64 values, but I've just limited it to 4GB for now.
The one place we have to update the bitcode format is for the alloca instruction. It stores its alignment into 5 bits of a 32 bit bitfield. I've added another field which is 8 bits and should be future proof for a while. For backward compatibility, we check if the old field has a value and use that, otherwise use the new field.
Updating clang's max allowed alignment will come in a future patch.
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D110451
Currently the max alignment representable is 1GB, see D108661.
Setting the align of an object to 4GB is desirable in some cases to make sure the lower 32 bits are clear which can be used for some optimizations, e.g. https://crbug.com/1016945.
This uses an extra bit in instructions that carry an alignment. We can store 15 bits of "free" information, and with this change some instructions (e.g. AtomicCmpXchgInst) use 14 bits.
We can increase the max alignment representable above 4GB (up to 2^62) since we're only using 33 of the 64 values, but I've just limited it to 4GB for now.
The one place we have to update the bitcode format is for the alloca instruction. It stores its alignment into 5 bits of a 32 bit bitfield. I've added another field which is 8 bits and should be future proof for a while. For backward compatibility, we check if the old field has a value and use that, otherwise use the new field.
Updating clang's max allowed alignment will come in a future patch.
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D110451
We need to be better at exposing the comparison predicate to getCmpSelInstrCost calls as some targets (e.g. X86 SSE) have very different costs for different comparisons (PR48337), and we can't always rely on the optional Instruction argument.
This initial commit requires explicit condition type and predicate arguments. The next step will be to review a lot of the existing getCmpSelInstrCost calls which have used BAD_ICMP_PREDICATE even when the predicate is known.
Differential Revision: https://reviews.llvm.org/D111024
We already handle more complicated cases like:
extelt (bitcast (inselt poison, X, 0)) --> trunc (lshr X)
But we missed this simpler pattern:
https://alive2.llvm.org/ce/z/D55h64 / https://alive2.llvm.org/ce/z/GKzzRq
This is part of solving:
https://llvm.org/PR52057
I made the transform depend on legal/desirable int type to avoid creating
a shift of an illegal type (for example i128). I'm not sure if that
restriction is actually necessary, but we can change that as a follow-up
if the backend can deal with integer ops on too-wide illegal types.
The pile of AVX512 test changes are all neutral AFAICT - the x86 backend
seems to know how to turn that into the expected "kmov" instructions.
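As one hypothetical little-endian instance of this kind of fold (index 0, so no shift is needed):
```
; before
%v = bitcast i32 %x to <4 x i8>
%e = extractelement <4 x i8> %v, i32 0
; after: lane 0 is the low byte on little-endian targets
%e = trunc i32 %x to i8
```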
Differential Revision: https://reviews.llvm.org/D111082
As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared.
This also removes the need to disable the mandatory inlining phase in
tests.
In a departure from the previous remark, we don't output a 'cost' in
this case, because there's no such thing. We just report that inlining
happened because of the attribute.
Differential Revision: https://reviews.llvm.org/D110891
This removes repeated calls to m_Not, so it is hopefully a little
more efficient.
Also, we may need to enhance some of these blocks to allow
logical and/or (select of bools).
Some initially gathered nodes missed the check for the reused scalars,
which leads to high gather cost. Such nodes still can be represented as
m gathers + shuffle instead of n gathers, where m < n.
Differential Revision: https://reviews.llvm.org/D111153
The current way of detecting hostcalls, by looking for the "ockl_hostcall_internal()" function in the module, does not seem reliable enough. LTO may rename the "ockl_hostcall_internal()" function when an application is compiled with "-fgpu-rdc", causing the MetadataStreamer pass to fail to detect hostcalls, so it does not set the "hidden_hostcall_buffer" kernel argument.
This change adds a new module flag: hostcall, which can be used to detect whether GPU functions use host calls for printf.
Differential revision: https://reviews.llvm.org/D110337
In SCCPSolver::markArgInFuncSpecialization, the ValueState map may be
reallocated *after* the initial ValueLatticeElement reference is grabbed, but
*before* its use in copy initialization. This causes a use-after-free. To fix
this, this commit changes the behavior to create the new ValueLatticeElement
before assigning the old one to it.
Patch by: https://github.com/duck-37/
Differential Revision: https://reviews.llvm.org/D111112
When we check if a load is loop invariant by finding a dominating
invariant.start call, we strip bitcasts until we get to an i8* Value,
and look for an invariant.start use of the i8* Value.
We may accidentally end up at an i8 global and look at a global's uses,
which we shouldn't do in a loop pass. Although we could make this
logic work with globals, that's not currently intended.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D111098
This splits out the logic from shouldChangeType() that
currently allows 8/16/32-bit transforms even if those
types are not listed as legal in the data layout.
This could be useful as a predicate for vector
insert/extract transforms.
Note that this leaves the subsequent checks in
shouldChangeType() unchanged. We may want to merge
the checks for i1 and/or "ToLegal" into "isDesirable",
but that may alter existing transforms.
In TargetLibraryInfoImpl::isValidProtoForLibFunc we no longer
need the IsSizeTTy lambda function and the SizeTTy object. Instead
we just follow the regular structure of checking for integer types
given an expected number of bits.
This patch adds a new command line option `openmp-opt-max-iterations`
that controls the maximum number of iterations the attributor will run
for when compiling OpenMP target device code. This patch also adds a
remark to indicate when the attributor failed because it did not run
for enough iterations.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110749
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
The first two tries at this were reverted because they caused an
infinite loop in instcombine.
That should be fixed after a series of patches that ended with
removing the faulty opposing transform:
3fabd98e5b
Original commit message:
(masked) trunc (lshr X, C) --> (masked) lshr (trunc X), C
Narrowing the shift should be better for analysis and can lead
to follow-on transforms as shown.
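A sketch of the masked case, with hypothetical widths, shift amount, and mask chosen so the narrowing is valid:
```
; before
%s = lshr i64 %x, 3
%t = trunc i64 %s to i16
%r = and i16 %t, 255
; after: shift the narrow value instead
%tx = trunc i64 %x to i16
%s2 = lshr i16 %tx, 3
%r  = and i16 %s2, 255
```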
Attempt at a general proof in Alive2:
https://alive2.llvm.org/ce/z/tRnnSF
Here are a couple of the specific tests:
https://alive2.llvm.org/ce/z/bCnTp-
https://alive2.llvm.org/ce/z/TfaHnb
Differential Revision: https://reviews.llvm.org/D110170
This patch is changing the InsertElement's placeholder to poison without changing the LSV's behavior.
Regardless of whether `StoreTy` is FixedVectorType or not, the poison value will be overwritten with a different value.
Therefore, whether the InsertElement's placeholder is poison or undef will not affect the result of the program.
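A rough sketch of the pattern in question, using a simplified helper rather than the LSV code itself: every lane of the placeholder is overwritten before the vector is used, so poison vs. undef is unobservable.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

static Value *buildStoreVector(IRBuilder<> &Builder, FixedVectorType *VecTy,
                               ArrayRef<Value *> Scalars) {
  Value *Vec = PoisonValue::get(VecTy); // previously UndefValue::get(VecTy)
  for (unsigned I = 0, E = Scalars.size(); I != E; ++I)
    Vec = Builder.CreateInsertElement(Vec, Scalars[I], Builder.getInt32(I));
  return Vec; // every element was written, so the placeholder never leaks out
}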
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D111005
Added an additional check for constants after simplification of the
"select _, true, false" pattern. We need to prevent attempts to unswitch constant
conditions for two reasons:
a) Doing that doesn't make any sense; in the best case it will just burn
some compile time.
b) SimpleLoopUnswitch isn't designed to unswitch constant conditions
(due to (a)), so attempting that can cause miscompiles. The attached
testcase is an example of such a miscompile.
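A minimal sketch of the shape of the added guard, with illustrative names (the real check and its placement in SimpleLoopUnswitch differ):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Value.h"
using namespace llvm;

static bool isUnswitchableCondition(Value *Cond) {
  // After "select %c, true, false" is simplified down to %c, the condition
  // may itself have become a constant; unswitching it is pointless and the
  // rest of the pass does not expect it.
  return !isa<Constant>(Cond);
}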
Also added an assertion to make sure we aren't trying to replace
constants, which will help us prevent such bugs in the future. The assertion
from D110751 is another layer of protection against such cases.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D110752
This is no-externally-visible-functional-difference-intended.
That is, the test diffs show identical instructions other than
name changes (those are included specifically to verify the logic).
The existing transforms created extra instructions and relied
on subsequent folds to get to the final result, but that could
conflict with other transforms like the proposed D110170 (and
caused that patch to be reverted twice so far because of infinite
combine loops).
We currently expose the fact that we rely on unsigned wrapping to iterate
through all indexes, which can be confusing. Keeping that as an
implementation detail behind an iterator is clearer and requires less
code.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D110885
D104809 changed `buildTree_rec` to check for extract element instructions
with scalable types. However, if the extract is extended or truncated,
these changes do not apply and we assert later on in isShuffle(), which
attempts to cast the type of the extract to FixedVectorType.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D110640
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant;
however, we can support any kind of compare, integer or float.
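As a plain C++ model of just the reduction in the loop above (the store to src[i] is omitted, and a vectorization factor of 4 is an arbitrary choice; this is not the generated code), per-lane flags plus a final OR-reduction and a single select reproduce the scalar result:

#include <array>

int reduceSelectCmp(const int *src, int n, int a, int b) {
  std::array<bool, 4> lane{}; // per-lane "predicate ever held" flags
  int i = 0;
  for (; i + 4 <= n; i += 4)
    for (int l = 0; l < 4; ++l)
      lane[l] = lane[l] || (src[i + l] > 3);
  bool any = lane[0] || lane[1] || lane[2] || lane[3]; // OR-reduce the lanes
  for (; i < n; ++i) // scalar remainder
    any = any || (src[i] > 3);
  return any ? b : a; // select the loop-invariant 'b' if any lane picked it
}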
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
In situations where the coroutine function is not split, we can just
replace the async.resume with null.
rdar://82591919
Differential Revision: https://reviews.llvm.org/D110191
This is NFCI because the pattern with 2 left-shifts should get
folded independently by smaller folds.
The motivation is to refine this block to avoid infinite loops
seen with D110170.
This patch improves the effectiveness of BDCE's debug info salvaging
by processing the instructions in reverse order and delaying
dropAllReferences until after debug info salvaging. This allows
salvaging of entire chains of deleted instructions!
Previously we would remove all references from an instruction, which
would make it impossible to use that instruction to salvage a later
instruction in the instruction stream, because its operands were
already removed.
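A rough sketch of the ordering this implies, with simplified names and assuming the dead-instruction worklist is in program order (not the exact BDCE code):

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;

static void removeDeadInstructions(SmallVectorImpl<Instruction *> &Dead) {
  // Salvage in reverse program order: a later instruction's debug uses are
  // rewritten in terms of its operands, and those earlier operands are
  // salvaged afterwards, so entire chains are recovered.
  for (Instruction *I : reverse(Dead))
    salvageDebugInfo(*I);
  // Only now break the use chains and delete the instructions.
  for (Instruction *I : Dead)
    I->dropAllReferences();
  for (Instruction *I : Dead)
    I->eraseFromParent();
}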
This reapplies the previous patch with a fix for a use-after-free.
Differential Revision: https://reviews.llvm.org/D110568
After rG452714f8f8037ff37f9358317651d1652e231db2, the Function `F` retrieved in LoopPredication is not used.
Remove this unused variable to stop some buildbots (ASAN, clang-ppc) from failing.
This is analogous to D86156 (which preserves "lossy" BFI in loop
passes). Lossy means that the analysis preserved may not be up to date
with regards to new blocks that are added in loop passes, but BPI will
not contain stale pointers to basic blocks that are deleted by the loop
passes.
This is achieved through BasicBlockCallbackVH in BPI, whose callback calls
eraseBlock to update BPI's data structures whenever a basic
block is deleted.
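A simplified sketch of the callback-handle mechanism (the names and details of BPI's real BasicBlockCallbackVH differ):

#include "llvm/Analysis/BranchProbabilityInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/ValueHandle.h"
using namespace llvm;

// One handle is registered per block; when a loop pass deletes the block,
// deleted() fires and BPI drops its cached data, so the preserved (lossy)
// result never holds stale BasicBlock pointers.
class BlockEraseHandle final : public CallbackVH {
  BranchProbabilityInfo &BPI;
  void deleted() override {
    BPI.eraseBlock(cast<BasicBlock>(getValPtr()));
    setValPtr(nullptr);
  }
public:
  BlockEraseHandle(BasicBlock *BB, BranchProbabilityInfo &BPI)
      : CallbackVH(BB), BPI(BPI) {}
};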
This patch does not have any changes in the upstream pipeline, since
none of the loop passes in the pipeline use BPI currently.
However, since BPI wasn't previously preserved in loop passes, the loop
predication pass was invoking BPI *on the entire
function* every time it ran in an LPM. This caused a massive compile-time
increase in our downstream LPM invocation which contained loop predication.
See the updated test, which runs a loop pipeline containing loop
predication with -debug-pass turned ON.
Reviewed By: asbirlea, modimo
Differential Revision: https://reviews.llvm.org/D110438
Summary:
The RTL functions added in https://reviews.llvm.org/D110429 were
mistakenly left out from the list of safe runtime calls in AAKernelInfo.
This patch adds them in.
It can happen that after widening of the IV, flattening may not be possible,
e.g. when it is deemed unprofitable. We were not properly checking this, which
resulted in flattening being applied when it should not be, leading to
incorrect results (miscompiles).
This should fix PR51980 (https://bugs.llvm.org/show_bug.cgi?id=51980)
Differential Revision: https://reviews.llvm.org/D110712
The test is from https://llvm.org/PR51351.
There are 2 related logic bugs from over-generalizing "lshr" to "any shr",
but I'm not sure how to expose the difference for "MaskC" because instsimplify
already folds ashr of -1.
I'll extend instsimplify to catch the MaskD pattern as a follow-up, but this
patch should be enough to avoid the miscompile.
This fixes an issue exposed by D71539, where IndVarSimplify tries
to access an invalid cached SCEV expression after making changes to the
underlying PHI instruction earlier.
When changing the incoming value of a PHI, forget the cached SCEV for
the PHI.
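A minimal sketch of the shape of the fix, with illustrative names (the in-tree change lives inside IndVarSimplify's rewriting logic):

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

static void replaceIncomingValue(PHINode *PN, unsigned Idx, Value *NewVal,
                                 ScalarEvolution &SE) {
  PN->setIncomingValue(Idx, NewVal);
  // The cached SCEV for the PHI may describe the old incoming value; drop it
  // so later queries recompute it instead of using a stale expression.
  SE.forgetValue(PN);
}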
We generate symbols like `profc`/`profd` for each function, and put them into csects.
When there are weak functions, we generate weak symbols for those functions as well;
with ELF (and some other formats), the linker (binder) will discard them and keep only one copy of the weak symbols.
However, on AIX the current binder can NOT discard the weak symbols if we put all of them into the same csect,
because the binder can NOT discard a subset of a csect.
This creates a unique challenge for using those symbols to calculate some relative offsets.
This patch changes the linkage of the `profc`/`profd` symbols to private, so that the profc/profd for each weak function are *local* to their objects and are all kept in the csect, so we won't have that problem. Although only one of the counters will actually be used, all the pointers in each profd remain correct.
The downside is that we won't be able to discard the duplicated counters and profile data,
but those cannot be discarded even if we keep the weak linkage,
due to the binder limitation of not being able to discard a subset of the csect either.
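The linkage change itself is small; roughly, and with illustrative names rather than the exact InstrProf lowering code:

#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/GlobalVariable.h"
using namespace llvm;

static void makeProfileSymbolsLocal(GlobalVariable &CounterGV,
                                    GlobalVariable &DataGV) {
  // Keep each weak function's __profc_*/__profd_* local to its own object
  // file so the AIX binder never needs to discard part of a csect.
  CounterGV.setLinkage(GlobalValue::PrivateLinkage);
  DataGV.setLinkage(GlobalValue::PrivateLinkage);
}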
Reviewed By: Whitney, MaskRay
Differential Revision: https://reviews.llvm.org/D110422
This is NFCI (no-functional-change-intended), but there
are benign diffs possible with commutable ops as seen in
the test diffs.
The transforms were repeated for the commutative opcodes,
but that should not be necessary if we canonicalize the
patterns that we're matching. If both operands of the
binop match, that should get folded eventually.
The transform that starts with a mask op seems to
over-constrain the use checks, so that could be a
potential enhancement.
This reverts commit f6954bf804.
This breaks the test-suite O3 build:
/home/nikic/llvm-test-suite/build-O3/tools/timeit --summary Bitcode/Benchmarks/Halide/local_laplacian/CMakeFiles/halide_local_laplacian.dir/local_laplacian.bc.o.time /home/nikic/llvm-project/build/bin/clang++ -DNDEBUG -O3 -w -Werror=date-time -save-stats=obj -save-stats=obj -std=c++11 -MD -MT Bitcode/Benchmarks/Halide/local_laplacian/CMakeFiles/halide_local_laplacian.dir/local_laplacian.bc.o -MF Bitcode/Benchmarks/Halide/local_laplacian/CMakeFiles/halide_local_laplacian.dir/local_laplacian.bc.o.d -o Bitcode/Benchmarks/Halide/local_laplacian/CMakeFiles/halide_local_laplacian.dir/local_laplacian.bc.o -c ../Bitcode/Benchmarks/Halide/local_laplacian/local_laplacian.bc
While deleting: i64 %
Use still stuck around after Def is destroyed: %12620 = mul i64 %12619, <badref>
clang++: /home/nikic/llvm-project/llvm/lib/IR/Value.cpp:103: llvm::Value::~Value(): Assertion `materialized_use_empty() && "Uses remain when a value is destroyed!"' failed.
This patch improves the effectiveness of BDCE's debug info salvaging
by processing the instructions in reverse order and delaying
dropAllReferences until after debug info salvaging. This allows
salvaging of entire chains of deleted instructions!
Previously we would remove all references from an instruction, which
would make it impossible to use that instruction to salvage a later
instruction in the instruction stream, because its operands were
already removed.
Differential Revision: https://reviews.llvm.org/D110568
This patch improves the effectiveness of ADCE's debug info salvaging
by processing the instructions in reverse order and delaying
dropAllReferences until after debug info salvaging. This allows
salvaging of entire chains of deleted instructions!
Previously we would remove all references from an instruction, which
would make it impossible to use that instruction to salvage a later
instruction in the instruction stream, because its operands were
already removed.
Differential Revision: https://reviews.llvm.org/D110462
This patch enables debug info salvaging for truncating/extending ptr
int conversions. The testcase uncovered a bug in adce, which is
addressed separately.
rdar://80227769
Differential Revision: https://reviews.llvm.org/D110461
This commit is the InstCombine follow-up to the previous constant-folding
change that enables noticeable optimizations for CHERI-enabled targets.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D110247
Try to improve vectorization of PHI nodes by first trying to vectorize
similar instructions at the widest possible vector size, then aggregating
them with PHIs of compatible type and trying to vectorize again; only if
this fails, try smaller vector factors for the compatible PHI nodes. This
restores the performance of several benchmarks after the tuning of the
fp/int conversion instruction costs.
Differential Revision: https://reviews.llvm.org/D108740
In rG6a076fa9539e, a problem with updating the old/narrow phi nodes after IV
widening was introduced. If the transformation is *not* applied after widening
of the IV, the narrow phi node was incorrectly modified anyway, which should only
happen when flattening actually takes place. This can be seen in the added test
widen-iv2.ll, where the phi incorrectly had 1 incoming value but should keep its
original 2 incoming values; this is now restored.
Differential Revision: https://reviews.llvm.org/D110234