A cost query for a vector instruction should return a cost even without
target vector support, and not trigger an assert.
VectorCombine does this with an input containing source code vectors.
Review: Ulrich Weigand
Summary:
There is a flaw in memory dependence analysis caching mechanism when memory accesses with TBAA are involved. Assume we first analysed and cached results for access with TBAA. Later we request dependence for the same memory but without TBAA (or different TBAA). By design these two queries should share one entry in the internal cache which corresponds to a general access (without TBAA). Thus upon second request internal cached is cleared and we continue analysis for access as if there is no TBAA.
The problem is that even though internal cache is cleared the set of visited nodes is not. That means we won't traverse visited nodes again and populate internal cache with the corresponding dependence results. So we end up with internal cache in an incomplete state. Current implementation tries to signal that situation by resetting CacheInfo->Pair at line 1104. But that doesn't actually help since later code ignores this invalidation and relies on 'Cache->empty()' property to decide on cache completeness.
Reviewers: reames, hfinkel, chandlerc, fedor.sergeev, asbirlea, fhahn, john.brawn, Prazek, sunfish
Reviewed By: john.brawn
Subscribers: DaniilSuchkov, kosarev, jfb, dantrushin, hiraditya, bmahjour, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73032
A question about this behavior came up on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-February/139003.html
...and as part of backend improvements in D73978, but this is an IR
change first because we already have fairly thorough tests in place
here.
We decided not to implement a more general change that would have
folded any FP binop with nearly arbitrary constant + undef operand
to undef because that is not theoretically correct (even if it is
practically correct).
Differential Revision: https://reviews.llvm.org/D74713
Summary:
As mentioned in D71974, it is useful for must-be-executed-context to explore CFG backwardly.
This patch is ported from parts of D64975. We use a dominator tree to find the previous context if
a dominator tree is available.
Reviewers: jdoerfert, hfinkel, baziotis, sstefan1
Reviewed By: jdoerfert
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74817
Summary:
This is the last functional patch affecting the representation of DDG.
Here we try to simplify the DDG to reduce the number of nodes and edges by
iteratively merging pairs of nodes that satisfy the following conditions,
until no such pair can be identified. A pair of nodes consisting of a and b
can be merged if:
1. the only edge from a is a def-use edge to b and
2. the only edge to b is a def-use edge from a and
3. there is no cyclic edge from b to a and
4. all instructions in a and b belong to the same basic block and
5. both a and b are simple (single or multi instruction) nodes.
These criteria allow us to fold many uninteresting def-use edges that
commonly exist in the graph while avoiding the risk of introducing
dependencies that didn't exist before.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72350
Summary:
This is second attempt to fix the problem with incorrect dependencies reported in presence of invariant load. Initial fix (https://reviews.llvm.org/D64405) was reverted due to a regression reported in https://reviews.llvm.org/D70516.
The original fix changed caching behavior for invariant loads. Namely such loads are not put into the second level cache (NonLocalDepInfo). The problem with that fix is the first level cache (CachedNonLocalPointerInfo) still works as if invariant loads were in the second level cache. The solution is in addition to not putting dependence results into the second level cache avoid putting info about invariant loads into the first level cache as well.
Reviewers: jdoerfert, reames, hfinkel, efriedma
Reviewed By: jdoerfert
Subscribers: DaniilSuchkov, hiraditya, bmahjour, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73027
Summary:
Bail out early for scalable vectors. As global variables are not expected
to be scalable.
Use explicit call of getFixedSize() to assert on places where scalable size
doesn't make sense.
Reviewers: sdesmalen, efriedma, apazos, huntergr, willlovett
Reviewed By: sdesmalen
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74424
With the fixed implementation of the "remainder" operation in
rG9d0956ebd471, we can now add support to folding calls to it.
Differential Revision: https://reviews.llvm.org/D69777
Summary:
Consider:
%r = call i32 @llvm.amdgcn.writelane(i32 0, i32 1, i32 2)
This produces a value that is 0 on lane 1, and 2 everywhere else; i.e.,
it is divergent.
Reported-by: Marek Olsak <Marek.Olsak@amd.com>
Reviewers: arsenm, foad, mareko
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74400
Summary:
Part of the changes in D44564 made BasicAA not CFG only due to it using
PhiAnalysisValues which may have values invalidated.
Subsequent patches (rL340613) appear to have addressed this limitation.
BasicAA should not be invalidated by non-CFG-altering passes.
A concrete example is MemCpyOpt which preserves CFG, but we are testing
it invalidates BasicAA.
llvm-dev RFC: https://groups.google.com/forum/#!topic/llvm-dev/eSPXuWnNfzM
Reviewers: john.brawn, sebpop, hfinkel, brzycki
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74353
LoopCacheAnalysis currently assumes the loop will be iterated over in
a forward direction. This patch addresses the issue by using the
absolute value of the stride when iterating backwards.
Note: this patch will treat negative and positive array access the
same, resulting in the same cost being calculated for single and
bi-directional access patterns. This should be improved in a
subsequent patch.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D73064
Do not iterate on scalable vector type in BitCastConstantVector.
Continuation work of D70985, D71147.
Support for folding bitcast into splat value is kept in D74095, as
it depends on D71637.
Differential Revision: https://reviews.llvm.org/D71389
The mask results of these should be uniform. The trickier part is the
dummy booleans used as IR glue need to be treated as divergent. This
should make the divergence analysis results correct for the IR the DAG
is constructed from.
This should allow us to eliminate requiresUniformRegister, which has
an expensive, recursive scan over all users looking for control flow
intrinsics. This should avoid recent compile time regressions.
Summary:
* Most of the simplifications in SimplifyShuffleVectorInst depend on the
concrete value of, or the length of the mask vector. For scalable
vectors, this cannot be known at compile time.
** for these tests, detect if the vector is scalable before attempting
the transformation
* The functions ShuffleVectorInst::getMaskValue and
ShuffleVectorInst::getShuffleMask access the value of the constant mask.
However, since the length of the mask is unknown at compile time, these
function do not work for scalable vectors. Add asserts to ensure that
the input mask is not scalable
Reviewers: efriedma, sdesmalen, apazos, chrisj, huihuiz
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73555
Side notes from D73669, no need to guard the iteration on vectors, as
it is explicitly looking for a ConstantVector/ConstantDataVector, which
is not expected to be scalable at the moment. So, add the test only.
Summary:
Similar to issue D71445. Scalable vector should not be evaluated element by element.
Add support to handle scalable vector UndefValue.
Reviewers: sdesmalen, efriedma, apazos, huntergr, willlovett
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73678
We seem to be inheriting the cost from sse4.1. But if we have 256-bit registers we should be able to do this with just one extract to split the 16i16 and two v8i16->v8i32 operations so our cost should be 3 not 4.
Differential Revision: https://reviews.llvm.org/D73646
Summary:
Treat scalable allocas as if they have storage size of 0, and
scalable-typed memory accesses as if their range is unlimited.
This is not a proper support of scalable vector types in the analysis -
we can do better, but not today.
Reviewers: vitalybuka
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73394
These bots failed for this several months ago and as a result, this
check was removed. If they still fail I'm going to try to see if I
can figure out why.
Summary:
Enable the new diveregence analysis by default for AMDGPU.
Resubmit with test updates since GPUDA was causing failures on Windows.
Reviewers: rampitec, nhaehnle, arsenm, thakis
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73315
Summary: Fixes crash that could occur when a divergent terminator has an unreachable parent.
Reviewers: rampitec, nhaehnle, arsenm
Subscribers: jvesely, wdng, hiraditya, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73323
This is a very basic MVE gather/scatter cost model, based roughly on the
code that we will currently produce. It does not handle truncating
scatters or extending gathers correctly yet, as it is difficult to tell
that they are going to be correctly extended/truncated from the limited
information in the cost function.
This can be improved as we extend support for these in the future.
Based on code originally written by David Sherwood.
Differential Revision: https://reviews.llvm.org/D73021
llvm.memset intrinsics do only write memory, but are missing
IntrWriteMem, so they doesNotReadMemory() returns false for them.
The test change is due to the test checking the fn attribute ids at the
call sites, which got bumped up due to a new combination with writeonly
appearing in the test file.
Reviewers: jdoerfert, reames, efriedma, nlopes, lebedev.ri
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D72789
If addrecexpr has nuw flag, the value should never be less than its
start value and start value does not required to be SCEVConstant.
Reviewed By: nikic, sanjoy
Differential Revision: https://reviews.llvm.org/D71690
Summary:
This patch associates ordinal numbers to the DDG Nodes allowing
the builder to order nodes within a pi-block in program order. The
algorithm works by simply assuming the order in which the BBList
is fed into the builder. The builder already relies on the blocks being
in program order so that it can compute the dependencies correctly.
Similarly the order of instructions in their parent basic blocks
determine their program order.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70986
In order to use assumptions, computeKnownBits needs a context
instruction. We can use the GEP, if it is an instruction. We already
pass the assumption cache, but it cannot be used without a context
instruction.
Reviewers: anemet, asbirlea, hfinkel, spatel
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D71264
If the pointer was loaded/stored before the null check, the check
is redundant and can be removed. For now the optimizers do not
remove the nullptr check, see https://gcc.godbolt.org/z/H2r5GG.
The patch allows to use more nonnull constraints. Also, it found
one more optimization in some PowerPC test. This is my first llvm
review, I am free to any comments.
Differential Revision: https://reviews.llvm.org/D71177
Summary:
The current da printer shows the dependence without indicating
which instructions are being considered as the src vs dst. It
also silently ignores call instructions, despite the fact that
they create confused dependence edges to other memory
instructions. This patch addresses these two issues plus a
couple of minor non-functional improvements.
Authored By: bmahjour
Reviewer: dmgreen, fhahn, philip.pfaffe, chandlerc
Reviewed By: dmgreen, fhahn
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71088
Don't try to fold away shuffles which can't be folded. Fix creation of
shufflevector constant expressions.
Differential Revision: https://reviews.llvm.org/D71147
This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
This adds some extra cost model tests for shifts, and does some minor
adjustments to some Neon code to make it clear as to what it applies to.
Both NFC.
This is a follow-up to D70607 where we made any
extract element on SLM more costly than default. But that is
pessimistic for extract from element 0 because that corresponds
to x86 movd/movq instructions. These generally have >1 cycle
latency, but they are probably implemented as single uop
instructions.
Note that no vectorization tests are affected by this change.
Also, no targets besides SLM are affected because those are
falling through to the default cost of 1 anyway. But this will
become visible/important if we add more specializations via cost
tables.
Differential Revision: https://reviews.llvm.org/D71023
Summary:
This fixes the memory leak in bec37c3fc7
and re-delivers the reverted patch.
In this patch the DDG DAG is sorted topologically to put the
nodes in the graph in the order that would satisfy all
dependencies. This helps transformations that would like to
generate code based on the DDG. Since the DDG is a DAG a
reverse-post-order traversal would give us the topological
ordering. This patch also sorts the basic blocks passed to
the builder based on program order to ensure that the
dependencies are computed in the correct direction.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70609
Summary: b19ec1eb3d has been reverted because of the test failures
with PowerPC targets. This patch addresses the issues from the previous
commit.
Test Plan: ninja check-all. Confirmed that CodeGen/PowerPC/pr36292.ll
and CodeGen/PowerPC/sms-cpy-1.ll pass
Subscribers: llvm-commits
The Power 9 CPU has some features that are unlikely to be passed on to future
versions of the CPU. This patch separates this out so that future CPU does not
inherit them.
Differential Revision: https://reviews.llvm.org/D70466
I'm not sure what the effect of this change will be on all of the affected
tests or a larger benchmark, but it fixes the horizontal add/sub problems
noted here:
https://reviews.llvm.org/D59710?vs=227972&id=228095&whitespace=ignore-most#toc
The costs are based on reciprocal throughput numbers in Agner's tables for
PEXTR*; these appear to be very slow ops on Silvermont.
This is a small step towards the larger motivation discussed in PR43605:
https://bugs.llvm.org/show_bug.cgi?id=43605
Also, it seems likely that insert/extract is the source of perf regressions on
other CPUs (up to 30%) that were cited as part of the reason to revert D59710,
so maybe we'll extend the table-based approach to other subtargets.
Differential Revision: https://reviews.llvm.org/D70607
Summary:
While updatePostDominatedByUnreachable attemps to find basic blocks that are post-domianted by unreachable blocks, it currently cannot handle loops precisely, because it doesn't use the actual post dominator tree analysis but relies on heuristics of visiting basic blocks in post-order. More precisely, when the entire loop is post-dominated by the unreachable block, current algorithm fails to detect the entire loop as post-dominated by the unreachable because when the algorithm reaches to the loop latch it fails to tell all its successors (including the loop header) will "eventually" be post-domianted by the unreachable block, because the algorithm hasn't visited the loop header yet. This makes BPI for the loop latch to assume that loop backedges are taken with 100% of probability. And because of this, block frequency info sometimes marks virtually dead loops (which are post dominated by unreachable blocks) super hot, because 100% backedge-taken probability makes the loop iteration count the max value. updatePostDominatedByColdCall has the exact same problem as well.
To address this problem, this patch makes PostDominatedByUnreachable/PostDominatedByColdCall to be computed with the actual post-dominator tree.
Reviewers: skatkov, chandlerc, manmanren
Reviewed By: skatkov
Subscribers: manmanren, vsk, apilipenko, Carrot, qcolombet, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70104
Summary:
In this patch the DDG DAG is sorted topologically to put the
nodes in the graph in the order that would satisfy all
dependencies. This helps transformations that would like to
generate code based on the DDG. Since the DDG is a DAG a
reverse-post-order traversal would give us the topological
ordering. This patch also sorts the basic blocks passed to
the builder based on program order to ensure that the
dependencies are computed in the correct direction.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70609
For the various trip-count tests, the classification isn't useful and makes the auto-generated tests super verbose. By skipping it, we make the auto-gen tests closer to the manually written ones. Up next: auto-genning a bunch of the existings tests.
Simple loop unswitch likes to leave around unsimplified and/or/xors. SCEV today bails out on these idioms which is unfortunate in general, and specifically for the unswitch interaction.
Differential Revision: https://reviews.llvm.org/D70459
Moving accesses in MemorySSA at InsertionPlace::End, when an instruction is
moved into a block, almost always means insert at the end of the block, but
before the block terminator. This matters when the block terminator is a
MemoryAccess itself (an invoke), and the insertion must be done before
the terminator for the update to be correct.
Insert an additional position: InsertionPlace:BeforeTerminator and update
current usages where this applies.
Resolves PR44027.
If we partially unswitch a loop, we leave around the (and i1 X, true) or (or i1 X, false) forms. At the moment, this inhibits SCEVs ability to compute trip counts, patch forthcoming.
Currently we miss folds with undef and identity values for binary ops
that do not fold to undef in general.
We can generalize the identity simplifications and do them before
checking for undef in particular.
Alive checks:
* OR - https://rise4fun.com/Alive/8OsK
* AND - https://rise4fun.com/Alive/e3tE
This will also allow us to remove some now redundant cases throughout
the function, but I would like to do this as follow-up. That should make
tracking down potential issues easier.
Reviewers: spatel, RKSimon, lebedev.ri
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D70169
Ensure the stride and trip count have the same type before multiplying them during reference cost calculation
Reviewed By: jdoefert
Differential Revision: https://reviews.llvm.org/D70192
This is no longer needed after widening legalization as we
custom legalize v8i8 ourselves.
Added entries to the cost model, but bumped the cost slightly
to account for the truncate shuffle that wasn't costed before.
Summary:
If there are any internal methods whose address was taken, conclude there is nothing known in relation of any other internal method and a global.
Reviewers: nlopes, sanjoy.google
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69690
Summary:
This patch adds Pi Blocks to the DDG. A pi-block represents a group of DDG
nodes that are part of a strongly-connected component of the graph.
Replacing all the SCCs with pi-blocks results in an acyclic representation
of the DDG. For example if we have:
{a -> b}, {b -> c, d}, {c -> a}
the cycle a -> b -> c -> a is abstracted into a pi-block "p" as follows:
{p -> d} with "p" containing: {a -> b}, {b -> c}, {c -> a}
In this implementation the edges between nodes that are part of the pi-block
are preserved. The crossing edges (edges where one end of the edge is in the
set of nodes belonging to an SCC and the other end is outside that set) are
replaced with corresponding edges to/from the pi-block node instead.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tag: #llvm
Differential Revision: https://reviews.llvm.org/D68827
ShuffleVectorInst::isExtractSubvectorMask, introduced in
[CostModel] Add SK_ExtractSubvector handling to getInstructionThroughput (PR39368)
erroneously thought that
%340 = shufflevector <4 x float> %339, <4 x float> undef, <3 x i32> <i32 2, i32 3, i32 undef>
is a subvector extract, even though it goes off the end of the parent
vector with the undef index. That then caused an assert in
BasicTTIImplBase::getExtractSubvectorOverhead.
This commit fixes that, by not considering the above a subvector
extract.
Differential Revision: https://reviews.llvm.org/D70005
Change-Id: I87b8b00b24bef19ffc9a1b82ef4eca3b8a246eaf
This better represents the kshift+binop we'd get for each stage
before the final extract. Its likely we'll do even better by
doing a kmov and a cmp with a GPR, but this is a good start.
The default handling was costing a worst case single source
permute shuffle of the vector before the binop. This worst
case assumes the shuffle might have to be emulated with
extracts and inserts. But since we know we're doing a reduction
we can assume we'll get kshift lowering.
There's still some room for improvement here, but this is
much better than it was.
Summary:
If a conditional branch is encountered we can try to find a join block
where the execution is known to continue. This means finding a suitable
block, e.g., the immediate post dominator of the conditional branch, and
proofing control will always reach that block.
This patch implements different techniques that work with and without
provided analysis.
Reviewers: uenoku, sstefan1, hfinkel
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68933
Summary:
Getelementptr has vector type if any of its operands are vectors
(the scalar operands being implicitly broadcast to all vector elements).
Extractelement applied to a vector getelementptr can be folded by
applying the extractelement in turn to all of the vector operands.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69379
This is a common idiom which arises after induction variables are widened, and we have two or more exit conditions. Interestingly, we don't have instcombine or instsimplify support for this either.
Differential Revision: https://reviews.llvm.org/D69006
llvm-svn: 375349