ShuffleVectorInst::isExtractSubvectorMask, introduced in
[CostModel] Add SK_ExtractSubvector handling to getInstructionThroughput (PR39368)
erroneously thought that
%340 = shufflevector <4 x float> %339, <4 x float> undef, <3 x i32> <i32 2, i32 3, i32 undef>
is a subvector extract, even though it goes off the end of the parent
vector with the undef index. That then caused an assert in
BasicTTIImplBase::getExtractSubvectorOverhead.
This commit fixes that, by not considering the above a subvector
extract.
Differential Revision: https://reviews.llvm.org/D70005
Change-Id: I87b8b00b24bef19ffc9a1b82ef4eca3b8a246eaf
This better represents the kshift+binop we'd get for each stage
before the final extract. Its likely we'll do even better by
doing a kmov and a cmp with a GPR, but this is a good start.
The default handling was costing a worst case single source
permute shuffle of the vector before the binop. This worst
case assumes the shuffle might have to be emulated with
extracts and inserts. But since we know we're doing a reduction
we can assume we'll get kshift lowering.
There's still some room for improvement here, but this is
much better than it was.
Summary:
If a conditional branch is encountered we can try to find a join block
where the execution is known to continue. This means finding a suitable
block, e.g., the immediate post dominator of the conditional branch, and
proofing control will always reach that block.
This patch implements different techniques that work with and without
provided analysis.
Reviewers: uenoku, sstefan1, hfinkel
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68933
Summary:
Getelementptr has vector type if any of its operands are vectors
(the scalar operands being implicitly broadcast to all vector elements).
Extractelement applied to a vector getelementptr can be folded by
applying the extractelement in turn to all of the vector operands.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69379
This is a common idiom which arises after induction variables are widened, and we have two or more exit conditions. Interestingly, we don't have instcombine or instsimplify support for this either.
Differential Revision: https://reviews.llvm.org/D69006
llvm-svn: 375349
Add specific scalar costs for CTLZ instructions, we can't discriminate between CTLZ and CTLZ_ZERO_UNDEF so we have to assume the worst. Given how BSR is often a microcoded nightmare on some older targets we might still be underestimating it.
For targets supporting LZCNT (Intel Haswell+ or AMD Fam10+), we provide overrides that assume 1cy costs.
llvm-svn: 374786
Add specific scalar costs for ctpop instructions, these are based on the llvm-mca's SLM throughput numbers (the oldest model we have).
For targets supporting POPCNT, we provide overrides that assume 1cy costs.
llvm-svn: 374775
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.
I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.
llvm-svn: 374655
When simplifying a Phi to the unique value found incoming, check that
there wasn't a Phi already created to break a cycle. If so, remove it.
Resolves PR43541.
Some additional nits included.
llvm-svn: 374471
Summary:
Whenever we get the previous definition, the assumption is that the
recursion starts ina reachable block.
If the recursion starts in an unreachable block, we may recurse
indefinitely. Handle this case by returning LoE if the block is
unreachable.
Resolves PR43426.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68809
llvm-svn: 374447
Summary:
The rule for the moveAllAfterMergeBlocks API si for all instructions
from `From` to have been moved to `To`, while keeping the CFG edges (and
block terminators) unchanged.
Update all the callsites for moveAllAfterMergeBlocks to follow this.
Pending follow-up: since the same behavior is needed everytime, merge
all callsites into one. The common denominator may be the call to
`MergeBlockIntoPredecessor`.
Resolves PR43569.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68659
llvm-svn: 374177
MemoryPhis should be added in the IDF of the blocks newly gaining Defs.
This includes the blocks that gained a Phi and the block gaining a Def,
if the block did not have one before.
Resolves PR43427.
llvm-svn: 373505
Summary:
This patch adds Root Node to the DDG. The purpose of the root node is to create a single entry node that allows graph walk iterators to iterate through all nodes of the graph, making sure that no node is left unvisited during a graph walk (eg. SCC or DFS). Once the DDG is fully constructed it will have exactly one root node. Every node in the graph is reachable from the root. The algorithm for connecting the root node is based on depth-first-search that keeps track of visited nodes to try to avoid creating unnecessary edges.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto, ppc-slack
Tag: #llvm
Differential Revision: https://reviews.llvm.org/D67970
llvm-svn: 373386
If a single predecessor is found, still check if the block is
unreachable. The test that found this had a self loop unreachable block.
Resolves PR43493.
llvm-svn: 373383
The check for "was there an access in this block" should be: is the last
access in this block and is it not a newly inserted phi.
Resolves new test in PR43438.
Also fix a typo when simplifying trivial Phis to match the comment.
llvm-svn: 373380
This reverts r366419 because the analysis performed is within the context of
the loop and it's only valid to add wrapping flags to "global" expressions if
they're always correct.
llvm-svn: 373184
SLM is 2 x slower for <2 x i64> comparison ops than other vector types, we should account for this like we do for SLM <2 x i64> add/sub/mul costs.
This should remove some of the SLM codegen diffs in D43582
llvm-svn: 372954
Summary:
If a block has all incoming values with the same MemoryAccess (ignoring
incoming values from unreachable blocks), then use that incoming
MemoryAccess and do not create a Phi in the first place.
Revert IDF work-around added in rL372673; it should not be required unless
the Def inserted is the first in its block.
The patch also cleans up a series of tests, added during the many
iterations on insertDef.
The patch also fixes PR43438.
The same issue that occurs in insertDef with "adding phis, hence the IDF of
Phis is needed", can also occur in fixupDefs: the `getPreviousRecursive`
call only adds Phis walking on the predecessor edges, which means there
may be the case of a Phi added walking the CFG "backwards" which
triggers the needs for an additional Phi in successor blocks.
Such Phis are added during fixupDefs only in the presence of unreachable
blocks.
Hence this highlights the need to avoid adding Phis in blocks with
unreachable predecessors in the first place.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67995
llvm-svn: 372932
Summary:
MemoryPhis may be needed following a Def insertion inthe IDF of all the
new accesses added (phis + potentially a def). Ensure this also occurs when
only the new MemoryPhis are the defining accesses.
Note: The need for computing IDF here is because of new Phis added with
edges incoming from unreachable code, Phis that had previously been
simplified. The preferred solution is to not reintroduce such Phis.
This patch is the needed fix while working on the preferred solution.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67927
llvm-svn: 372673
We are missing costs for a lot of truncation cases, I'm hoping to address all the 'zero cost' cases in trunc.ll
I thought this was a vector widening side effect, but even before this we had some interesting LV decisions (notably over indvars) being made due to these zero costs.
llvm-svn: 372498
The recently announced IBM z15 processor implements the architecture
already supported as "arch13" in LLVM. This patch adds support for
"z15" as an alternate architecture name for arch13.
The patch also uses z15 in a number of places where we used arch13
as long as the official name was not yet announced.
llvm-svn: 372435
At present, `-scalar-evolution-max-iterations` is a `cl::Optional`
option, which means it demands to be passed exactly zero or one times.
Our build system makes it pretty tricky to guarantee this. We often
accidentally pass the flag more than once (but always with the same
value) which results in an error, after which compilation fails:
```
clang (LLVM option parsing): for the -scalar-evolution-max-iterations option: may only occur zero or one times!
```
It seems reasonable to allow -scalar-evolution-max-iterations to be
passed more than once. Quoting the [[ http://llvm.org/docs/CommandLine.html#controlling-the-number-of-occurrences-required-and-allowed | documentation ]]:
> The cl::ZeroOrMore modifier ... indicates that your program will allow the option to be specified zero or more times.
> ...
> If an option is specified multiple times for an option of the cl::opt class, only the last value will be retained.
Original patch by: Enrico Bern Hardy Tanuwidjaja <etanuwid@fb.com>
Differential Revision: https://reviews.llvm.org/D67512
llvm-svn: 372346
Summary:
This is the first patch in a series of patches that will implement data dependence graph in LLVM. Many of the ideas used in this implementation are based on the following paper:
D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe (1981). DEPENDENCE GRAPHS AND COMPILER OPTIMIZATIONS.
This patch contains support for a basic DDGs containing only atomic nodes (one node for each instruction). The edges are two fold: def-use edges and memory-dependence edges.
The implementation takes a list of basic-blocks and only considers dependencies among instructions in those basic blocks. Any dependencies coming into or going out of instructions that do not belong to those basic blocks are ignored.
The algorithm for building the graph involves the following steps in order:
1. For each instruction in the range of basic blocks to consider, create an atomic node in the resulting graph.
2. For each node in the graph establish def-use edges to/from other nodes in the graph.
3. For each pair of nodes containing memory instruction(s) create memory edges between them. This part of the algorithm goes through the instructions in lexicographical order and creates edges in reverse order if the sink of the dependence occurs before the source of it.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur, fhahn, myhsu
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto
Tag: #llvm
Differential Revision: https://reviews.llvm.org/D65350
llvm-svn: 372238
Summary:
This fixes B42473 and B42706.
This patch makes the SDA propagate branch divergence until the end of the RPO traversal. Before, the SyncDependenceAnalysis propagated divergence only until the IPD in rpo order. RPO is incompatible with post dominance in the presence of loops. This made the SDA crash because blocks were missed in the propagation.
Reviewers: foad, nhaehnle
Reviewed By: foad
Subscribers: jvesely, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65274
llvm-svn: 372223
Summary:
This is the first patch in a series of patches that will implement data dependence graph in LLVM. Many of the ideas used in this implementation are based on the following paper:
D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe (1981). DEPENDENCE GRAPHS AND COMPILER OPTIMIZATIONS.
This patch contains support for a basic DDGs containing only atomic nodes (one node for each instruction). The edges are two fold: def-use edges and memory-dependence edges.
The implementation takes a list of basic-blocks and only considers dependencies among instructions in those basic blocks. Any dependencies coming into or going out of instructions that do not belong to those basic blocks are ignored.
The algorithm for building the graph involves the following steps in order:
1. For each instruction in the range of basic blocks to consider, create an atomic node in the resulting graph.
2. For each node in the graph establish def-use edges to/from other nodes in the graph.
3. For each pair of nodes containing memory instruction(s) create memory edges between them. This part of the algorithm goes through the instructions in lexicographical order and creates edges in reverse order if the sink of the dependence occurs before the source of it.
Authored By: bmahjour
Reviewer: Meinersbur, fhahn, myhsu, xtian, dmgreen, kbarton, jdoerfert
Reviewed By: Meinersbur, fhahn, myhsu
Subscribers: ychen, arphaman, simoll, a.elovikov, mgorny, hiraditya, jfb, wuzish, llvm-commits, jsji, Whitney, etiotto
Tag: #llvm
Differential Revision: https://reviews.llvm.org/D65350
llvm-svn: 372162
Summary:
When inserting a Def, the current algorithm is walking edges backward
and inserting new Phis where needed. There may be additional Phis needed
in the IDF of the newly inserted Def and Phis.
Adding Phis in the IDF of the Def was added ina previous patch, but we
may also need other Phis in the IDF of the newly added Phis.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67637
llvm-svn: 372138
Summary:
Regularly when moving an instruction that may not read or write memory,
the instruction is not modelled in MSSA, so not action is necessary.
For a non-conventional AA pipeline, MSSA needs to explicitly check when
creating accesses, so as to not model instructions that may not read and
write memory.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67562
llvm-svn: 372137
We were failing to compute trip counts (both exact and maximum) for any loop which involved a comparison against either an umin or smin. It looks like this simply got missed when we added smin/umin to SCEV. (Note: umin was submitted separately earlier today. Turned out two folks hit this at the same time.)
Differential Revision: https://reviews.llvm.org/D67514
llvm-svn: 371776
Expanding the folding of `nearbyint()`, `rint()` and `trunc()` to library
functions, in addition to the current support for intrinsics.
Differential revision: https://reviews.llvm.org/D67468
llvm-svn: 371774
This patch adds support for SCEVUMinExpr to getRangeRef,
similar to the support for SCEVUMaxExpr.
Reviewers: sanjoy.google, efriedma, reames, nikic
Reviewed By: sanjoy.google
Differential Revision: https://reviews.llvm.org/D67177
llvm-svn: 371768
Summary:
Do not model debuginfo intrinsics in MemorySSA.
Regularly these are non-memory modifying instructions. With -disable-basicaa, they were being modelled as Defs.
Reviewers: george.burgess.iv
Subscribers: aprantl, Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67307
llvm-svn: 371565
Since NaN is very rare in normal programs, so the probability for floating point unordered comparison should be extremely small. Current probability is 3/8, it is too large, this patch changes it to a tiny number.
Differential Revision: https://reviews.llvm.org/D65303
llvm-svn: 371541
Summary:
We already use the fact that an object with known size X does not alias
another objection of size Y > X before. With this commit, we use
dereferenceability information to determine a lower bound for Y and not
only rely on the user provided query size.
The result for @global_and_deref_arg_2() and @local_and_deref_ret_2()
in test/Analysis/BasicAA/dereferenceable.ll improved with this patch.
Reviewers: asbirlea, chandlerc, hfinkel, sanjoy
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66157
llvm-svn: 369786
Given an instruction I, the MustBeExecutedContextExplorer allows to
easily traverse instructions that are guaranteed to be executed whenever
I is. For now, these instruction have to be statically "after" I, in
the same or different basic blocks.
This patch also adds a pass which prints the must-be-executed-context
for each instruction in a module. It is used to test the
MustBeExecutedContextExplorer, for now on the examples given in the
class comment of the MustBeExecutedIterator.
Differential Revision: https://reviews.llvm.org/D65186
llvm-svn: 369765
I don't really understand the costs we're using for fp_to_sint,
but prior to widening legalization we used 20 as the cost for this
via the v2i64->v2f64 entry. That number seems better than the 40
we got with widening legalization. So now we need either a
v2i32->v2f64 entry or a v4i32->v2f64 entry depending on whether
AVX is enabled or not since we skip the first SSE2 table look up
under AVX.
llvm-svn: 369628
Summary:
Add a flag to the FunctionToLoopAdaptor that allows enabling MemorySSA only for the loop pass managers that are known to preserve it.
If an LPM is known to have only loop transforms that *all* preserve MemorySSA, then use MemorySSA if `EnableMSSALoopDependency` is set.
If an LPM has loop passes that do not preserve MemorySSA, then the flag passed is `false`, regardless of the value of `EnableMSSALoopDependency`.
When using a custom loop pass pipeline via `passes=...`, use keyword `loop` vs `loop-mssa` to use MemorySSA in that LPM. If a loop that does not preserve MemorySSA is added while using the `loop-mssa` keyword, that's an error.
Add the new `loop-mssa` keyword to a few tests where a difference occurs when enabling MemorySSA.
Reviewers: chandlerc
Subscribers: mehdi_amini, Prazek, george.burgess.iv, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66376
llvm-svn: 369548
Summary:
When inserting a new Def, and inserting Phis in the IDF when needed,
also mark the already existing Phis in the IDF as non-optimized, since
these may need fixing as well.
In the test attached, there is a Phi in the IDF that happens to be
trivial, and is wrongfully removed by the call to getLastDef that
follows. This is a valid situation and the existing IDF Phis need to
marked as "may need fixing" as well.
Resolves PR43044.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66495
llvm-svn: 369464
Summary:
When inserting uses from outside the MemorySSA creation, we don't
normally need to rename uses, based on the assumption that there will be
no inserted Phis (if Def existed that required a Phi, that Phi already
exists). However, when dealing with unreachable blocks, MemorySSA will
optimize away Phis whose incoming blocks are unreachable, and these Phis end
up being re-added when inserting a Use.
There are two potential solutions here:
1. Analyze the inserted Phis and clean them up if they are unneeded
(current method for cleaning up trivial phis does not cover this)
2. Leave the Phi in place and rename uses, the same way as whe inserting
defs.
This patch use approach 2.
Resolves first test in PR42940.
Reviewers: george.burgess.iv
Subscribers: Prazek, sanjoy.google, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66033
llvm-svn: 369291
This adds some sext costs for MVE, taken from the length of assembly sequences
that we currently generate.
Differential Revision: https://reviews.llvm.org/D66010
llvm-svn: 369244
MVE also has some sext of loads, which will be free just as scalar
instructions are.
Differential Revision: https://reviews.llvm.org/D66008
llvm-svn: 369118
For indirect call sites having a small set of possible callees,
!callees metadata can be used to indicate what those callees are.
This patch updates the call graph and lazy call graph analyses so
that they consider this metadata when encountering call sites. For
the call graph, it adds a new external call graph node to the graph
for each unique !callees metadata node. A call graph edge connects
an indirect call site with the external node associated with the
!callees metadata that is attached to it. And there is an edge from
this external node to each of the callees indicated by the metadata.
Similarly, for the lazy call graph, the patch adds Ref edges from a
caller to the possible callees indicated by the metadata.
The primary purpose of the patch is to facilitate iterating over the
functions in a module such that all of the callees indicated by a
given !callees metadata node will be visited prior to the functions
containing call sites annotated by that node. This property is
required by optimizations performing a bottom-up traversal of the
SCC DAG. For example, the inliner can be made to inline through an
indirect call. If the call site is annotated with !callees metadata,
this patch ensures that the inliner will have visited all of the
callees prior to the caller, allowing it to reliably compute the
cost of inlining one or more of the potential callees.
Original patch by @mssimpso. I've made some small changes to get it
to apply, build, and pass tests on the top of tree, as well as
some minor tweaks to formatting and functionality.
Subscribers: mehdi_amini, hiraditya, llvm-commits, mssimpso
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D39339
llvm-svn: 369025
Now that we're using widening legalization. We need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements.
Differential Revision: https://reviews.llvm.org/D65892
llvm-svn: 369022
Now that we legalize by widening, the element types here won't change. Previously these were modeled as the elements being widened and then the instruction might become an AND or SHL/ASHR pair. But now they'll become something like a ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG.
For AVX2, when the destination type is legal its clear the cost should be 1 since we have extend instructions that can produce 256 bit vectors from less than 128 bit vectors. I'm a little less sure about AVX1 costs, but I think the ones I changed were definitely too high, but they might still be too high.
Differential Revision: https://reviews.llvm.org/D66169
llvm-svn: 368858
The MVE architecture has the idea of "beats", where a vector instruction can be
executed over several ticks of the architecture. This adds a similar system
into the Arm backend cost model, multiplying the cost of all vector
instructions by a factor.
This factor essentially becomes the expected difference between scalar code
and vector code, on average. MVE Vector instructions can also overlap so the a
true cost of them is often lower. But equally scalar instructions can in some
situations be dual issued, or have other optimisations such as unrolling or
make use of dsp instructions. The default is chosen as 2. This should not
prevent vectorisation is a most cases (as the vector instructions will still be
doing at least 4 times the work), but it will help prevent over vectorising in
cases where the benefits are less likely.
This adds things so far to the obvious places in ARMTargetTransformInfo, and
updates a few related costs like not treating float instructions as cost 2 just
because they are floats.
Differential Revision: https://reviews.llvm.org/D66005
llvm-svn: 368733
This teaches the cost model that the sext or zext of a load is going to be
free.
Differential Revision: https://reviews.llvm.org/D66006
llvm-svn: 368593
A VDUP will perform a vector broadcast in a single instruction. Update the cost
model for MVE accordingly.
Code originally by David Sherwood.
Differential Revision: https://reviews.llvm.org/D63448
llvm-svn: 368589
This puts some of the calls in ARMTargetTransformInfo.cpp behind hasNeon()
checks, now that we have MVE, and updates all the tests accordingly.
Differential Revision: https://reviews.llvm.org/D63447
llvm-svn: 368587
This adds a number of cost model tests for ARM, useful for MVE. It also re-jigs
some of the existing tests to make them easier to update and read.
llvm-svn: 368586
Summary: Implement a new analysis to estimate the number of cache lines
required by a loop nest.
The analysis is largely based on the following paper:
Compiler Optimizations for Improving Data Locality
By: Steve Carr, Katherine S. McKinley, Chau-Wen Tseng
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf
The analysis considers temporal reuse (accesses to the same memory
location) and spatial reuse (accesses to memory locations within a cache
line). For simplicity the analysis considers memory accesses in the
innermost loop in a loop nest, and thus determines the number of cache
lines used when the loop L in loop nest LN is placed in the innermost
position.
The result of the analysis can be used to drive several transformations.
As an example, loop interchange could use it determine which loops in a
perfect loop nest should be interchanged to maximize cache reuse.
Similarly, loop distribution could be enhanced to take into
consideration cache reuse between arrays when distributing a loop to
eliminate vectorization inhibiting dependencies.
The general approach taken to estimate the number of cache lines used by
the memory references in the inner loop of a loop nest is:
Partition memory references that exhibit temporal or spatial reuse into
reference groups.
For each loop L in the a loop nest LN: a. Compute the cost of the
reference group b. Compute the 'cache cost' of the loop nest by summing
up the reference groups costs
For further details of the algorithm please refer to the paper.
Authored By: etiotto
Reviewers: hfinkel, Meinersbur, jdoerfert, kbarton, bmahjour, anemet,
fhahn
Reviewed By: Meinersbur
Subscribers: reames, nemanjai, MaskRay, wuzish, Hahnfeld, xusx595,
venkataramanan.kumar.llvm, greened, dmgreen, steleman, fhahn, xblvaOO,
Whitney, mgorny, hiraditya, mgrang, jsji, llvm-commits
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D63459
llvm-svn: 368439
The assert that caused this to be reverted should be fixed now.
Original commit message:
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 368183
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 367901
Summary:
Verify that the incoming defs into phis are the last defs from the
respective incoming blocks.
When moving blocks, insertDef must RenameUses. Adding this verification
makes GVNHoist tests fail that uncovered this issue.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63147
llvm-svn: 367451
Summary:
LoopRotate may simplify instructions, leading to the new instructions not having memory accesses created for them.
Allow this behavior, by allowing the new access to be null when the template is null, and looking upwards for the proper defined access when dealing with simplified instructions.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65338
llvm-svn: 367352
Summary:
In D62801, new function attribute `willreturn` was introduced. In short, a function with `willreturn` is guaranteed to come back to the call site(more precise definition is in LangRef).
In this patch, willreturn is annotated for LLVM intrinsics.
Reviewers: jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, nhaehnle, sstefan1, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64904
llvm-svn: 367184
Summary:
Allow IntToPtrInst to carry !dereferenceable metadata tag.
This is valid since !dereferenceable can be only be applied to
pointer type values.
Change-Id: If8a6e3c616f073d51eaff52ab74535c29ed497b4
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64954
llvm-svn: 366826
It is not safe in general to replace an alias in a GEP with its aliasee
if the alias can be replaced with another definition (i.e. via strong/weak
resolution (linkonce_odr) or via symbol interposition (default visibility
in ELF)) while the aliasee cannot. An example of how this can go wrong is
in the included test case.
I was concerned that this might be a load-bearing misoptimization (it's
possible for us to use aliases to share vtables between base and derived
classes, and on Windows, vtable symbols will always be aliases in RTTI
mode, so this change could theoretically inhibit trivial devirtualization
in some cases), so I built Chromium for Linux and Windows with and without
this change. The file sizes of the resulting binaries were identical, so it
doesn't look like this is going to be a problem.
Differential Revision: https://reviews.llvm.org/D65118
llvm-svn: 366754
Implement IR intrinsics for stack tagging. Generated code is very
unoptimized for now.
Two special intrinsics, llvm.aarch64.irg.sp and llvm.aarch64.tagp are
used to implement a tagged stack frame pointer in a virtual register.
Differential Revision: https://reviews.llvm.org/D64172
llvm-svn: 366360
This patch series adds support for the next-generation arch13
CPU architecture to the SystemZ backend.
This includes:
- Basic support for the new processor and its features.
- Assembler/disassembler support for new instructions.
- CodeGen for new instructions, including new LLVM intrinsics.
- Scheduler description for the new processor.
- Detection of arch13 as host processor.
Note: No currently available Z system supports the arch13
architecture. Once new systems become available, the
official system name will be added as supported -march name.
llvm-svn: 365932
Summary:
The map kept in loop rotate is used for instruction remapping, in order
to simplify the clones of instructions. Thus, if an instruction can be
simplified, its simplified value is placed in the map, even when the
clone is added to the IR. MemorySSA in contrast needs to know about that
clone, so it can add an access for it.
To resolve this: keep a different map for MemorySSA.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63680
llvm-svn: 365672
This patch adds a function attribute, nofree, to indicate that a function does
not, directly or indirectly, call a memory-deallocation function (e.g., free,
C++'s operator delete).
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D49165
llvm-svn: 365336
This reverts commit r365260 which broke the following tests:
Clang :: CodeGenCXX/cfi-mfcall.cpp
Clang :: CodeGenObjC/ubsan-nullability.m
LLVM :: Transforms/LoopVectorize/AArch64/pr36032.ll
llvm-svn: 365284
Without this, we have the unfortunate property that tests are dependent on the order of operads passed the CreateOr and CreateAnd functions. In actual usage, we'd promptly optimize them away, but it made tests slightly more verbose than they should have been.
llvm-svn: 365260
The previous output was next to useless if *any* exit was not computable. If we have more than one exit, show the exit count for each so that it's easier to see what's going from with SCEV analysis when debugging.
llvm-svn: 364579
This patch generalizes the UnrollLoop utility to support loops that exit
from the header instead of the latch. Usually, LoopRotate would take care
of must of those cases, but in some cases (e.g. -Oz), LoopRotate does
not kick in.
Codesize impact looks relatively neutral on ARM64 with -Oz + LTO.
Program master patch diff
External/S.../CFP2006/447.dealII/447.dealII 629060.00 627676.00 -0.2%
External/SPEC/CINT2000/176.gcc/176.gcc 1245916.00 1244932.00 -0.1%
MultiSourc...Prolangs-C/simulator/simulator 86100.00 86156.00 0.1%
MultiSourc...arks/Rodinia/backprop/backprop 66212.00 66252.00 0.1%
MultiSourc...chmarks/Prolangs-C++/life/life 67276.00 67312.00 0.1%
MultiSourc...s/Prolangs-C/compiler/compiler 69824.00 69788.00 -0.1%
MultiSourc...Prolangs-C/assembler/assembler 86672.00 86696.00 0.0%
Reviewers: efriedma, vsk, paquette
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D61962
llvm-svn: 364398
Summary:
The getClobberingMemoryAccess API checks for clobbering accesses in a loop by walking the backedge. This may check if a memory access is being
clobbered by the loop in a previous iteration, depending how smart AA got over the course of the updates in MemorySSA (it does not occur when built from scratch).
If no clobbering access is found inside the loop, it will optimize to an access outside the loop. This however does not mean that access is safe to sink.
Given:
```
for i
load a[i]
store a[i]
```
The access corresponding to the load can be optimized to outside the loop, and the load can be hoisted. But it is incorrect to sink it.
In order to sink the load, we'd need to check no Def clobbers the Use in the same iteration. With this patch we currently restrict sinking to either
Defs not existing in the loop, or Defs preceding the load in the same block. An easy extension is to ensure the load (Use) post-dominates all Defs.
Caught by PR42294.
This issue also shed light on the converse problem: hoisting stores in this same scenario would be illegal. With this patch we restrict
hoisting of stores to the case when their corresponding Defs are dominating all Uses in the loop.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63582
llvm-svn: 363982
Summary:
This is unfortunately needed for correctness, if we are to extend the tolerance of the update API to the way simple loop unswitch is doing cloning.
In simple loop unswitch (as opposed to loop unswitch), not all blocks are cloned. This can create unreachable cloned blocks (no predecessor), which are later cleaned up.
In MemorySSA, the APIs for supporting these kind of updates (clone + update exit blocks), make certain assumption on the integrity of the CFG. When cloning, if something was not cloned, it's values in MemorySSA default to LiveOnEntry. When updating exit blocks, it is safe to assume that we can first insert phis in the blocks merging two clones, then add additional phis in the IDF of the blocks that received phis. This no longer holds true if one of the clones being merged comes from an unreachable block. We'd conservatively need to add all phis before filling in their incoming definitions. In practice this restriction can be relaxed if we clean up trivial phis after the first round of insertion.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63354
llvm-svn: 363880
Summary:
This patch teaches ConstantFolding to constant fold
both scalar and vector variants of llvm.smul.fix and
llvm.smul.fix.sat.
As described in the LangRef rounding is unspecified for
these instrinsics. If the result cannot be represented
exactly the default behavior in ConstantFolding is to
round down towards negative infinity. If a target has a
preferred rounding that is different some kind of target
hook would be needed (same strategy as used by the
SelectionDAG legalizer).
Reviewers: nikic, leonardchan, RKSimon
Reviewed By: leonardchan
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63385
llvm-svn: 363811
Summary:
LoopRotate doesn't create a faithful clone of an instruction, it may
simplify it beforehand. Hence the clone of an instruction that has a
MemoryDef associated may not be a definition, but a use or not a memory
alternig instruction.
Don't rely on the template when the clone may be simplified.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63355
llvm-svn: 363597
Summary:
Add all MemoryPhis in IDF before filling in their incomign values.
Otherwise, a new Phi can be added that needs to become the incoming
value of another Phi.
Test fails the verification in verifyPrevDefInPhis.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63353
llvm-svn: 363590
Based on D59959, this switches SCEV to use unsigned/signed range
intersection based on the sign hint. This will prefer non-wrapping
ranges in the relevant domain. I've left the one intersection in
getRangeForAffineAR() to use the smallest intersection heuristic,
as there doesn't seem to be any obvious preference there.
Differential Revision: https://reviews.llvm.org/D60035
llvm-svn: 363490
This patch uses the mechanism from D62995 to strengthen the
definitions of the reduction intrinsics by letting the scalar
result/accumulator type be overloaded from the vector element type.
For example:
; The LLVM LangRef specifies that the scalar result must equal the
; vector element type, but this is not checked/enforced by LLVM.
declare i32 @llvm.experimental.vector.reduce.or.i32.v4i32(<4 x i32> %a)
This patch changes that into:
declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %a)
Which has the type-constraint more explicit and causes LLVM to check
the result type with the vector element type.
Reviewers: RKSimon, arsenm, rnk, greened, aemerson
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D62996
llvm-svn: 363240
This case is slightly tricky, because loop distribution should be
allowed in some cases, and not others. As long as runtime dependency
checks don't need to be introduced, this should be OK. This is further
complicated by the fact that LoopDistribute partially ignores if LAA
says that vectorization is safe, and then does its own runtime pointer
legality checks.
Note this pass still does not handle noduplicate correctly, as this
should always be forbidden with it. I'm not going to bother trying to
fix it, as it would require more effort and I think noduplicate should
be removed.
https://reviews.llvm.org/D62607
llvm-svn: 363160
Summary: After applying a set of insert updates, there may be trivial Phis left over. Clean them up.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63033
llvm-svn: 363094
Summary: Dependence Analysis performs static checks to confirm validity
of delinearization. These checks often fail for 64-bit targets due to
type conversions and integer wrapping that prevent simplification of the
SCEV expressions. These checks would also fail at compile-time if the
lower bound of the loops are compile-time unknown.
Author: bmahjour
Reviewer: Meinersbur, jdoerfert, kbarton, dmgreen, fhahn
Reviewed By: Meinersbur, jdoerfert, dmgreen
Subscribers: fhahn, hiraditya, javed.absar, llvm-commits, Whitney,
etiotto
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D62610
llvm-svn: 362952
Types such as float and i64's do not have legal loads in Thumb1, but will still
be loaded with a LDR (or potentially multiple LDR's). As such we can treat the
cost of addressing mode calculations the same as an i32 and get some optimisation
benefits.
Differential Revision: https://reviews.llvm.org/D62968
llvm-svn: 362874
Now with MVE being added, we can add the vector addressing mode costs for it.
These are generally imm7 multiplied by the size of the type being loaded /
stored.
Differential Revision: https://reviews.llvm.org/D62967
llvm-svn: 362873
The fp16 version of VLDR takes a imm8 multiplied by 2. This updates the costs
to account for those, and adds extra testing. It is dependant upon hasFPRegs16
as this is what the load/store instructions require.
Differential Revision: https://reviews.llvm.org/D62966
llvm-svn: 362872
For some reason multiple places need to do this, and the variant the
loop unroller and inliner use was not handling it.
Also, introduce a new wrapper to be slightly more precise, since on
AMDGPU some addrspacecasts are free, but not no-ops.
llvm-svn: 362436
Summary:
This reuses the getArithmeticInstrCost, but passes dummy values of the second
operand flags.
The X86 costs are wrong and can be improved in a follow up. I just wanted to
stop it from reporting an unknown cost first.
Reviewers: RKSimon, spatel, andrew.w.kaylor, cameron.mcinally
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62444
llvm-svn: 361788
Summary:
for.outer:
br for.inner
for.inner:
LI <loop invariant load instruction>
for.inner.latch:
br for.inner, for.outer.latch
for.outer.latch:
br for.outer, for.outer.exit
LI is a loop invariant load instruction that post dominate for.outer, so LI should be able to move out of the loop nest. However, there is a bug in allLoopPathsLeadToBlock().
Current algorithm of allLoopPathsLeadToBlock()
1. get all the transitive predecessors of the basic block LI belongs to (for.inner) ==> for.outer, for.inner.latch
2. if any successors of any of the predecessors are not for.inner or for.inner's predecessors, then return false
3. return true
Although for.inner.latch is for.inner's predecessor, but for.inner dominates for.inner.latch, which means if for.inner.latch is ever executed, for.inner should be as well. It should not return false for cases like this.
Author: Whitney (committed by xingxue)
Reviewers: kbarton, jdoerfert, Meinersbur, hfinkel, fhahn
Reviewed By: jdoerfert
Subscribers: hiraditya, jsji, llvm-commits, etiotto, bmahjour
Tags: #LLVM
Differential Revision: https://reviews.llvm.org/D62418
llvm-svn: 361762
getUserCost() currently returns TCC_Free for any extend of a compare (i1)
result. It seems this is only true in a limited number of cases where for
example two compares are chained. Even in those types of cases it seems
unlikely that they are generally free, while they may be in some cases.
This patch therefore removes this special handling of cast of i1. No tests
are failing because of this.
If some target want the old behavior, it could override getUserCost().
Review: Hal Finkel, Chandler Carruth, Evgeny Astigeevich, Simon Pilgrim,
Ulrich Weigand
https://reviews.llvm.org/D54742/new/
llvm-svn: 360970
LoopSimplify can preserve MemorySSA after r360270.
But the MemorySSA analysis is retrieved and preserved only when the
EnableMSSALoopDependency is set to true. Use the same conditional to
mark the pass as preserved, otherwise subsequent passes will get an
invalid analysis.
Resolves PR41853.
llvm-svn: 360697
The original costs stopped at SSE42, I've added conservative estimates for everything down to SSE1/SSE2 and moved some of the SSE42 costs to SSE41 (really only the addition of PCMPGT makes any difference).
I've also added missing vXi8 costs (we use PHMINPOSUW for i8/i16 for scarily quick results) and 256-bit vector costs for AVX1.
llvm-svn: 360528
Summary:
Currently we express umin as `~umax(~x, ~y)`. However, this becomes
a problem for operands in non-integral pointer spaces, because `~x`
is not something we can compute for `x` non-integral. However, since
comparisons are generally still allowed, we are actually able to
express `umin(x, y)` directly as long as we don't try to express is
as a umax. Support this by adding an explicit umin/smin representation
to SCEV. We do this by factoring the existing getUMax/getSMax functions
into a new function that does all four. The previous two functions were
largely identical.
Reviewed By: sanjoy
Differential Revision: https://reviews.llvm.org/D50167
llvm-svn: 360159
Summary:
Originally the insertDef method was only used when building MemorySSA, and was limiting the number of Phi nodes that it created.
Now it's used for updates as well, and it can create additional Phis needed for correctness.
Make sure no Phis are created in unreachable blocks (condition met during MSSA build), otherwise the renamePass will find a null DTNode.
Resolves PR41640.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61410
llvm-svn: 359845
Summary:
MemorySSA keeps internal pointers of AA and DT.
If these get invalidated, so should MemorySSA.
Reviewers: george.burgess.iv, chandlerc
Subscribers: jlebar, Prazek, llvm-commits
Tags: LLVM
Differential Revision: https://reviews.llvm.org/D61043
llvm-svn: 359627
Summary:
This is a redo of D60914.
The objective is to not invalidate AAManager, which is stateless, unless
there is an explicit invalidate in one of the AAResults.
To achieve this, this patch adds an API to PAC, to check precisely this:
is this analysis not invalidated explicitly == is this analysis not abandoned == is this analysis stateless, so preserved without explicitly being marked as preserved by everyone
Reviewers: chandlerc
Subscribers: mehdi_amini, jlebar, george.burgess.iv, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61284
llvm-svn: 359622
Summary:
MemorySSA keeps internal pointers of AA and DT.
If these get invalidated, so should MemorySSA.
Reviewers: george.burgess.iv, chandlerc
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61043
........
This was causing windows build bot failures
llvm-svn: 359555
This implements TargetTransformInfo method getMemcpyCost, which estimates the
number of instructions to which a memcpy instruction expands to.
Differential Revision: https://reviews.llvm.org/D59787
llvm-svn: 359547
Summary:
MemorySSA keeps internal pointers of AA and DT.
If these get invalidated, so should MemorySSA.
Reviewers: george.burgess.iv, chandlerc
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61043
llvm-svn: 359519
The PPC vector cost model values for insert/extract element reflect older
processors that lacked vector insert/extract and move-to/move-from VSR
instructions. Update getVectorInstrCost to give appropriate values for when
the newer instructions are present.
Differential Revision: https://reviews.llvm.org/D60160
llvm-svn: 359313
Summary:
If two arguments are both readonly, then they have no memory dependency
that would violate noalias, even if they do actually overlap.
Reviewers: hfinkel, efriedma
Reviewed By: efriedma
Subscribers: efriedma, hiraditya, llvm-commits, tstellar
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60239
llvm-svn: 359047
This does two main things, firstly adding some at least basic addressing modes
for i64 types, and secondly treats floats and doubles sensibly when there is no
fpu. The floating point change can help codesize in some cases, especially with
D60294.
Most backends seems to not consider the exact VT in isLegalAddressingMode,
instead switching on type size. That is now what this does when the target does
not have an fpu (as the float data will be loaded using LDR's). i64's currently
use the address range of an LDRD (even though they may be legalised and loaded
with an LDR). This is at least better than marking them all as illegal
addressing modes.
I have not attempted to do much with vectors yet. That will need changing once
MVE is added.
Differential Revision: https://reviews.llvm.org/D60677
llvm-svn: 358845
Summary:
The immediate post dominator of the loop header may be part of the divergent loop.
Since this /was/ the divergence propagation bound the SDA would not detect joins of divergent paths outside the loop.
Reviewers: nhaehnle
Reviewed By: nhaehnle
Subscribers: mmasten, arsenm, jvesely, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59042
llvm-svn: 358681
On pre-AVX512 targets we can use MOVMSK to extract reduced boolean results. This is properly optimized, annoyingly AVX512 isn't and produces code that is almost as bad as the (unchanged) costs suggest......
Differential Revision: https://reviews.llvm.org/D60403
llvm-svn: 358574
Summary:
When inserting a new Def, MemorySSA may be have non-minimal number of Phis.
While inserting, the walk to find the previous definition may cleanup minimal Phis.
When the last definition is trivial to obtain, we do not cache it.
It is possible while getting the previous definition for a Def to get two different answers:
- one that was straight-forward to find when walking the first path (a trivial phi in this case), and
- another that follows a cleanup of the trivial phi, it determines it may need additional Phi nodes, it inserts them and returns a new phi in the same position as the former trivial one.
While the Phis added for the second path are all redundant, they are not complete (the walk is only done upwards), and they are not properly cleaned up afterwards.
A way to fix this problem is to cache the straight-forward answer we got on the first walk.
The caching is only kept for the duration of a getPreviousDef call, and for Phis we use TrackingVH, so removing the trivial phi will lead to replacing it with the next dominating phi in the cache.
Resolves PR40749.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60634
llvm-svn: 358313
Summary:
After introducing the limit for clobber walking, `walkToPhiOrClobber` would assert that the limit is at least 1 on entry.
The test included triggered that assert.
The callsite in `tryOptimizePhi` making the calls to `walkToPhiOrClobber` is structured like this:
```
while (true) {
if (getBlockingAccess()) { // calls walkToPhiOrClobber
}
for (...) {
walkToPhiOrClobber();
}
}
```
The cleanest fix is to check if the limit was reached inside `walkToPhiOrClobber`, and give an allowence of 1.
This approach not make any alias() calls (no calls to instructionClobbersQuery), so the performance condition is enforced.
The limit is set back to 0 if not used, as this provides info on the fact that we stopped before reaching a true clobber.
Reviewers: george.burgess.iv
Subscribers: jlebar, Prazek, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60479
llvm-svn: 358303
Current LCG doesn't check aliased functions. So if an internal function has a public alias it will not be added to CG SCC, but it is still reachable from outside through the alias.
So this patch adds aliased functions to SCC.
Differential Revision: https://reviews.llvm.org/D59898
llvm-svn: 357795
Summary:
This lets us avoid e.g. checking if A >=s B in getSMaxExpr(A, B) if we've
already established that (A smax B) is the best we can do.
Fixes PR41225.
Reviewers: asbirlea
Subscribers: mcrosier, jlebar, bixia, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D60010
llvm-svn: 357320
Just as as llvm IR supports explicitly specifying numeric value ids
for instructions, and emits them by default in textual output, now do
the same for blocks.
This is a slightly incompatible change in the textual IR format.
Previously, llvm would parse numeric labels as string names. E.g.
define void @f() {
br label %"55"
55:
ret void
}
defined a label *named* "55", even without needing to be quoted, while
the reference required quoting. Now, if you intend a block label which
looks like a value number to be a name, you must quote it in the
definition too (e.g. `"55":`).
Previously, llvm would print nameless blocks only as a comment, and
would omit it if there was no predecessor. This could cause confusion
for readers of the IR, just as unnamed instructions did prior to the
addition of "%5 = " syntax, back in 2008 (PR2480).
Now, it will always print a label for an unnamed block, with the
exception of the entry block. (IMO it may be better to print it for
the entry-block as well. However, that requires updating many more
tests.)
Thus, the following is supported, and is the canonical printing:
define i32 @f(i32, i32) {
%3 = add i32 %0, %1
br label %4
4:
ret i32 %3
}
New test cases covering this behavior are added, and other tests
updated as required.
Differential Revision: https://reviews.llvm.org/D58548
llvm-svn: 356789
This adds new function getMemcpyCost to TTI so that the cost of a memcpy can be
modeled and queried. The default implementation returns Expensive, but targets
can override this function to model the cost more accurately.
Differential Revision: https://reviews.llvm.org/D59252
llvm-svn: 356555
There are a few different issues, mostly stemming from using
generation based checks for anything instead of subtarget
features. Stop adding flat-address-space as a feature for HSA, as it
should only be a device property. This was incorrectly allowing flat
instructions to select for SI.
Increase the default generation for HSA to avoid the encoding error
when emitting objects. This has some other side effects from various
checks which probably should be separate subtarget features (in the
cost model and for dealing with the DS offset folding issue).
Partial fix for bug 41070. It should probably be an error to try using
amdhsa without flat support.
llvm-svn: 356347
AMDGPU would like to have MVTs for v3i32, v3f32, v5i32, v5f32. This
commit does not add them, but makes preparatory changes:
* Fixed assumptions of power-of-2 vector type in kernel arg handling,
and added v5 kernel arg tests and v3/v5 shader arg tests.
* Added v5 tests for cost analysis.
* Added vec3/vec5 arg test cases.
Some of this patch is from Matt Arsenault, also of AMD.
Differential Revision: https://reviews.llvm.org/D58928
Change-Id: I7279d6b4841464d2080eb255ef3c589e268eabcd
llvm-svn: 356342
Summary:
This fixes an extremely long compile time caused by recursive analysis
of truncs, which were not previously subject to any depth limits unlike
some of the other ops. I decided to use the same control used for
sext/zext, since the routines analyzing these are sometimes mutually
recursive with the trunc analysis.
Reviewers: mkazantsev, sanjoy
Subscribers: sanjoy, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58994
llvm-svn: 355949
Summary:
This adds support for 64 bit buffer atomic arithmetic instructions but does not include
cmpswap as that depends on a fix to the way the register pairs are handled
Change-Id: Ib207ea65fb69487ccad5066ea647ae8ddfe2ce61
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58918
llvm-svn: 355520
In some cases, MaxBECount can be less precise than ExactBECount for AND
and OR (the AND case was PR26207). In the OR test case, both ExactBECounts are
undef, but MaxBECount are different, so we hit the assertion below. This
patch uses the same solution the AND case already uses.
Assertion failed:
((isa<SCEVCouldNotCompute>(ExactNotTaken) || !isa<SCEVCouldNotCompute>(MaxNotTaken))
&& "Exact is not allowed to be less precise than Max"), function ExitLimit
This patch also consolidates test cases for both AND and OR in a single
test case.
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13245
Reviewers: sanjoy, efriedma, mkazantsev
Reviewed By: sanjoy
Differential Revision: https://reviews.llvm.org/D58853
llvm-svn: 355259
Summary:
The original assumption for the insertDef method was that it would not
materialize Defs out of no-where, hence it will not insert phis needed
after inserting a Def.
However, when cloning an instruction (use case used in LICM), we do
materialize Defs "out of no-where". If the block receiving a Def has at
least one other Def, then no processing is needed. If the block just
received its first Def, we must check where Phi placement is needed.
The only new usage of insertDef is in LICM, hence the trigger for the bug.
But the original goal of the method also fails to apply for the move()
method. If we move a Def from the entry point of a diamond to either the
left or right blocks, then the merge block must add a phi.
While this usecase does not currently occur, or may be viewed as an
incorrect transformation, MSSA must behave corectly given the scenario.
Resolves PR40749 and PR40754.
Reviewers: george.burgess.iv
Subscribers: sanjoy, jlebar, Prazek, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58652
llvm-svn: 355040
Based on an IR equivalent of target lowering's generic expansion - target specific costs will typically be lower (IR doesn't have a good mull/mulh equivalent) but we need a baseline.
Differential Revision: https://reviews.llvm.org/D57925
llvm-svn: 354774
The correct edge being deleted is not to the unswitched exit block, but to the
original block before it was split. That's the key in the map, not the
value.
The insert is correct. The new edge is to the .split block.
The splitting turns OriginalBB into:
OriginalBB -> OriginalBB.split.
Assuming the orignal CFG edge: ParentBB->OriginalBB, we must now delete
ParentBB->OriginalBB, not ParentBB->OriginalBB.split.
llvm-svn: 354656
Constant hoisting may have hidden a constant behind a bitcast so that
it isn't folded into its users. However, this prevents BPI from
calculating some of its heuristics that are based upon constant
values. So, I've added a simple helper function to look through these
casts.
Differential Revision: https://reviews.llvm.org/D58166
llvm-svn: 354119
Summary:
This verification may fail after certain transformations due to
BasicAA's fragility. Added a small explanation and a testcase that
triggers the assert in checkClobberSanity (before its removal).
Addresses PR40509.
Reviewers: george.burgess.iv
Subscribers: sanjoy, jlebar, llvm-commits, Prazek
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D57973
llvm-svn: 353739
Currently, SCEV creates SCEVUnknown for every node of unreachable code. If we
have a huge amounts of such code, we will be littering SE with these nodes. We could
just state that they all are undef and save some memory.
Differential Revision: https://reviews.llvm.org/D57567
Reviewed By: sanjoy
llvm-svn: 353017
Summary:
The analysis result of DA caches pointers to AA, SCEV, and LI, but it
never checks for their invalidation. Fix that.
Reviewers: chandlerc, dmgreen, bogner
Reviewed By: dmgreen
Subscribers: hiraditya, bollu, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D56381
llvm-svn: 352986
Currently SCEV attempts to limit transformations so that they do not work with
big SCEVs (that may take almost infinite compile time). But for this, it uses heuristics
such as recursion depth and number of operands, which do not give us a guarantee
that we don't actually have big SCEVs. This situation is still possible, though it is not
likely to happen. However, the bug PR33494 showed a bunch of simple corner case
tests where we still produce huge SCEVs, even not reaching big recursion depth etc.
This patch introduces a concept of 'huge' SCEVs. A SCEV is huge if its expression
size (intoduced in D35989) exceeds some threshold value. We prohibit optimizing
transformations if any of SCEVs we are dealing with is huge. This gives us a reliable
check that we don't spend too much time working with them.
As the next step, we can possibly get rid of old limiting mechanisms, such as recursion
depth thresholds.
Differential Revision: https://reviews.llvm.org/D35990
Reviewed By: reames
llvm-svn: 352728
The code of AddRec simplification is using wrong loop when it creates a new
AddRecExpr. It should be using AddRecLoop which we have saved and against which
all gate checks are made, and not calling AddRec->getLoop() over and over
again because AddRec may change and become an AddRecurrency from outer loop
during the transform iterations.
Considering this change trivial, commiting for postcommit review.
llvm-svn: 352451
Followup to D56636, this time handling the UADDSAT case by expanding
uadd.sat(a, b) to umin(a, ~b) + b.
Differential Revision: https://reviews.llvm.org/D56869
llvm-svn: 352409
Add generic costs calculation for SADDSAT/SSUBSAT intrinsics, this uses generic costs for sadd_with_overflow/ssub_with_overflow, an extra sign comparison + a selects based on the sign/overflow.
This completes PR40316
Differential Revision: https://reviews.llvm.org/D57239
llvm-svn: 352315
For the power9 CPU, vector operations consume a pair of execution units rather
than one execution unit like a scalar operation. Update the target transform
cost functions to reflect the higher cost of vector operations when targeting
Power9.
Patch by RolandF.
Differential revision: https://reviews.llvm.org/D55461
llvm-svn: 352261
Add generic costs calculation for UADDSAT/USUBSAT intrinsics, this fallbacks to using generic costs for uadd_with_overflow/usub_with_overflow + a select.
Differential Revision: https://reviews.llvm.org/D56907
llvm-svn: 352044
This patch replaces the existing LLVMVectorSameWidth matcher with LLVMScalarOrSameVectorWidth.
The matching args must be either scalars or vectors with the same number of elements, but in either case the scalar/element type can differ, specified by LLVMScalarOrSameVectorWidth.
I've updated the _overflow intrinsics to demonstrate this - allowing it to return a i1 or <N x i1> overflow result, matching the scalar/vectorwidth of the other (add/sub/mul) result type.
The masked load/store/gather/scatter intrinsics have also been updated to use this, although as we specify the reference type to be llvm_anyvector_ty we guarantee the mask will be <N x i1> so no change in behaviour
Differential Revision: https://reviews.llvm.org/D57090
llvm-svn: 351957
First step towards PR40376, this patch adds support for getCmpSelInstrCost to use the (optional) Instruction CmpInst predicate to indicate the type of integer comparison we're performing and alter the costs accordingly.
Differential Revision: https://reviews.llvm.org/D57013
llvm-svn: 351810
Prior to SSE41 (and sometimes on AVX1), vector select has to be performed as a ((X & C)|(Y & ~C)) bit select.
Exposes a couple of issues with the min/max reduction costs (which only go down to SSE42 for some reason).
The increase pre-SSE41 selection costs also prevent a couple of tests from firing any longer, so I've either tweaked the target or added AVX tests as well to the existing SSE2 tests.
llvm-svn: 351685
This commit adds some missing intrinsics into the isAlwaysUniform list
for the AMDGPU backend.
Differential Revision: https://reviews.llvm.org/D56845
llvm-svn: 351562
compiler identification lines in test-cases.
(Doing so only because it's then easier to search for references which
are actually important and need fixing.)
llvm-svn: 351200
This fixes https://bugs.llvm.org/show_bug.cgi?id=40110.
This implements handling of undef operands for integer intrinsics in
ConstantFolding, in particular for the bitcounting intrinsics (ctpop,
cttz, ctlz), the with.overflow intrinsics, the saturating math
intrinsics and the funnel shift intrinsics.
The undef behavior follows what InstSimplify does for the general cas
e of non-constant operands. For the bitcount intrinsics (where
InstSimplify doesn't do undef handling -- there cannot be a combination
of an undef + non-constant operand) I'm using a 0 result if the intrinsic
is defined for zero and undef otherwise.
Differential Revision: https://reviews.llvm.org/D55950
llvm-svn: 350971
The new-pm version of DA is untested. Testing requires a printer, so
add that and use it in the existing DA tests.
Differential Revision: https://reviews.llvm.org/D56386
llvm-svn: 350624
Noticed in D56011 - handle the case that scalar fp ops are quicker on P3 than P4
Add the other costs so that we're not relying on the default "is legal/custom" cost logic.
llvm-svn: 350403
GetPointerBaseWithConstantOffset include this code, where ByteOffset
and GEPOffset are both of type llvm::APInt :
ByteOffset += GEPOffset.getSExtValue();
The problem with this line is that getSExtValue() returns an int64_t, but
the += matches an overload for uint64_t. The problem is that the resulting
APInt is no longer considered to be signed. That in turn causes assertion
failures later on if the relevant pointer type is > 64 bits in width and
the GEPOffset was negative.
Changing it to
ByteOffset += GEPOffset.sextOrTrunc(ByteOffset.getBitWidth());
resolves the issue and explicitly performs the sign-extending
or truncation. Additionally, instead of asserting later if the result
is > 64 bits, it breaks out of the loop in that case.
See also
https://reviews.llvm.org/D24729https://reviews.llvm.org/D24772
This commit must be merged after D38662 in order for the test to pass.
Patch by Michael Ferguson <mpfergu@gmail.com>.
Reviewers: reames, sanjoy, hfinkel
Reviewed By: hfinkel
Differential Revision: https://reviews.llvm.org/D38501
llvm-svn: 350395
Motivated by the discussion in D38499, this patch updates BasicAA to support
arbitrary pointer sizes by switching most remaining non-APInt calculations to
use APInt. The size of these APInts is set to the maximum pointer size (maximum
over all address spaces described by the data layout string).
Most of this translation is straightforward, but this patch contains a fix for
a bug that revealed itself during this translation process. In order for
test/Analysis/BasicAA/gep-and-alias.ll to pass, which is run with 32-bit
pointers, the intermediate calculations must be performed using 64-bit
integers. This is because, as noted in the patch, when GetLinearExpression
decomposes an expression into C1*V+C2, and we then multiply this by Scale, and
distribute, to get (C1*Scale)*V + C2*Scale, it can be the case that, even
through C1*V+C2 does not overflow for relevant values of V, (C2*Scale) can
overflow. If this happens, later logic will draw invalid conclusions from the
(base) offset value. Thus, when initially applying the APInt conversion,
because the maximum pointer size in this test is 32 bits, it started failing.
Suspicious, I created a 64-bit version of this test (included here), and that
failed (miscompiled) on trunk for a similar reason (the multiplication can
overflow).
After fixing this overflow bug, the first test case (at least) in
Analysis/BasicAA/q.bad.ll started failing. This is also a 32-bit test, and was
relying on having 64-bit intermediate values to have BasicAA return an accurate
result. In order to fix this problem, and because I believe that it is not
uncommon to use i64 indexing expressions in 32-bit code (especially portable
code using int64_t), it seems reasonable to always use at least 64-bit
integers. In this way, we won't regress our analysis capabilities (and there's
a command-line option added, so experimenting with this should be easy).
As pointed out by Eli during the review, there are other potential overflow
conditions that this patch does not address. Fixing those is left to follow-up
work.
Patch by me with contributions from Michael Ferguson (mferguson@cray.com).
Differential Revision: https://reviews.llvm.org/D38662
llvm-svn: 350220
The current llvm.mem.parallel_loop_access metadata has a problem in that
it uses LoopIDs. LoopID unfortunately is not loop identifier. It is
neither unique (there's even a regression test assigning the some LoopID
to multiple loops; can otherwise happen if passes such as LoopVersioning
make copies of entire loops) nor persistent (every time a property is
removed/added from a LoopID's MDNode, it will also receive a new LoopID;
this happens e.g. when calling Loop::setLoopAlreadyUnrolled()).
Since most loop transformation passes change the loop attributes (even
if it just to mark that a loop should not be processed again as
llvm.loop.isvectorized does, for the versioned and unversioned loop),
the parallel access information is lost for any subsequent pass.
This patch unlinks LoopIDs and parallel accesses.
llvm.mem.parallel_loop_access metadata on instruction is replaced by
llvm.access.group metadata. llvm.access.group points to a distinct
MDNode with no operands (avoiding the problem to ever need to add/remove
operands), called "access group". Alternatively, it can point to a list
of access groups. The LoopID then has an attribute
llvm.loop.parallel_accesses with all the access groups that are parallel
(no dependencies carries by this loop).
This intentionally avoid any kind of "ID". Loops that are clones/have
their attributes modifies retain the llvm.loop.parallel_accesses
attribute. Access instructions that a cloned point to the same access
group. It is not necessary for each access to have it's own "ID" MDNode,
but those memory access instructions with the same behavior can be
grouped together.
The behavior of llvm.mem.parallel_loop_access is not changed by this
patch, but should be considered deprecated.
Differential Revision: https://reviews.llvm.org/D52116
llvm-svn: 349725
This is split from D55452 with the correct patch this time.
Pairwise reductions require two shuffles on every level but the last. On the last level the two shuffles are <1, u, u, u...> and <0, u, u, u...>, but <0, u, u, u...> will be dropped by InstCombine/DAGCombine as being an identity shuffle.
Differential Revision: https://reviews.llvm.org/D55615
llvm-svn: 349072
Attribute 'dso_local' generated in bitcode from compiling
original C file but isn't needed.
Differential Revision: https://reviews.llvm.org/D55521
llvm-svn: 348835
Summary: The comment says we need 3 extracts and a select at the end. But didn't we just account for the select in the vector cost above. Aren't we just extracting the single element after taking the min/max in the vector register?
Reviewers: RKSimon, spatel, ABataev
Reviewed By: RKSimon
Subscribers: javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D55480
llvm-svn: 348739
We were overcounting the number of arithmetic operations needed at each level before we reach a legal type. We were using the full vector type for that level, but we are going to split the input vector at that level in half. So the effective arithmetic operation cost at that level is half the width.
So for example on 8i32 on an sse target. Were were calculating the cost of an 8i32 op which is likely 2 for basic integer. Then after the loop we count 2 more v4i32 ops. For a total arith cost of 4. But if you look at the assembly there would only be 3 arithmetic ops.
There are still more bugs in this code that I'm going to work on next. The non pairwise code shouldn't count extract subvectors in the loop. There are no extracts, the types are split in registers. For pairwise we need to use 2 two src permute shuffles.
Differential Revision: https://reviews.llvm.org/D55397
llvm-svn: 348621
DemandedBits and BDCE currently only support scalar integers. This
patch extends them to also handle vector integer operations. In this
case bits are not tracked for individual vector elements, instead a
bit is demanded if it is demanded for any of the elements. This matches
the behavior of computeKnownBits in ValueTracking and
SimplifyDemandedBits in InstCombine.
Unlike the previous iteration of this patch, getDemandedBits() can now
again be called on arbirary (sized) instructions, even if they don't
have integer or vector of integer type. (For vector types the size of the
returned mask will now be the scalar size in bits though.)
The added LoopVectorize test case shows a case which triggered an
assertion failure with the previous attempt, because getDemandedBits()
was called on a pointer-typed instruction.
Differential Revision: https://reviews.llvm.org/D55297
llvm-svn: 348602
In some cases different alignments for function might be used to save
space e.g. thumb mode with -Oz will try to use 2 byte function
alignment. Similar patch that fixed this in other areas exists here
https://reviews.llvm.org/D46110
This was approved previously https://reviews.llvm.org/D55115 (r348215)
but when committed it caused failures on the sanitizer buildbots when
building llvm with clang (containing this patch). This is now fixed
because I've added a check to see if getting the parent module returns
null if it does then set the alignment to 0.
Differential Revision: https://reviews.llvm.org/D55115
llvm-svn: 348571
DemandedBits and BDCE currently only support scalar integers. This
patch extends them to also handle vector integer operations. In this
case bits are not tracked for individual vector elements, instead a
bit is demanded if it is demanded for any of the elements. This matches
the behavior of computeKnownBits in ValueTracking and
SimplifyDemandedBits in InstCombine.
The getDemandedBits() method can now only be called on instructions that
have integer or vector of integer type. Previously it could be called on
any sized instruction (even if it was not particularly useful). The size
of the return value is now always the scalar size in bits (while
previously it was the type size in bits).
Differential Revision: https://reviews.llvm.org/D55297
llvm-svn: 348549
In some cases different alignments for function might be used to save
space e.g. thumb mode with -Oz will try to use 2 byte function
alignment. Similar patch that fixed this in other areas exists here
https://reviews.llvm.org/D46110
Differential Revision: https://reviews.llvm.org/D55115
llvm-svn: 348215
A loaded value with multiple users compared with 0 will become a load and
test single instruction. The load is not folded in this case (multiple
users), but the compare instruction is eliminated.
This patch returns 0 cost for the icmp in these cases.
Review: Ulrich Weigand
https://reviews.llvm.org/D55111
llvm-svn: 348141
Fix ScalarEvolution/solve-quadratic.ll test to account for __func__
output listing the complete function prototype rather than just its
name, as it does on NetBSD.
Example Linux output:
GetQuadraticEquation: addrec coeff bw: 4
GetQuadraticEquation: equation -2x^2 + -2x + -4, coeff bw: 5, multiplied by 2
Example NetBSD output:
llvm::Optional<std::tuple<llvm::APInt, llvm::APInt, llvm::APInt, llvm::APInt, unsigned int> > GetQuadraticEquation(const llvm::SCEVAddRecExpr*): addrec coeff bw: 4
llvm::Optional<std::tuple<llvm::APInt, llvm::APInt, llvm::APInt, llvm::APInt, unsigned int> > GetQuadraticEquation(const llvm::SCEVAddRecExpr*): equation -2x^2 + -2x + -4, coeff bw: 5, multiplied by 2
Differential Revision: https://reviews.llvm.org/D55162
llvm-svn: 348096
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.
For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
Fixes PR37731
Rebased and reapplied after being reverted in rL347541 due to PR39774 - which was fixed by D54955/rL347759 and D55017/rL347997
Differential Revision: https://reviews.llvm.org/D54585
llvm-svn: 348076
Summary:
This is patch #3 of the new DivergenceAnalysis
<https://lists.llvm.org/pipermail/llvm-dev/2018-May/123606.html>
The GPUDivergenceAnalysis is intended to eventually supersede the existing
LegacyDivergenceAnalysis. The existing LegacyDivergenceAnalysis produces
incorrect results on unstructured Control-Flow Graphs:
<https://bugs.llvm.org/show_bug.cgi?id=37185>
This patch adds the option -use-gpu-divergence-analysis to the
LegacyDivergenceAnalysis to turn it into a transparent wrapper for the
GPUDivergenceAnalysis.
Reviewers: nhaehnle
Reviewed By: nhaehnle
Subscribers: jholewinski, jvesely, jfb, llvm-commits, alex-t, sameerds, arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D53493
llvm-svn: 348048
Three minor changes to these extra costs:
* For ICmp instructions, instead of adding 2 all the time for extending each
operand, this is only done if that operand is neither a load or an
immediate.
* The operands extension costs for divides removed, because we now use a high
cost already for the divide (20).
* The costs for lhsr/ashr extra costs removed as this did not seem useful.
Review: Ulrich Weigand
https://reviews.llvm.org/D55053
llvm-svn: 347961
Unlike most cost model functions this code makes a lot of table lookups without using the results from getTypeLegalizationCost. This means 512-bit vectors can be looked up even when the type isn't legal.
This patch adds a check around the two tables that contain 512-bit types to make sure that neither of the types would be split by type legalization. Meaning 512 bit types are illegal. I wanted to write this in a somewhat generic way that uses type legalization query hooks. But if prefered, I can switch to just using is512BitVector and the subtarget feature.
Differential Revision: https://reviews.llvm.org/D54984
llvm-svn: 347786
This fixes some of scalarization costs reported for sext/zext using avx512bw. This does not fix all scalarization costs being reported. Just the worst.
I've restricted this only to combinations of types that are legal with avx512bw like v32i1/v64i1/v32i16/v64i8 and conversions between vXi1 and vXi8/vXi16 with legal vXi8/vXi16 result types.
Differential Revision: https://reviews.llvm.org/D54979
llvm-svn: 347785
Expansion of SIGN_EXTEND_INREG can create a VSRAI instruction. If there is already a VSRAI after it, we should combine them into a larger VSRAI
Differential Revision: https://reviews.llvm.org/D54959
llvm-svn: 347784
CGF/CLGF compares an i64 register with a sign/zero extended loaded i32 value
in memory.
This patch makes such a load considered foldable and so gets a 0 cost.
Review: Ulrich Weigand
https://reviews.llvm.org/D54944
llvm-svn: 347735
AH, SH and MH costs are already covered in the cases where LHS is 32 bits and
RHS is 16 bits of memory sign-extended to i32.
As these instructions are also used when LHS is i16, this patch recognizes
that the loads will get folded then as well.
Review: Ulrich Weigand
https://reviews.llvm.org/D54940
llvm-svn: 347734
Single instructions exist for i8 and i16 comparisons of memory against a
small immediate.
This patch makes sure that if the load in these cases has a single user (the
ICmp), it gets a 0 cost (folded), and also that the ICmp gets a cost of 1.
Review: Ulrich Weigand
https://reviews.llvm.org/D54897
llvm-svn: 347733
Since byte-swapping loads and stores are supported, a 'load -> bswap' or
'bswap -> store' sequence should have the cost of one.
Review: Ulrich Weigand
https://reviews.llvm.org/D54870
llvm-svn: 347732
The check lines marked AVX256 in the zext256/sext256 functions should be closer to the AVX values which would take into account a splitting cost.
llvm-svn: 347722
Our sext/zext cost modeling was somewhat incomplete. And had no coverage for the fact that avx512bw v32i16/v64i8 types return a scalarization cost.
Truncates are a whole different mess because isTruncateFree is returning true for vectors when it shouldn't and that's the fall back for anything not in the tables.
llvm-svn: 347719
Summary:
IPA is implemented as module pass which produce map from Function or Alias to
StackSafetyInfo for a single function.
From prototype by Evgenii Stepanov and Vlad Tsyrklevich.
Reviewers: eugenis, vlad.tsyrklevich, pcc, glider
Subscribers: hiraditya, mgrang, llvm-commits
Differential Revision: https://reviews.llvm.org/D54543
llvm-svn: 347611
Summary:
Analysis produces StackSafetyInfo which contains information with how allocas
and parameters were used in functions.
From prototype by Evgenii Stepanov and Vlad Tsyrklevich.
Reviewers: eugenis, vlad.tsyrklevich, pcc, glider
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D54504
llvm-svn: 347603
Add support for funnel shifts to the DemandedBits analysis. The
demanded bits of the first two operands can be determined if the
shift amount is constant. The demanded bits of the third operand
(shift amount) can be determined if the bitwidth is a power of two.
This is basically the same functionality as implemented in D54869
and D54478, but for DemandedBits rather than InstCombine.
Differential Revision: https://reviews.llvm.org/D54876
llvm-svn: 347561
This reverts commit r346970.
It was causing PR39774, a crash in slp-vectorizer on a rather simple loop
with just a bunch of 'and's in the body.
llvm-svn: 347541
Implement getIntrinsicInstrCost() and return costs reflecting that bswap can
be done with a vperm per vector register.
Review: Ulrich Weigand
https://reviews.llvm.org/D54789
llvm-svn: 347445
LVI was symbolically executing binary operators only when the RHS was
constant, missing the case where we have a ConstantRange for the RHS,
but not an actual constant. Tested using check-all and by
bootstrapping. Compile time is not impacted measurably.
Differential Revision: https://reviews.llvm.org/D19859
llvm-svn: 347379
Support saturating add/sub in constant folding, based on the APInt methods introduced in D54332.
Patch by: @nikic (Nikita Popov)
Differential Revision: https://reviews.llvm.org/D54531
llvm-svn: 347328
Summary:
Currently, when vectorizing stores to uniform addresses, the only
instance we prevent vectorization is if there are multiple stores to the
same uniform address causing an unsafe dependency.
This patch teaches LAA to avoid vectorizing loops that have an unsafe
cross-iteration dependency between a load and a store to the same uniform address.
Fixes PR39653.
Reviewers: Ayal, efriedma
Subscribers: rkruppe, llvm-commits
Differential Revision: https://reviews.llvm.org/D54538
llvm-svn: 347220
We were adding the entire scalarization extraction cost for reductions, which returns the total cost of extracting every element of a vector type.
For reductions we don't need to do this - we just need to extract the 0'th element after the reduction pattern has completed.
Fixes PR37731
Differential Revision: https://reviews.llvm.org/D54585
llvm-svn: 346970
Add support for the expansion of funnelshift/rotates to getIntrinsicInstrCost.
This also required us to move the X86 fshl/fshr costs to the same place as the rotates to avoid expansion and get correct scalarization vs vectorization costs.
llvm-svn: 346854
When we repeat the 2 shifting operands then this is a bit rotation - annoyingly this has to be done in the other getIntrinsicInstrCost than most intrinsics as we need to check the operands are the same.
llvm-svn: 346688
Improve getCastInstrCost() by respecting the different types of Src and Dst
for vector integer <-> fp conversions.
This means that extracting from integer becomes more expensive (by the
extraction penalty), and the extraction from fp becomes cheaper (no longer
has a false extraction penalty).
Review: Ulrich Weigand
https://reviews.llvm.org/D54423
llvm-svn: 346663
Instead of defaulting to a cost = 1, expand to element extract/insert like we do for other shuffles.
This exposes an issue in LoopVectorize which could call SK_ExtractSubvector with a scalar subvector type.
llvm-svn: 346656
The patch has been reverted because it ended up prohibiting propagation
of a constant to exit value. For such values, we should skip all checks
related to hard uses because propagating a constant is always profitable.
Differential Revision: https://reviews.llvm.org/D53691
llvm-svn: 346397
This reverts commit 2f425e9c7946b9d74e64ebbfa33c1caa36914402.
It seems that the check that we still should do the transform if we
know the result is constant is missing in this code. So the logic that
has been deleted by this change is still sometimes accidentally useful.
I revert the change to see what can be done about it. The motivating
case is the following:
@Y = global [400 x i16] zeroinitializer, align 1
define i16 @foo() {
entry:
br label %for.body
for.body: ; preds = %entry, %for.body
%i = phi i16 [ 0, %entry ], [ %inc, %for.body ]
%arrayidx = getelementptr inbounds [400 x i16], [400 x i16]* @Y, i16 0, i16 %i
store i16 0, i16* %arrayidx, align 1
%inc = add nuw nsw i16 %i, 1
%cmp = icmp ult i16 %inc, 400
br i1 %cmp, label %for.body, label %for.end
for.end: ; preds = %for.body
%inc.lcssa = phi i16 [ %inc, %for.body ]
ret i16 %inc.lcssa
}
We should be able to figure out that the result is constant, but the patch
breaks it.
Differential Revision: https://reviews.llvm.org/D51584
llvm-svn: 346198
Let i8/i16 uint/sint to fp conversions cost 1 if operand is a load.
Since the load already does the extension, there is no extra cost (previously
returned 2).
Review: Ulrich Weigand
https://reviews.llvm.org/D54028
llvm-svn: 346009
Summary:
The hot and cold count thresholds are derived from the summary, but for
debugging purposes it is convenient to provide the actual thresholds.
Reviewers: davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54040
llvm-svn: 346005
Scalar i1 to fp conversions are done with a branch sequence, so it should
have a higher cost.
Review: Ulrich Weigand
https://reviews.llvm.org/D53924
llvm-svn: 345818
This factors out a new method getBoolVecToIntConversionCost() containing the
code for vector sext/zext of i1, in order to reuse it for i1 to double vector
conversions.
Review: Ulrich Weigand
https://reviews.llvm.org/D53923
llvm-svn: 345817
When rewriting loop exit values, IndVars considers this transform not profitable if
the loop instruction has a loop user which it believes cannot be optimized away.
In current implementation only calls that immediately use the instruction are considered
as such.
This patch extends the definition of "hard" users to any side-effecting instructions
(which usually cannot be optimized away from the loop) and also allows handling
of not just immediate users, but use chains.
Differentlai Revision: https://reviews.llvm.org/D51584
Reviewed By: etherzhhb
llvm-svn: 345814
When we calculate a product of 2 AddRecs, we end up making quite massive
computations to deduce the operands of resulting AddRec. This process can
be optimized by computing all args of intermediate sum and then calling
`getAddExpr` once rather than calling `getAddExpr` with intermediate
result every time a new argument is computed.
Differential Revision: https://reviews.llvm.org/D53189
Reviewed By: rtereshin
llvm-svn: 345813
Correct costings of SK_ExtractSubvector requires the SubTy argument to indicate the type/size of the extracted subvector.
Unlike the rest of the shuffle kinds this means that the main Ty argument represents the source vector type not the destination!
I've done my best to fix a number of vectorizer uses:
SLP - the reduction epilogue costs should be using a SK_PermuteSingleSrc shuffle as these all occur at the hardware vector width - we're not extracting (illegal) subvector types. This is causing the cost model diffs as SK_ExtractSubvector costs are poorly handled and tend to just return 1 at the moment.
LV - I'm not clear on what the SK_ExtractSubvector should represents for recurrences - I've used a <1 x ?> subvector extraction as that seems to match the VF delta.
Differential Revision: https://reviews.llvm.org/D53573
llvm-svn: 345617
Sub, SDiv and UDiv are not commutative, so only the RHS operand can fold a
load. This patch adds a check for this.
Review: Ulrich Weigand
https://reviews.llvm.org/D53791
llvm-svn: 345596
The SystemZ backend can do arithmetic of memory by loading and then extending
one of the operands. Similarly, a load + truncate can be folded into an
operand.
This patch improves the SystemZ TTI cost function to recognize this.
Review: Ulrich Weigand
https://reviews.llvm.org/D52692
llvm-svn: 345327
Enable the DAG optimization that converts vector div/rem with constants into
multiply+shifts sequences by expanding them early. This is needed since
ISD::SMUL_LOHI is 'Custom' lowered on SystemZ, and will therefore not be
available to BuildSDIV after legalization.
Better cost values for these instructions based on how they will be
implemented (a constant divisor is cheaper).
Review: Ulrich Weigand
https://reviews.llvm.org/D53196
llvm-svn: 345321