The only difference between these two is that VectorizerReport adds a
vectorizer-specific prefix to its messages. When LAA is used in the
vectorizer context the prefix is added when we promote the
LoopAccessReport into a VectorizerReport via one of the constructors.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229897
When I split out LoopAccessReport from this, I need to create some
temporaries, so constness becomes necessary.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229896
Also add the pass name as an argument to VectorizationReport::emitAnalysis.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229894
This is a function pass that runs the analysis on demand. The analysis
can be initiated by querying the loop access info via LAA::getInfo. It
either returns the cached info or runs the analysis.
Symbolic stride information continues to reside outside of this analysis
pass. We may move it inside later but it's not a priority for me right
now. The idea is that Loop Distribution won't support run-time stride
checking at least initially.
This means that when querying the analysis, symbolic stride information
can be provided optionally. A change in whether stride information is
used invalidates the cache entry and reruns the analysis. Note that if the
loop does not have any symbolic stride, the entry should be preserved
across Loop Distribution and LV.
Since currently the only user of the pass is LV, I just check that the
symbolic stride information didn't change when using a cached result.
On the LV side, LoopVectorizationLegality requests the info object
corresponding to the loop from the analysis pass. A large chunk of the
diff is due to LAI becoming a pointer from a reference.
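The caching logic is roughly of this shape (a sketch; the member and
accessor names here are illustrative, not the exact interface):

  const LoopAccessInfo &LoopAccessAnalysis::getInfo(Loop *L,
                                                    const ValueToValueMap &Strides) {
    std::unique_ptr<LoopAccessInfo> &Entry = LoopAccessInfoMap[L];
    // Rerun the analysis if there is no cached result or if the symbolic
    // stride information differs from what the cached result was computed
    // with; otherwise reuse the cache.
    if (!Entry || Entry->getSymbolicStrides() != Strides)
      Entry.reset(new LoopAccessInfo(L, SE, DL, TLI, AA, DT, Strides));
    return *Entry;
  }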
A test will be added as part of the -analyze patch.
Also tested that with AVX, we generate identical assembly output for the
testsuite (including the external testsuite) before and after.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229893
LAA will be an on-demand analysis pass, so we need to cache the result
of the analysis. canVectorizeMemory is renamed to analyzeLoop which
computes the result. canVectorizeMemory becomes the query function for
the cached result.
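In other words, the split is roughly this (a sketch; CanVecMem stands in
for whatever member caches the result):

  void LoopAccessInfo::analyzeLoop() {
    // ... run the memory dependence checks ...
    CanVecMem = DependencesAreSafe;   // illustrative
  }

  bool LoopAccessInfo::canVectorizeMemory() const { return CanVecMem; }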
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229892
The transformation passes will query the stashed report and then emit it
as part of their own report. LV, currently the only user, is modified to
do just that.
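Sketched (member names illustrative):

  Optional<LoopAccessReport> Report;

  void LoopAccessInfo::emitAnalysis(LoopAccessReport Message) {
    Report = std::move(Message);   // stash instead of emitting immediately
  }

The vectorizer can then fetch the stashed report via an accessor and emit
it under its own pass name.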
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229891
As LAA is becoming a pass, we can no longer pass the params to its
constructor. This changes the command line flags to have external
storage. These can now be accessed both from LV and LAA.
VectorizerParams is moved out of LoopAccessInfo in order to shorten the
code to access it.
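The external-storage pattern looks roughly like this (the exact flag
shown is illustrative):

  struct VectorizerParams {
    // Storage lives here so that both LV and LAA can read it.
    static unsigned VectorizationFactor;
  };
  unsigned VectorizerParams::VectorizationFactor;

  static cl::opt<unsigned, true> VectorizationFactorOpt(
      "force-vector-width", cl::Hidden,
      cl::location(VectorizerParams::VectorizationFactor),
      cl::desc("Sets the SIMD width. Zero is autoselect."));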
This commit also has the fix (D7731) to break the dependence cycle
between the analysis and vector libraries.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229890
This reverts commit r229651.
I'd like to ultimately revert r229650 but this reformat stands in the
way. I'll reformat the affected files once the loop-access pass is
fully committed.
llvm-svn: 229889
r229622: "[LoopAccesses] Make VectorizerParams global"
r229623: "[LoopAccesses] Stash the report from the analysis rather than emitting it"
r229624: "[LoopAccesses] Cache the result of canVectorizeMemory"
r229626: "[LoopAccesses] Create the analysis pass"
r229628: "[LoopAccesses] Change debug messages from LV to LAA"
r229630: "[LoopAccesses] Add canAnalyzeLoop"
r229631: "[LoopAccesses] Add missing const to APIs in VectorizationReport"
r229632: "[LoopAccesses] Split out LoopAccessReport from VectorizerReport"
r229633: "[LoopAccesses] Add -analyze support"
r229634: "[LoopAccesses] Change LAA:getInfo to return a constant reference"
r229638: "Analysis: fix buildbots"
llvm-svn: 229650
The only difference between these two is that VectorizerReport adds a
vectorizer-specific prefix to its messages. When LAA is used in the
vectorizer context the prefix is added when we promote the
LoopAccessReport into a VectorizerReport via one of the constructors.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229632
When I split out LoopAccessReport from this, I need to create some
temporaries, so constness becomes necessary.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229631
Also add the pass name as an argument to VectorizationReport::emitAnalysis.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229628
This is a function pass that runs the analysis on demand. The analysis
can be initiated by querying the loop access info via LAA::getInfo. It
either returns the cached info or runs the analysis.
Symbolic stride information continues to reside outside of this analysis
pass. We may move it inside later but it's not a priority for me right
now. The idea is that Loop Distribution won't support run-time stride
checking at least initially.
This means that when querying the analysis, symbolic stride information
can be provided optionally. A change in whether stride information is
used invalidates the cache entry and reruns the analysis. Note that if the
loop does not have any symbolic stride, the entry should be preserved
across Loop Distribution and LV.
Since currently the only user of the pass is LV, I just check that the
symbolic stride information didn't change when using a cached result.
On the LV side, LoopVectorizationLegality requests the info object
corresponding to the loop from the analysis pass. A large chunk of the
diff is due to LAI becoming a pointer from a reference.
A test will be added as part of the -analyze patch.
Also tested that with AVX, we generate identical assembly output for the
testsuite (including the external testsuite) before and after.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229626
blockNeedsPredication is in LoopAccess in order to share it with the
vectorizer. It's a utility needed by LoopAccess rather than strictly
provided by it, but it's a good place to share it. This makes the
function static so that it is no longer required to create a
LoopAccessInfo instance in order to access it from LV.
This was actually causing problems because it would have required
creating LAI much earlier than LV::canVectorizeMemory().
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229625
LAA will be an on-demand analysis pass, so we need to cache the result
of the analysis. canVectorizeMemory is renamed to analyzeLoop which
computes the result. canVectorizeMemory becomes the query function for
the cached result.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229624
The transformation passes will query the stashed report and then emit it
as part of their own report. LV, currently the only user, is modified to
do just that.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229623
As LAA is becoming a pass, we can no longer pass the params to its
constructor. This changes the command line flags to have external
storage. These can now be accessed both from LV and LAA.
VectorizerParams is moved out of LoopAccessInfo in order to shorten the
code to access it.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229622
LoopAccessAnalysis will be used as the name of the pass.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229621
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
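For example:

  // before
  if (F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
                                      Attribute::NoInline))
    ...
  // after
  if (F->hasFnAttribute(Attribute::NoInline))
    ...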
llvm-svn: 229202
This removes the old 'llvm/PassManager.h' header file at the top level of
LLVM's include tree and the use of using declarations to hide the
'legacy' namespace for the old pass manager.
This undoes the primary modules-hostile change I made to keep
out-of-tree targets building. I sent an email inquiring about whether
this would be reasonable to do at this phase and people seemed fine with
it, so making it a reality. This should allow us to start bootstrapping
with modules to a certain extent along with making it easier to mix and
match headers in general.
The updates to any code for users of LLVM are very mechanical. Switch
from including "llvm/PassManager.h" to "llvm/IR/LegacyPassManager.h".
Qualify the types which now produce compile errors with "legacy::". The
most common ones are "PassManager", "PassManagerBase", and
"FunctionPassManager".
llvm-svn: 229094
Apparently some code finally started to tickle this after my
canonicalization changes to instcombine.
The bug stems from trying to form a vector type out of scalars that
aren't compatible at all. In this example, from x86_mmx values. The code
in the vectorizer that checks for reasonable types was checking for
aggregates or vectors, but there are lots of other types that should
just never reach the vectorizer.
Debugging this was made more confusing by the lie in an assert in
VectorType::get() -- it isn't that the types are *primitive*. The types
must be integer, pointer, or floating point types. No other types are
allowed.
I've improved the assert and added a helper to the vectorizer to handle
the element type validity checks. It now re-uses the VectorType static
function and then further excludes weird target-specific types that we
probably shouldn't be touching here (x86_fp80 and ppc_fp128). Neither of
these are really reachable anyways (neither 80-bit nor 128-bit things
will get vectorized) but it seems better to just eagerly exclude such
nonsense.
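The helper is roughly of this shape:

  // Element types must be integer, pointer, or floating point, and not
  // one of the weird target-specific types we never want to vectorize.
  static bool isValidElementType(Type *Ty) {
    return VectorType::isValidElementType(Ty) && !Ty->isX86_FP80Ty() &&
           !Ty->isPPC_FP128Ty();
  }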
I've added a test case, but while it definitely covers two of the paths
through this code there may be more paths that would benefit from test
coverage. I'm not familiar enough with the SLP vectorizer to synthesize
test cases for all of these, but was able to update the code itself by
inspection.
llvm-svn: 228899
This allows IDEs to recognize the entire set of header files for
each of the core LLVM projects.
Differential Revision: http://reviews.llvm.org/D7526
Reviewed By: Chris Bieneman
llvm-svn: 228798
Summary:
The alias.scope metadata represents sets of things an instruction might
alias with. When generically combining the metadata from two
instructions the result must be the union of the original sets, because
the new instruction might alias with anything any of the original
instructions aliased with.
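A standalone sketch of the rule, modeling scope lists as sets of strings:

  #include <set>
  #include <string>

  using ScopeList = std::set<std::string>;

  // Combining alias.scope metadata must take the union: the combined
  // instruction might alias with anything either original aliased with.
  ScopeList combineAliasScopes(const ScopeList &A, const ScopeList &B) {
    ScopeList Result = A;
    Result.insert(B.begin(), B.end());
    return Result;
  }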
Reviewers: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D7490
llvm-svn: 228525
This will allow it to be shared with the new Loop Distribution pass.
getFirstInst is currently duplicated across LoopVectorize.cpp and
LoopAccessAnalysis.cpp. This is a short-term work-around until we figure out
a better solution.
NFC. (The code moved is adjusted a bit for the name of the Loop member and
that PtrRtCheck is now a reference rather than a pointer.)
llvm-svn: 228418
I've noticed this while trying to move addRuntimeCheck to LoopAccessAnalysis.
I think that the intention was to early exit from the overflow checking before
the code for the memchecks. This is the entire reason why we compute
FirstCheckInst, but then we split at the final check instead of at
FirstCheckInst. Looks like an oversight.
llvm-svn: 228056
The commit r225977 uncovered this bug. The problem was that the vectorizer tried to
read the second operand of an already deleted instruction.
The bug didn't show up before r225977 because the freed memory still contained a non-null pointer.
With r225977 deletion of instructions is delayed and the read operand pointer is always null.
llvm-svn: 227800
Other than moving code and adding the boilerplate for the new files, the code
being moved is unchanged.
There are a few global functions that are shared with the rest of the
LoopVectorizer. I moved these to the new module as well (emitLoopAnalysis,
stripIntegerCast, replaceSymbolicStrideSCEV) along with the Report class used
by emitLoopAnalysis. There is probably room for further improvement in this
area.
I kept DEBUG_TYPE "loop-vectorize" because it's used as the PassName with
emitOptimizationRemarkAnalysis. This will obviously have to change.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227756
This class needs to remain public because it's used by
LoopVectorizationLegality::addRuntimeCheck.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227755
Rather than using globals use a structure to pass parameters from the
vectorizer. This prepares the class to be moved outside the LoopVectorizer.
It's not great how all this is passed through in LoopAccessAnalysis but this
is all expected to change once the class starts servicing the Loop Distribution
pass as well where some of these parameters make no sense.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227754
Move the canVectorizeMemory functionality from LoopVectorizationLegality to a
new class LoopAccessAnalysis and forward users.
Currently the collection of the symbolic stride information is kept with
LoopVectorizationLegality and it becomes an input to LoopAccessAnalysis.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227751
These members are moving to LoopAccessAnalysis. The accessors help to hide
this.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227750
This class will become public in the new LoopAccessAnalysis header so the name
needs to be more global.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227749
The logic in emitAnalysis is duplicated across multiple functions. This
splits it into a function. Another use will be added by the patchset.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227748
RuntimePointerCheck will be used through LoopAccessAnalysis in
LoopVectorizationLegality. Later in the patchset it will become a local class
of LoopAccessAnalysis.
NFC. This is part of the patchset that splits out the memory dependence logic
from LoopVectorizationLegality into a new class LoopAccessAnalysis.
LoopAccessAnalysis will be used by the new Loop Distribution pass.
llvm-svn: 227747
This threads a function argument through all the callers of the
getTTI method used to get an actual TTI object.
No functionality changed. This just threads the argument and ensures
code like the inliner can correctly look up the callee's TTI rather than
using a fixed one.
The next change will use this to implement per-function subtarget usage
by TTI. The changes after that should eliminate the need for FTTI as that
will have become the default.
llvm-svn: 227730
This changes the TTI analysis to use a polymorphic
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting into the right form.
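A standalone sketch of the type-erasure pattern (not LLVM's actual
classes):

  #include <memory>
  #include <utility>

  class TTI {
    struct Concept {                          // type-erased interface
      virtual ~Concept() = default;
      virtual unsigned getOpCost(unsigned Opcode) = 0;
    };
    template <typename ImplT> struct Model final : Concept {
      ImplT Impl;
      Model(ImplT I) : Impl(std::move(I)) {}
      unsigned getOpCost(unsigned Opcode) override {
        return Impl.getOpCost(Opcode);        // delegate to the concrete impl
      }
    };
    std::unique_ptr<Concept> Impl;

  public:
    // Wrap either a target-specific implementation or the IR-level dummy.
    template <typename ImplT>
    TTI(ImplT I) : Impl(new Model<ImplT>(std::move(I))) {}
    unsigned getOpCost(unsigned Opcode) { return Impl->getOpCost(Opcode); }
  };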
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is a *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
Previously, only -1 and +1 step values were supported for induction variables. This patch extends LV to support
arbitrary constant steps.
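For example, a loop like this (illustrative) can now be vectorized even
though its induction step is neither -1 nor +1:

  void scale(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 3)   // constant step of +3
      a[i] = b[i] * 2.0f;
  }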
Initial patch by Alexey Volkov. Some bug fixes are added in the following version.
Differential Revision: http://reviews.llvm.org/D6051 and http://reviews.llvm.org/D7193
llvm-svn: 227557
Even with the current limit on the number of alias checks, the containing loop has quadratic complexity.
This begins to hurt for blocks containing > 1K load/store instructions.
This commit introduces a limit for the loop count. It reduces the runtime for such very large blocks.
llvm-svn: 226792
This patch fixes 2 issues in reorderInputsAccordingToOpcode:
1) AllSameOpcodeLeft and AllSameOpcodeRight were being calculated incorrectly, resulting in code not being vectorized in a few cases.
2) Adds logic to reorder operands if we get a longer chain of consecutive loads, enabling vectorization. Handled the same for cases where we have AltOpcode.
Thanks Michael for inputs and review.
Review: http://reviews.llvm.org/D6677
llvm-svn: 226547
In case of blocks with many memory-accessing instructions, alias checking can take a lot of time
(because calculating the memory dependencies has quadratic complexity).
I chose a limit which resulted in no changes when running the benchmarks.
llvm-svn: 226439
It is now cleaner to derive from the generic base.
This removes a ton of boilerplate code and somewhat strange and
pointless indirections. It also removes a bunch of the previously needed
friend declarations. To fully remove these, I also lifted the verify
logic into the generic LoopInfoBase, which seems good anyways -- it is
generic and useful logic even for the machine side.
llvm-svn: 226385
This splits the LoopInfo object apart from the legacy pass, creating
a LoopInfoWrapperPass to wire the object up to the legacy pass manager.
This switches all the clients of LoopInfo over and paves the way to port
LoopInfo to the new pass manager. No functionality change is intended
with this iteration.
llvm-svn: 226373
This patch was generated by a clang tidy checker that is being open sourced.
The documentation of that checker is the following:
/// The emptiness of a container should be checked using the empty method
/// instead of the size method. It is not guaranteed that size is a
/// constant-time function, and it is generally more efficient and also shows
/// clearer intent to use empty. Furthermore some containers may implement the
/// empty method but not implement the size method. Using empty whenever
/// possible makes it easier to switch to another container in the future.
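For example:

  // before
  if (Worklist.size() == 0)
    return;
  // after
  if (Worklist.empty())
    return;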
Patch by Gábor Horváth!
llvm-svn: 226161
The pass is really just a means of accessing a cached instance of the
TargetLibraryInfo object, and this way we can re-use that object for the
new pass manager as its result.
Lots of delta, but nothing interesting happening here. This is the
common pattern that is developing to allow analyses to live in both the
old and new pass manager -- a wrapper pass in the old pass manager
emulates the separation intrinsic to the new pass manager between the
result and pass for analyses.
llvm-svn: 226157
This speeds up the dependency calculations for blocks with many load/store/call instructions.
Besides the improved runtime, there is no functional change.
Compared to the original commit, this re-applied commit contains a bug fix which ensures that there are
no incorrect collisions in the alias cache.
llvm-svn: 225977
The issue was introduced in r214638:
+ for (auto &BSIter : BlocksSchedules) {
+ scheduleBlock(BSIter.second.get());
+ }
Because BlocksSchedules is a DenseMap with BasicBlock* keys, blocks are
scheduled in non-deterministic order, resulting in unpredictable IR.
Patch by Daniel Reynaud!
llvm-svn: 225821
The alias cache has a problem of incorrect collisions in case a new instruction is allocated at the same address as a previously deleted instruction.
llvm-svn: 225790
This speeds up the dependency calculations for blocks with many load/store/call instructions.
Besides the improved runtime, there is no functional change.
llvm-svn: 225786
{code}
// loop body
... = a[i] (1)
... = a[i+1] (2)
.......
a[i+1] = .... (3)
a[i] = ... (4)
{code}
The algorithm tries to collect memory access candidates from AliasSetTracker, and then checks memory dependences against one another. The memory accesses are unique in AliasSetTracker, and a single memory access in AliasSetTracker may map to multiple entries in AccessAnalysis, which could cover both 'read' and 'write'. Originally the algorithm only checked the 'write' entry in Accesses if only a 'write' existed. This is incorrect, and the consequence is that it ignored all read accesses and finally missed some RAW and WAR dependences.
For the case given above, if we ignore two reads, the dependence between (1) and (3) would not be able to be captured, and finally this loop will be incorrectly vectorized.
The fix simply inserts a new loop to find all entries in Accesses. Since it will skip most other memory accesses by checking the Value pointer at the very beginning of the loop, it should not increase compile time visibly.
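The shape of the fix is roughly (a sketch; the container and variable
names are illustrative):

  // Visit every (pointer, is-write) entry for this pointer, not just the
  // write entry, so that read accesses -- and thus RAW and WAR
  // dependences -- are no longer skipped.
  for (const MemAccessInfo &A : Accesses) {
    if (A.getPointer() != Ptr)   // cheap filter on the Value pointer
      continue;
    CandidateAccesses.push_back(A);
  }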
llvm-svn: 225159
This splits the assumption tracking into two separate APIs:
a cache of assumptions for a single function, and an immutable pass that
manages those caches.
The motivation for this change is twofold. Immutable analyses are
really hacks around the current pass manager design and don't exist in
the new design. This is usually OK, but it requires that the core logic
of an immutable pass be reasonably partitioned off from the pass logic.
This change does precisely that. As a consequence it also paves the way
for the *many* utility functions that deal in the assumptions to live in
both pass manager worlds by creating a separate non-pass object with
its own independent API that they all rely on. Now, the only bits of the
system that deal with the actual pass mechanics are those that actually
need to deal with the pass mechanics.
Once this separation is made, several simplifications become pretty
obvious in the assumption cache itself. Rather than using a set and
callback value handles, it can just be a vector of weak value handles.
The callers can easily skip the handles that are null, and eventually we
can wrap all of this up behind a filter iterator.
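Sketched (member names illustrative):

  class AssumptionCache {
    // Weak value handles: an entry becomes null when its assumption is
    // deleted, and callers simply skip the null handles.
    std::vector<WeakVH> Assumptions;

  public:
    void registerAssumption(CallInst *CI) { Assumptions.push_back(CI); }
    std::vector<WeakVH> &assumptions() { return Assumptions; }
  };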
For now, this adds boilerplate to the various passes, but this kind of
boilerplate will end up making it possible to port these passes to the
new pass manager, and so it will end up factored away pretty reasonably.
llvm-svn: 225131
The loop vectorizer optimizes loops containing conditional memory
accesses by generating masked load and store intrinsics.
This decision is target dependent.
http://reviews.llvm.org/D6527
llvm-svn: 224334
Split `Metadata` away from the `Value` class hierarchy, as part of
PR21532. Assembly and bitcode changes are in the wings, but this is the
bulk of the change for the IR C++ API.
I have a follow-up patch prepared for `clang`. If this breaks other
sub-projects, I apologize in advance :(. Help me compile it on Darwin
and I'll try to fix it. FWIW, the errors should be easy to fix, so it may
be simpler to just fix it yourself.
This breaks the build for all metadata-related code that's out-of-tree.
Rest assured the transition is mechanical and the compiler should catch
almost all of the problems.
Here's a quick guide for updating your code:
- `Metadata` is the root of a class hierarchy with three main classes:
`MDNode`, `MDString`, and `ValueAsMetadata`. It is distinct from
the `Value` class hierarchy. It is typeless -- i.e., instances do
*not* have a `Type`.
- `MDNode`'s operands are all `Metadata *` (instead of `Value *`).
- `TrackingVH<MDNode>` and `WeakVH` referring to metadata can be
replaced with `TrackingMDNodeRef` and `TrackingMDRef`, respectively.
If you're referring solely to resolved `MDNode`s -- post graph
construction -- just use `MDNode*`.
- `MDNode` (and the rest of `Metadata`) have only limited support for
`replaceAllUsesWith()`.
As long as an `MDNode` is pointing at a forward declaration -- the
result of `MDNode::getTemporary()` -- it maintains a side map of its
uses and can RAUW itself. Once the forward declarations are fully
resolved RAUW support is dropped on the ground. This means that
uniquing collisions on changing operands cause nodes to become
"distinct". (This already happened fairly commonly, whenever an
operand went to null.)
If you're constructing complex (non self-reference) `MDNode` cycles,
you need to call `MDNode::resolveCycles()` on each node (or on a
top-level node that somehow references all of the nodes). Also,
don't do that. Metadata cycles (and the RAUW machinery needed to
construct them) are expensive.
- An `MDNode` can only refer to a `Constant` through a bridge called
`ConstantAsMetadata` (one of the subclasses of `ValueAsMetadata`).
As a side effect, accessing an operand of an `MDNode` that is known
to be, e.g., `ConstantInt`, takes three steps: first, cast from
`Metadata` to `ConstantAsMetadata`; second, extract the `Constant`;
third, cast down to `ConstantInt`.
The eventual goal is to introduce `MDInt`/`MDFloat`/etc. and have
metadata schema owners transition away from using `Constant`s when
the type isn't important (and they don't care about referring to
`GlobalValue`s).
In the meantime, I've added transitional API to the `mdconst`
namespace that matches semantics with the old code, in order to
avoid adding the error-prone three-step equivalent to every call
site. If your old code was:
MDNode *N = foo();
bar(isa <ConstantInt>(N->getOperand(0)));
baz(cast <ConstantInt>(N->getOperand(1)));
bak(cast_or_null <ConstantInt>(N->getOperand(2)));
bat(dyn_cast <ConstantInt>(N->getOperand(3)));
bay(dyn_cast_or_null<ConstantInt>(N->getOperand(4)));
you can trivially match its semantics with:
MDNode *N = foo();
bar(mdconst::hasa <ConstantInt>(N->getOperand(0)));
baz(mdconst::extract <ConstantInt>(N->getOperand(1)));
bak(mdconst::extract_or_null <ConstantInt>(N->getOperand(2)));
bat(mdconst::dyn_extract <ConstantInt>(N->getOperand(3)));
bay(mdconst::dyn_extract_or_null<ConstantInt>(N->getOperand(4)));
and when you transition your metadata schema to `MDInt`:
MDNode *N = foo();
bar(isa <MDInt>(N->getOperand(0)));
baz(cast <MDInt>(N->getOperand(1)));
bak(cast_or_null <MDInt>(N->getOperand(2)));
bat(dyn_cast <MDInt>(N->getOperand(3)));
bay(dyn_cast_or_null<MDInt>(N->getOperand(4)));
- A `CallInst` -- specifically, intrinsic instructions -- can refer to
metadata through a bridge called `MetadataAsValue`. This is a
subclass of `Value` where `getType()->isMetadataTy()`.
`MetadataAsValue` is the *only* class that can legally refer to a
`LocalAsMetadata`, which is a bridged form of non-`Constant` values
like `Argument` and `Instruction`. It can also refer to any other
`Metadata` subclass.
(I'll break all your testcases in a follow-up commit, when I propagate
this change to assembly.)
llvm-svn: 223802
Remove an unnecessary `MDNode::replaceAllUsesWith()`. In the preceding
line, `TheLoop->setLoopID()` visits all backedges and sets the new loop
ID. This sufficiently updates the loop metadata.
Metadata RAUW is going away as part of PR21532.
llvm-svn: 223210
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot. I'll respond to the commit on the
list with a reproduction of one of the failures.
Conflicts:
lib/Target/X86/X86TargetTransformInfo.cpp
llvm-svn: 222936
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
declare <16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
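An example (illustrative) of the kind of loop this enables:

  void foo(const int *trigger, int *a, const int *b, int n) {
    for (int i = 0; i < n; ++i)
      if (trigger[i] > 0)   // the conditional access becomes a masked
        a[i] = b[i] + 1;    // load/store instead of blocking vectorization
  }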
http://reviews.llvm.org/D6191
llvm-svn: 222632
This is to be consistent with StringSet and ultimately with the standard
library's associative container insert function.
This led to updating SmallSet::insert to return pair<iterator, bool>,
and then to update SmallPtrSet::insert to return pair<iterator, bool>,
and then to update all the existing users of those functions...
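For example:

  SmallPtrSet<Value *, 8> Visited;
  if (Visited.insert(V).second) {
    // V was newly inserted; this is the first time we have seen it.
  }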
llvm-svn: 222334
Hide the fact that `MDString`'s string is stored in `Value::Name` --
that's going to change soon. Update the only in-tree client that was
using it instead of `Value::getString()`.
Part of PR21532.
llvm-svn: 221951
Instead, we're going to separate metadata from the Value hierarchy. See
PR21532.
This reverts commit r221375.
This reverts commit r221373.
This reverts commit r221359.
This reverts commit r221167.
This reverts commit r221027.
This reverts commit r221024.
This reverts commit r221023.
This reverts commit r220995.
This reverts commit r220994.
llvm-svn: 221711
A pointer's pointee might not be sized: the pointee could be a function.
Report this as IK_NoInduction when calculating isInductionVariable.
This fixes PR21508.
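The guard is roughly (a sketch):

  Type *PointeeTy = PtrTy->getPointerElementType();
  if (!PointeeTy->isSized())
    return IK_NoInduction;   // e.g. a pointer to a function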
llvm-svn: 221501
Change `Instruction::getMetadata()` to return `Value` as part of
PR21433.
Update most callers to use `Instruction::getMDNode()`, which wraps the
result in a `cast_or_null<MDNode>`.
llvm-svn: 221024
These are named following the IEEE-754 names for these
functions, rather than the libm fmin / fmax, to avoid
possible ambiguities. Some languages may implement something
resembling fmin / fmax which returns NaN if either operand is
NaN, in order to propagate errors. These implement the IEEE-754
semantics of returning the other operand if either is a NaN,
representing missing data.
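A standalone model of the semantics (minnum shown; maxnum is symmetric):

  #include <cmath>

  double minnum(double A, double B) {
    if (std::isnan(A)) return B;   // a NaN stands for missing data,
    if (std::isnan(B)) return A;   // so prefer the other operand
    return A < B ? A : B;
  }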
llvm-svn: 220341
The SLP vectorizer should not vectorize ephemeral values. These are used to
express information to the optimizer, and vectorizing them does not lead to
faster code (because the ephemeral values are dropped prior to code generation,
vectorized or not), and obscures the information the instructions are
attempting to communicate (the logic that interprets the arguments to
@llvm.assume generically does not understand vectorized conditions).
Also, uses by ephemeral values are free (because they, and the necessary
extractelement instructions, will be dropped prior to code generation).
llvm-svn: 219816
A few minor changes to prevent @llvm.assume from interfering with loop
vectorization. First, treat @llvm.assume like the lifetime intrinsics, which
are scalarized (but don't otherwise interfere with the legality checking).
Second, ignore the cost of ephemeral instructions in the loop (these will go
away anyway during CodeGen).
Alignment assumptions and other uses of @llvm.assume can often end up inside of
loops that should be vectorized (this is not uncommon for assumptions generated
by __attribute__((align_value(n))), for example).
llvm-svn: 219741
This fixes one more caller blindly passing the latch block to
getSmallConstantTripCount even when it isn't the exiting block.
I missed this in my first audit, very sorry. This was found in LNT and
elsewhere. I don't have a test case, but it was completely obvious from
inspection that this was the problem. I'll see if I can reduce a test
case, but I'm not really hopeful, and the value seems quite low.
llvm-svn: 219562
Add asserts to the trip count computation
routines and fix all of the bugs they expose.
I hit a test case that crashed even without these asserts due to passing
a non-exiting latch to the ExitingBlock parameter of the trip count
computation machinery. However, when I add the nice asserts, it turns
out we have plenty of coverage of these bugs, they just didn't manifest
in crashers.
The core problem seems to stem from an assumption that the latch *is*
the exiting block. While this is often true, and somewhat the "normal"
way to think about loops, it isn't necessarily true. The correct way to
call the trip count routines in a *generic* fashion (that is, without
a particular exit in mind) is to just use the loop's single exiting
block if it has one. The trip count can't be computed generically unless
it does. This works great for the loop vectorizer. The loop unroller
actually *wants* to select the latch when it has to choose between
multiple exits because for unrolling it is the latch trips that matter.
But if this is the desire, it needs to explicitly guard for non-exiting
latches and check for the generic trip count in that case.
I've added the asserts, and added convenience APIs for querying the trip
count generically that check for a single exit block. I've kept the APIs
consistent between computing trip count and trip multiples.
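The generic query is roughly (a sketch of the convenience API described):

  unsigned ScalarEvolution::getSmallConstantTripCount(Loop *L) {
    // Only a single exiting block allows a generic trip count.
    BasicBlock *ExitingBB = L->getExitingBlock();   // null if not unique
    if (!ExitingBB)
      return 0;
    return getSmallConstantTripCount(L, ExitingBB);
  }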
Thanks to Mark for the help debugging and tracking down the *right* fix
here!
llvm-svn: 219550
"Unroll" is not the appropriate name for this variable. Clang already uses
the term "interleave" in pragmas and metadata for this.
Differential Revision: http://reviews.llvm.org/D5066
llvm-svn: 217528
The SLP vectorizer should propagate IR-level optimization hints/flags (nsw, nuw, exact, fast-math)
when converting scalar instructions into vectors. But this isn't a simple copy - we need to take
the intersection (the logical 'and') of the sets of flags on the scalars.
The solution is further complicated because we can have non-uniform (non-SIMD) vector ops after:
http://reviews.llvm.org/D4015 and http://llvm.org/viewvc/llvm-project?view=revision&revision=211339
The vast majority of changed files are existing tests that were not propagating IR flags, but I've
also added a new test file for focused testing of IR flag possibilities.
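The intersection step can be written with the 'andIRFlags' helper
mentioned in the next entry (illustrative):

  static void propagateIRFlags(Instruction *VecOp, ArrayRef<Value *> Scalars) {
    for (Value *V : Scalars)
      if (auto *I = dyn_cast<Instruction>(V))
        VecOp->andIRFlags(I);   // keep only the flags common to all scalars
  }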
Differential Revision: http://reviews.llvm.org/D5172
llvm-svn: 217051
Adding 'IR' to the names in an attempt to be less ambiguous about the flags we're dealing with here.
The 'and' method is needed by the SLPVectorizer (PR20802) and possibly other passes.
llvm-svn: 217004
The loop vectorizer preserves wrapping, exact, and fast-math properties of scalar instructions.
This patch adds a convenience method to make that operation easier because we need to do this
in the loop vectorizer, SLP vectorizer, and possibly other places.
Although this is a 'no functional change' patch, I've added a testcase to verify that the exact
flag is preserved by the loop vectorizer. The wrapping and fast-math flags are already checked
in existing testcases.
Differential Revision: http://reviews.llvm.org/D5138
llvm-svn: 216886
Previously, the hint mechanism relied on clean up passes to remove redundant
metadata, which still showed up if running opt at low levels of optimization.
That also has shown that multiple nodes of the same type, but with different
values could still coexist, even if temporary, and cause confusion if the
next pass got the wrong value.
This patch makes sure that, if metadata already exists in a loop, the hint
mechanism will never append a new node, but always replace the existing one.
It also enhances the algorithm to cope with more metadata types in the future
by just adding a new type, not a lot of code.
Re-applying again due to MSVC 2013 being minimum requirement, and this patch
having C++11 that MSVC 2012 didn't support.
Fixes PR20655.
llvm-svn: 216870
For a detailed description of the problem see the comment in the test file.
The problematic moveBefore() calls are not required anymore because the new
scheduling algorithm ensures a correct ordering anyway.
llvm-svn: 216656
This patch adds support to recognize division by a uniform power of 2 and modifies the cost table to vectorize division by a uniform power of 2 whenever possible.
Updates the cost model for the Loop and SLP Vectorizers. The cost table is currently only updated for the X86 backend.
Thanks to Hal, Andrea, Sanjay for the review. (http://reviews.llvm.org/D4971)
llvm-svn: 216371
In unreachable blocks it's legal to have instructions like "%x = op %x".
Such instructions are not schedulable. Therefore the SLPVectorizer has to check for
unreachable blocks and ignore them.
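The check can be as simple as (a sketch):

  // Blocks not reachable from the entry cannot be scheduled; ignore them.
  if (!DT->isReachableFromEntry(BB))
    return;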
Fixes bug 20646.
llvm-svn: 216256
If we have a scalar reduction, we can increase the critical path length if the loop we're unrolling is inside another loop. Limit it, by default, to 2, so the critical path only gets increased by one reduction operation.
llvm-svn: 216140
Previously, the hint mechanism relied on clean up passes to remove redundant
metadata, which still showed up if running opt at low levels of optimization.
That also has shown that multiple nodes of the same type, but with different
values could still coexist, even if temporary, and cause confusion if the
next pass got the wrong value.
This patch makes sure that, if metadata already exists in a loop, the hint
mechanism will never append a new node, but always replace the existing one.
It also enhances the algorithm to cope with more metadata types in the future
by just adding a new type, not a lot of code.
llvm-svn: 215994
Some types, such as 128-bit vector types on AArch64, don't have any callee-saved registers. So if a value needs to stay live over a callsite, it must be spilled and refilled. This cost is now taken into account.
llvm-svn: 214859
When the cost model determines vectorization is not possible/profitable these remarks print an analysis of that decision.
Note that in selectVectorizationFactor() we can assume that OptForSize and ForceVectorization are mutually exclusive.
Reviewed by Arnold Schwaighofer
llvm-svn: 214599
The current remark is ambiguous and makes it sounds like explicitly specifying vectorization will allow the loop to be vectorized. This is not the case. The improved remark directs the user to -Rpass-analysis=loop-vectorize to determine the cause of the pass-miss.
Reviewed by Arnold Schwaighofer
llvm-svn: 214445
This commit adds scoped noalias metadata. The primary motivations for this
feature are:
1. To preserve noalias function attribute information when inlining
2. To provide the ability to model block-scope C99 restrict pointers
Neither of these two abilities are added here, only the necessary
infrastructure. In fact, there should be no change to existing functionality,
only the addition of new features. The logic that converts noalias function
parameters into this metadata during inlining will come in a follow-up commit.
What is added here is the ability to generally specify noalias memory-access
sets. Regarding the metadata, alias-analysis scopes are defined similar to TBAA
nodes:
!scope0 = metadata !{ metadata !"scope of foo()" }
!scope1 = metadata !{ metadata !"scope 1", metadata !scope0 }
!scope2 = metadata !{ metadata !"scope 2", metadata !scope0 }
!scope3 = metadata !{ metadata !"scope 2.1", metadata !scope2 }
!scope4 = metadata !{ metadata !"scope 2.2", metadata !scope2 }
Loads and stores can be tagged with an alias-analysis scope, and also, with a
noalias tag for a specific scope:
... = load %ptr1, !alias.scope !{ !scope1 }
... = load %ptr2, !alias.scope !{ !scope1, !scope2 }, !noalias !{ !scope1 }
When evaluating an aliasing query, if one of the instructions is associated
with an alias.scope id that is identical to the noalias scope associated with
the other instruction, or is a descendant (in the scope hierarchy) of the
noalias scope associated with the other instruction, then the two memory
accesses are assumed not to alias.
Note that if the first element of the scope metadata is a string, then it can
be combined across functions and translation units. The string can be replaced
by a self-reference to create globally unique scope identifiers.
[Note: This overview is slightly stylized, since the metadata nodes really need
to just be numbers (!0 instead of !scope0), and the scope lists are also global
unnamed metadata.]
Existing noalias metadata in a callee is "cloned" for use by the inlined code.
This is necessary because the aliasing scopes are unique to each call site
(because of possible control dependencies on the aliasing properties). For
example, consider a function: foo(noalias a, noalias b) { *a = *b; } that gets
inlined into bar() { ... if (...) foo(a1, b1); ... if (...) foo(a2, b2); } --
now just because we know that a1 does not alias with b1 at the first call site,
and a2 does not alias with b2 at the second call site, we cannot let inlining
these functions have the metadata imply that a1 does not alias with b2.
llvm-svn: 213864
In order to enable the preservation of noalias function parameter information
after inlining, and the representation of block-level __restrict__ pointer
information (etc.), additional kinds of aliasing metadata will be introduced.
This metadata needs to be carried around in AliasAnalysis::Location objects
(and MMOs at the SDAG level), and so we need to generalize the current scheme
(which is hard-coded to just one TBAA MDNode*).
This commit introduces only the necessary refactoring to allow for the
introduction of other aliasing metadata types, but does not actually introduce
any (that will come in a follow-up commit). What it does introduce is a new
AAMDNodes structure to hold all of the aliasing metadata nodes associated with
a particular memory-accessing instruction, and uses that structure instead of
the raw MDNode* in AliasAnalysis::Location, etc.
No functionality change intended.
llvm-svn: 213859
Prior to this change, the loop vectorizer did not make use of the alias
analysis infrastructure. Instead, it performed memory dependence analysis using
ScalarEvolution-based linear dependence checks within equivalence classes
derived from the results of ValueTracking's GetUnderlyingObjects.
Unfortunately, this meant that:
1. The loop vectorizer had logic that essentially duplicated that in BasicAA
for aliasing based on identified objects.
2. The loop vectorizer could not partition the space of dependency checks
based on information only easily available from within AA (TBAA metadata is
currently the prime example).
This means, for example, regardless of whether -fno-strict-aliasing was
provided, the vectorizer would only vectorize this loop with a runtime
memory-overlap check:
void foo(int *a, float *b) {
for (int i = 0; i < 1600; ++i)
a[i] = b[i];
}
This is suboptimal because the TBAA metadata already provides the information
necessary to show that this check is unnecessary. Of course, the vectorizer has a
limit on the number of such checks it will insert, so in practice, ignoring
TBAA means not vectorizing more-complicated loops that we should.
This change causes the vectorizer to use an AliasSetTracker to keep track of
the pointers in the loop. The resulting alias sets are then used to partition
the space of dependency checks, and potential runtime checks; this results in
more-efficient vectorizations.
When pointer locations are added to the AliasSetTracker, two things are done:
1. The location size is set to UnknownSize (otherwise you'd not catch
inter-iteration dependencies)
2. For instructions in blocks that would need to be predicated, TBAA is
removed (because the metadata might have a control dependency on the condition
being speculated).
For non-predicated blocks, you can leave the TBAA metadata. This is safe
because you can't have an iteration dependency on the TBAA metadata (if you
did, and you unrolled sufficiently, you'd end up with the same pointer value
used by two accesses that TBAA says should not alias, and that would yield
undefined behavior).
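In code, registering a pointer then looks roughly like (a sketch):

  // UnknownSize makes the location conservatively large so that accesses
  // from different iterations still fall into the same alias set.
  AST.add(Ptr, AliasAnalysis::UnknownSize, TBAAInfo);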
llvm-svn: 213486
Summary: This patch introduces two new iterator ranges and updates existing code to use them. No functional change intended.
Test Plan: All tests (make check-all) still pass.
Reviewers: dblaikie
Reviewed By: dblaikie
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D4481
llvm-svn: 213474
IRBuilder has CreateAligned(Load|Store) functions; use them and we don't need
to make a second call to setAlignment.
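For example:

  // before
  LoadInst *LI = Builder.CreateLoad(Ptr);
  LI->setAlignment(Alignment);
  // after
  Value *V = Builder.CreateAlignedLoad(Ptr, Alignment);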
No functionality change intended.
llvm-svn: 213453
There are some kinds of metadata that are safe to propagate from the scalar
instructions to the vector instructions (fpmath and tbaa currently).
Regarding TBAA, one might worry about propagating it on if-converted loads and
stores, because the metadata might have had a control dependency on the
condition, and thus actually aliased with some other non-speculated memory
access when the condition was false. However, this would be caught by the
runtime overlap checks.
llvm-svn: 213452
This patch modifies the existing DiagnosticInfo system to create a generic base
class that is inherited to produce diagnostic-based warnings. This is used by
the loop vectorizer to trigger a warning when vectorization is forced and
fails. Several tests have been added to verify this behavior.
Reviewed by: Arnold Schwaighofer
llvm-svn: 213110
string_ostream is a safe and efficient string builder that combines opaque
stack storage with a built-in ostream interface.
small_string_ostream<bytes> additionally permits an explicit stack storage size
other than the default 128 bytes to be provided. Beyond that, storage is
transferred to the heap.
This convenient class can be used in most places an
std::string+raw_string_ostream pair or SmallString<>+raw_svector_ostream pair
would previously have been used, in order to guarantee consistent access
without byte truncation.
The patch also converts much of LLVM to use the new facility. These changes
include several probable bug fixes for truncated output, a programming error
that's no longer possible with the new interface.
llvm-svn: 211749
[LLVM part]
These patches rename the loop unrolling and loop vectorizer metadata
such that they have a common 'llvm.loop.' prefix. Metadata name
changes:
llvm.vectorizer.* => llvm.loop.vectorize.*
llvm.loopunroll.* => llvm.loop.unroll.*
This was a suggestion from an earlier review
(http://reviews.llvm.org/D4090) which added the loop unrolling
metadata.
Patch by Mark Heffernan.
llvm-svn: 211710
This patch adds support to recognize patterns such as fadd,fsub,fadd,fsub.../add,sub,add,sub... and
vectorizes them as vector shuffles if they are profitable.
These patterns of vector shuffle can later be converted to instructions such as addsubpd etc on X86.
Thanks to Arnold and Hal for the reviews. http://reviews.llvm.org/D4015
llvm-svn: 211339
If we have common uses on separate paths in the tree, process the one with greater common depth first.
This makes sure that we do not assume we need to extract a load when it is actually going to be part of a vectorized tree.
Review: http://reviews.llvm.org/D3800
llvm-svn: 210310
This patch adds support to vectorize intrinsics such as powi, cttz and ctlz in the Vectorizer. These intrinsics are different from other
intrinsics, as the second argument to these functions must be the same in order to vectorize them, and it should be represented as a scalar.
Review: http://reviews.llvm.org/D3851#inline-32769 and http://reviews.llvm.org/D3937#inline-32857
llvm-svn: 209873
The loop vectorizer instantiates backedge-taken-count + 1 as the loop iteration count.
If this expression overflowed, the generated code was invalid.
In case of overflow the code now jumps to the scalar loop.
Fixes PR17288.
llvm-svn: 209854
Summary:
This adds two new diagnostics: -pass-remarks-missed and
-pass-remarks-analysis. They take the same values as -pass-remarks but
are intended to be triggered in different contexts.
-pass-remarks-missed is used by LLVMContext::emitOptimizationRemarkMissed,
which passes call when they tried to apply a transformation but
couldn't.
-pass-remarks-analysis is used by LLVMContext::emitOptimizationRemarkAnalysis,
which passes call when they want to inform the user about analysis
results.
The patch also:
1- Adds support in the inliner for the two new remarks and a
test case.
2- Moves emitOptimizationRemark* functions to the llvm namespace.
3- Adds an LLVMContext argument instead of making them member functions
of LLVMContext.
Reviewers: qcolombet
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3682
llvm-svn: 209442
Turns out that there is a very cheap way of testing whether a block is dead,
just look it up in the DomTree. We have to do this anyways so just ignore
unreachable blocks before sorting by domination. This restores a proper
ordering for std::stable_sort when dead code is present.
Covered by existing tests & buildbots running in STL debug mode (MSVC).
llvm-svn: 208492
There is no total ordering if the CFG is disconnected. We don't care if we
catch all CSE opportunities in dead code either, so just ignore them in
the assert.
PR19646
llvm-svn: 208461
1) Fix for printing debug locations for absolute paths.
2) Location printing is moved into public method DebugLoc::print() to avoid re-inventing the wheel.
Differential Revision: http://reviews.llvm.org/D3513
llvm-svn: 208177
We can't assume a vectorized tree is rooted in an instruction. The IRBuilder
could have constant folded it. When we rebuild the build_vector (the series of
InsertElement instructions) use the last original InsertElement instruction. The
vectorized tree root is guaranteed to be before it.
Also, we can't assume that the n-th InsertElement inserts the n-th element into
a vector.
This reverts r207746 which reverted the revert of the revert of r205018 or so.
Fixes the test case in PR19621.
llvm-svn: 207939
There is no point in creating it if we're not going to vectorize
anything. Creating the map is expensive as it creates large values.
No functionality change.
llvm-svn: 207916
There are public functions that mutate various members as well as
another private member already, so make all the members private to
avoid the discontinuity and add accessors for the values. Should
be no functional change.
llvm-svn: 207868
=[
Turns out that this was the root cause of PR19621. We found a crasher
only recently (likely due to improvements elsewhere in the SLP
vectorizer) but the reduced test case failed all the way back to here.
I've confirmed that reverting this patch both fixes the reduced test
case in PR19621 and the actual source file that led to it, so it seems
to really be rooted here. I've replied to the commit thread with
discussion of my (feeble) attempts to debug this. Didn't make it very
far, so reverting now that we have a good test case so that things can
get back to healthy while the debugging carries on.
llvm-svn: 207746
This patch changes the vectorization remarks to also inform when
vectorization is possible but not beneficial.
Added tests to exercise some loop remarks.
llvm-svn: 207574
Summary:
This calls emitOptimizationRemark from the loop unroller and vectorizer
at the point where they make a positive transformation. For the
vectorizer, it reports vectorization and interleave factors. For the
loop unroller, it reports all the different supported types of
unrolling.
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D3456
llvm-svn: 207528
Use -stats to see how many loops were analyzed for possible vectorization and how many of them were actually vectorized.
Patch by Zinovy Nis
Differential Revision: http://reviews.llvm.org/D3438
llvm-svn: 206956
This moves the DEBUG_TYPE definition below all of the header #include
lines, lib/Transforms/... edition.
This one is tricky for two reasons. We again have a couple of passes
that define something else before the includes as well. I've sunk their
name macros with the DEBUG_TYPE.
Also, InstCombine contains headers that need DEBUG_TYPE, so now those
headers #define and #undef DEBUG_TYPE around their code, leaving them
well formed modular headers. Fixing these headers was a large motivation
for all of these changes, as "leaky" macros of this form are hard on the
modules implementation.
llvm-svn: 206844
Some Intrinsics are overloaded to the extent that return type equality (all
that's been checked up to now) does not guarantee that the arguments are the
same. In these cases SLP vectorizer should not recurse into the operands, which
can be achieved by comparing them as "Function *" rather than simply the ID.
llvm-svn: 205424
For the purpose of calculating the cost of the loop at various vectorization
factors, we need to count dependencies of consecutive pointers as uniforms
(which means that the VF = 1 cost is used for all overall VF values).
For example, the TSVC benchmark function s173 has:
...
%3 = add nsw i64 %indvars.iv, 16000
%arrayidx8 = getelementptr inbounds %struct.GlobalData* @global_data, i64 0, i32 0, i64 %3
...
and we must realize that the add will be a scalar in order to correctly deduce
it to be profitable to vectorize this on PowerPC with VSX enabled. In fact, all
dependencies of a consecutive pointer must be a scalar (uniform), and so we
simply need to add all consecutive pointers to the worklist that currently
collects uniforms.
Fixes PR19296.
llvm-svn: 205387
This reverts commit r205018.
Conflicts:
lib/Transforms/Vectorize/SLPVectorizer.cpp
test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
This is breaking libclc build.
llvm-svn: 205260
Extracts coming from phis were being hoisted, while all others were
sunk to their uses. This was inconsistent and didn't seem to serve a
purpose. Changing all extracts to be sunk to uses is a prerequisite
for adding block frequency to the SLP vectorizer's cost model.
I benchmarked the change in isolation (without block frequency). I
only saw noise on x86 and some potentially significant improvements on
ARM. No major regressions, which is good enough for me.
llvm-svn: 204699
noise.
Original commit log:
Replace some dead code with an assert. When I first ported this pass
from a loop pass to a function pass I did so in the naive, recursive
way. It doesn't actually work, we need a worklist instead. When
I switched to the worklist I didn't delete the naive recursion. That
recursion was also buggy because it was dead and never really exercised.
llvm-svn: 204187
pass from a loop pass to a function pass I did so in the naive,
recursive way. It doesn't actually work, we need a worklist instead.
When I switched to the worklist I didn't delete the naive recursion.
That recursion was also buggy because it was dead and never really
exercised.
llvm-svn: 204184
This requires a number of steps.
1) Move value_use_iterator into the Value class as an implementation
detail
2) Change it to actually be a *Use* iterator rather than a *User*
iterator.
3) Add an adaptor which is a User iterator that always looks through the
Use to the User.
4) Wrap these in Value::use_iterator and Value::user_iterator typedefs.
5) Add the range adaptors as Value::uses() and Value::users().
6) Update *all* of the callers to correctly distinguish between whether
they wanted a use_iterator (and to explicitly dig out the User when
needed), or a user_iterator which makes the Use itself totally
opaque.
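Concretely, the two iteration styles now look like this (V is some
Value pointer; a sketch, not code from the patch):
  for (Use &U : V->uses()) {
    User *Usr = U.getUser(); // explicitly dig out the User when needed
    // ... inspect U itself, e.g. its operand number ...
  }
  for (User *Usr : V->users()) {
    // ... the Use itself stays opaque here ...
  }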
Because #6 requires churning essentially everything that walked the
Use-Def chains, I went ahead and added all of the range adaptors and
switched them to range-based loops where appropriate. Also because the
renaming requires at least churning every line of code, it didn't make
any sense to split these up into multiple commits -- all of which would
touch all of the same lines of code.
The result is still not quite optimal. The Value::use_iterator is a nice
regular iterator, but Value::user_iterator is an iterator over User*s
rather than over the User objects themselves. As a consequence, it fits
a bit awkwardly into the range-based world and it has the weird
extra-dereferencing 'operator->' that so many of our iterators have.
I think this could be fixed by providing something which transforms
a range of T&s into a range of T*s, but that *can* be separated into
another patch, and it isn't yet 100% clear whether this is the right
move.
However, this change gets us most of the benefit and cleans up
a substantial amount of code around Use and User. =]
llvm-svn: 203364
Move the test for this class into the IR unittests as well.
This uncovers that ValueMap too is in the IR library. Ironically, the
unittest for ValueMap is useless in the Support library (honestly, so
was the ValueHandle test) and so it already lives in the IR unittests.
Mmmm, tasty layering.
llvm-svn: 202821
I am really sorry for the noise, but the current state where some parts of the
code use TD (from the old name: TargetData) and other parts use DL makes it
hard to write a patch that changes where those variables come from and how
they are passed along.
llvm-svn: 201827
'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.
The cost model now allows returning 'OK_NonUniformConstValue'
for non splat operands that are instances of ConstantVector or
ConstantDataVector.
With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, on X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.
This patch applies the following changes:
- The cost model computation now takes into account non-uniform constants;
- The cost of vector shift instructions has been improved in
X86TargetTransformInfo analysis pass;
- BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
between non-uniform and uniform constant operands.
Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
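A sketch of the classification using the names above (the helper itself
is hypothetical, not the patch's code):
  TargetTransformInfo::OperandValueKind getOperandKind(Value *Op) {
    if (isa<ConstantInt>(Op))
      return TargetTransformInfo::OK_UniformConstantValue;
    if (isa<ConstantVector>(Op) || isa<ConstantDataVector>(Op)) {
      // A splat (all lanes equal) is still a uniform constant.
      if (cast<Constant>(Op)->getSplatValue())
        return TargetTransformInfo::OK_UniformConstantValue;
      return TargetTransformInfo::OK_NonUniformConstValue;
    }
    return TargetTransformInfo::OK_AnyValue;
  }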
llvm-svn: 201272
Before conditional store vectorization/unrolling we had only one
vectorized/unrolled basic block. After adding support for conditional store
vectorization this will not only be one block but multiple basic blocks. The
last block would have the back-edge. I updated the code to use a vector of basic
blocks instead of a single basic block and fixed the users to use the last entry
in this vector. But I forgot to add the basic blocks to this vector!
Fixes PR18724.
llvm-svn: 201028
Ideally only those transform passes that run at -O0 remain enabled,
in reality we get as close as we reasonably can.
Passes are responsible for disabling themselves, it's not the job of
the pass manager to do it for them.
llvm-svn: 200892
unrolling heuristic by default
Benchmarking on x86_64 (thanks Chandler!) and ARM has shown that those
options speed up some benchmarks while not causing any interesting
regressions.
llvm-svn: 200621
This reverts commit r200576. It broke 32-bit self-host builds by
vectorizing two calls to @llvm.bswap.i64, which we then fail to expand.
llvm-svn: 200602
transform accordingly. Based on similar code from Loop vectorization.
Subsequent commits will include vectorization of function calls to
vector intrinsics and will form calls to vector library functions.
Patch by Raul Silvera! (Much delayed due to my not running dcommit)
llvm-svn: 200576
loop vectorizer to not do so when runtime pointer checks are needed and
share code with the new (not yet enabled) load/store saturation runtime
unrolling. Also ensure that we only consider the runtime checks when the
loop hasn't already been vectorized. If it has, the runtime check cost
has already been paid.
I've fleshed out a test case to cover the scalar unrolling as well as
the vector unrolling, and commented clearly on why we are or aren't
following the pattern.
llvm-svn: 200530
When estimating register pressure, don't count the induction variable
multiple times. It is unlikely to be unrolled. This is currently disabled and hidden
behind a flag ("enable-ind-var-reg-heur").
llvm-svn: 200371
vectorizer, placing it behind an off-by-default flag.
It turns out that block frequency isn't what we want at all, here or
elsewhere. This has been I think a nagging feeling for several of us
working with it, but Arnold has given some really nice simple examples
where the results are so comprehensively wrong that they aren't useful.
I'm planning to email the dev list with a summary of why it's not really
useful and a couple of ideas about how to better structure these types
of heuristics.
llvm-svn: 200294
The vectorizer takes a loop like this and widens all instructions except for the
store. The stores are scalarized/unrolled and hidden behind an "if" block.
for (i = 0; i < 128; ++i) {
if (a[i] < 10)
a[i] += val;
}
which, in pseudo-code, is turned into:
for (i = 0; i < 128; i += 2) {
  v = a[i:i+1];
  v0 = (extract v, 0) + val;
  v1 = (extract v, 1) + val;
  if ((extract v, 0) < 10)
    a[i] = v0;
  if ((extract v, 1) < 10)
    a[i+1] = v1;
}
The vectorizer relies on subsequent optimizations to sink instructions into the
conditional block where they are anticipated.
The flag "vectorize-num-stores-pred" controls whether and how many stores to
handle this way. Vectorization of conditional stores is disabled by
default for now.
This patch also adds a change to the heuristic when the flag
"enable-loadstore-runtime-unroll" is enabled (off by default). It unrolls small
loops until load/store ports are saturated. This heuristic uses TTI's
getMaxUnrollFactor as a measure for load/store ports.
I also added a second flag -enable-cond-stores-vec. It will enable vectorization
of conditional stores. But there is no cost model for vectorization of
conditional stores in place yet, so this will not do much good at the moment.
rdar://15892953
Results for x86-64 -O3 -mavx +/- -mllvm -enable-loadstore-runtime-unroll
-vectorize-num-stores-pred=1 (before the BFI change):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.35% (maze3() is identical but 10% slower)
Applications/siod/siod 2.18%
Performance Improvements:
mesa -4.42%
libquantum -4.15%
With a patch that slightly changes the register heuristics (by subtracting the
induction variable on both sides of the register pressure equation, as the
induction variable is probably not really unrolled):
Performance Regressions:
Benchmarks/Ptrdist/yacr2/yacr2 7.73%
Applications/siod/siod 1.97%
Performance Improvements:
libquantum -13.05% (we now also unroll quantum_toffoli)
mesa -4.27%
llvm-svn: 200270
cold loops as if they were being optimized for size.
Nothing fancy here. Simple test case included. The nice thing is that we
can now incrementally build on top of this to drive other heuristics.
All of the infrastructure work is done to get the profile information
into this layer.
The remaining work necessary to make this a fully general purpose loop
unroller for very hot loops is to make it a fully general purpose loop
unroller. Things I know of but am not going to have time to benchmark
and fix in the immediate future:
1) Don't disable the entire pass when the target is lacking vector
registers. This really doesn't make any sense any more.
2) Teach the unroller at least and the vectorizer potentially to handle
non-if-converted loops. This is trivial for the unroller but hard for
the vectorizer.
3) Compute the relative hotness of the loop and thread that down to the
various places that make cost tradeoffs (very likely only the
unroller makes sense here, and then only when dealing with loops that
are small enough for unrolling to not completely blow out the LSD).
I'm still dubious how useful hotness information will be. So far, my
experiments show that if we can get the correct logic for determining
when unrolling actually helps performance, the code size impact is
completely unimportant and we can unroll in all cases. But at least
we'll no longer burn code size on cold code.
One somewhat unrelated idea that I've had forever but not had time to
implement: mark all functions which are only reachable via the global
constructors rigging in the module as optsize. This would also decrease
the impact of any more aggressive heuristics here on code size.
llvm-svn: 200219
object and fewer pointless variables.
Also, add a clarifying comment and a FIXME because the code which
disables *all* vectorization if we can't use implicit floating point
instructions just makes no sense at all.
llvm-svn: 200214
powers of two. This is essentially always the correct thing given the
impact on alignment, scaling factors that can be used in addressing
modes, etc. Also, fix the management of the unroll vs. small loop cost
to more accurately model things with this world.
Enhance a test case to actually exercise more of the unroll machinery
when using synthetic constants rather than a specific target model. Before
this change, with the added flags this test will unroll 3 times instead
of either 2 or 4 (the two sensible answers).
While I don't expect this to make a huge difference, if there are lots
of loops sitting right on the edge of hitting the 'small unroll' factor,
they might change behavior. However, I've benchmarked moving the small
loop cost up and down in many various ways and by a huge factor (2x)
without seeing more than 0.2% code size growth. Small adjustments such
as the series that led up here have led to about 1% improvement on some
benchmarks, but it is very close to the noise floor so I mostly checked
that nothing regressed. Let me know if you see bad behavior on other
targets but I don't expect this to be a sufficiently dramatic change to
trigger anything.
llvm-svn: 200213
with the unrolling behavior in the loop vectorizer. No functionality
changed at this point.
These are a bit hack-y, but talking with Hal, there doesn't seem to be
a cleaner way to easily experiment with different thresholds here and he
was also interested in them so I wanted to commit them. Suggestions for
improvement are very welcome here.
llvm-svn: 200212
number of vector registers rather than toggling between vector and
scalar register number based on VF. I don't have a test case as
I spotted this by inspection and on X86 it only makes a difference if
your target is lacking SSE and thus has *no* vector registers.
If someone wants to add a test case for this for ARM or somewhere else
where this is more significant, that would be awesome.
Also made the variable name a bit more sensible while I'm here.
llvm-svn: 200211
LoopVectorize pass.
The logic here doesn't make much sense. We *only* unrolled if the
unvectorized loop was a reduction loop with a single basic block *and*
small loop body. The reduction part in particular doesn't make much
sense. Instead, if we just fall through to the vectorized unroll logic,
it makes more sense to unroll if there is a vectorized reduction that
could be hacked on by the SLP vectorizer *or* if the loop is small.
This is mostly a cleanup and nothing in the test suite really exercises
this, but I did run benchmarks across this change and saw no really
significant changes.
llvm-svn: 200198
a FunctionPass. With this change the loop vectorizer no longer is a loop
pass and can readily depend on function analyses. In particular, with
this change we no longer have to form a loop pass manager to run the
loop vectorizer which simplifies the entire pass management of LLVM.
The next step here is to teach the loop vectorizer to leverage profile
information through the profile information providing analysis passes.
llvm-svn: 200074
a reduction.
Really. Under certain circumstances (the use list of an instruction has to be
set up right - hence the extra pass in the test case) we would not recognize
when a value in a potential reduction cycle was used multiple times by the
reduction cycle.
Fixes PR18526.
radar://15851149
llvm-svn: 199570
can be used by both the new pass manager and the old.
This removes it from any of the virtual mess of the pass interfaces and
lets it derive cleanly from the DominatorTreeBase<> template. In turn,
tons of boilerplate interface can be nuked and it turns into a very
straightforward extension of the base DominatorTree interface.
The old analysis pass is now a simple wrapper. The names and style of
this split should match the split between CallGraph and
CallGraphWrapperPass. All of the users of DominatorTree have been
updated to match using many of the same tricks as with CallGraph. The
goal is that the common type remains the resulting DominatorTree rather
than the pass. This will make subsequent work toward the new pass
manager significantly easier.
Also in numerous places things became cleaner because I switched from
re-running the pass (!!! midway through some other pass's run !!!) to
directly recomputing the domtree.
llvm-svn: 199104
directory. These passes are already defined in the IR library, and it
doesn't make any sense to have the headers in Analysis.
Long term, I think there is going to be a much better way to divide
these matters. The dominators code should be fully separated into the
abstract graph algorithm and have that put in Support where it becomes
obvious that even Clang's CFGBlocks can use it. Then the verifier can
manually construct dominance information from the Support-driven
interface while the Analysis library can provide a pass which caches,
reconstructs, and supports a nice update API.
But those are very long term, and so I don't want to leave the really
confusing structure until that day arrives.
llvm-svn: 199082
for (i = 0; i < N; ++i)
A[i * Stride1] += B[i * Stride2];
We take loops like this and check that the symbolic strides 'Stride1/2' are one
and drop to the scalar loop if they are not.
This is currently disabled by default and hidden behind the flag
'enable-mem-access-versioning'.
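Conceptually the versioned loop looks like this (same C-like notation
as above):
  if (Stride1 == 1 && Stride2 == 1) {
    // vectorized loop; both strides known to be one
  } else {
    for (i = 0; i < N; ++i)
      A[i * Stride1] += B[i * Stride2];
  }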
radar://13075509
llvm-svn: 198950
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
A phi node operand or an instruction operand could be a constant expression that
can trap (division). Check that we don't vectorize such cases.
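A minimal sketch of the check, assuming Constant::canTrap does the
per-expression work:
  for (Value *Op : I->operands())
    if (auto *C = dyn_cast<Constant>(Op))
      if (C->canTrap())
        return false; // refuse to vectorize a value that may trap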
PR16729
radar://15653590
llvm-svn: 197449
The intended behaviour is to force vectorization in the presence
of the flag (either turning it on or off), and to continue the behaviour
as expected in its absence. Tests were added to make sure that all
cases are covered in opt. No tests were added in other tools with
the assumption that they should use the PassManagerBuilder in the
same way.
This patch also removes the outdated -late-vectorize flag, which was
on by default and not helping much.
The pragma metadata is being attached to the same place as other loop
metadata, but nothing forbids one from attaching it to a function
(to enable #pragma optimize) or basic blocks (to hint the basic-block
vectorizers), etc. The logic should be the same all around.
Patches to Clang to produce the metadata will be produced after the
initial implementation is agreed upon and committed. Patches to other
vectorizers (such as SLP and BB) will be added once we're happy with
the pass manager changes.
llvm-svn: 196537
We use CSEBlocks to initialize a worklist:
SmallVector<BasicBlock *, 8> CSEWorkList(CSEBlocks.begin(), CSEBlocks.end());
so it must have a deterministic order.
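One way to get that order (a sketch; not necessarily the exact
container picked here) is a SetVector, which keeps set semantics but
iterates in insertion order:
  SetVector<BasicBlock *> CSEBlocks;
  // ... populate during vectorization ...
  SmallVector<BasicBlock *, 8> CSEWorkList(CSEBlocks.begin(), CSEBlocks.end());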
llvm-svn: 196520
We were creating external uses for scalar values in MustGather entries that
also had a ScalarToTreeEntry (they are also present in a vectorized tuple).
This meant we would keep a value 'alive' both as a scalar and as a
vectorized value, causing havoc.
This is not necessary because when we create a MustGather vector we explicitly
create external uses entries for the insertelement instructions of the
MustGather vector elements.
Fixes PR18129.
radar://15582184
llvm-svn: 196508
In signed arithmetic we could end up with an i64 trip count for an i32 phi.
Because it is signed arithmetic we know that this is only defined if the i32
does not wrap. It is therefore safe to truncate the i64 trip count to an
i32 value.
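A sketch of the safe form (Builder, Ctx and the trip count value are
illustrative):
  // TC64 is the i64 trip count, known not to wrap past i32.
  Value *TC32 = Builder.CreateTrunc(TC64, Type::getInt32Ty(Ctx));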
Fixes PR18049.
llvm-svn: 195787
we generate PHI nodes with multiple entries from the same basic block but
with different values. Enabling CSE on ExtractElement instructions makes
sure that all of the RAUWed instructions are the same.
llvm-svn: 195773
SLP vectorization. Based on the code in BBVectorizer.
Fixes PR17741.
Patch by Raul Silvera, reviewed by Hal and Nadav. Reformatted by my
driving of clang-format. =]
llvm-svn: 195528