As LAA is becoming a pass, we can no longer pass the params to its
constructor. This changes the command line flags to have external
storage. These can now be accessed both from LV and LAA.
VectorizerParams is moved out of LoopAccessInfo in order to shorten the
code to access it.
This commits also has the fix (D7731) to the break dependence cycle
between the analysis and vector libraries.
This is part of the patchset that converts LoopAccessAnalysis into an
actual analysis pass.
llvm-svn: 229890
This reverts commit r229651.
I'd like to ultimately revert r229650 but this reformat stands in the
way. I'll reformat the affected files once the the loop-access pass is
fully committed.
llvm-svn: 229889
Use long long for the epi64 argument, like the other intrinsics.
NFC since this is only defined in 64-bit mode, not in 32-bit.
Fix suggested by H. J. Lu!
llvm-svn: 229886
This is true in clang, and let's us remove the problematic code that
waits around for the original file and then times out if it doesn't get
created in short order. This caused any 'dead' lock file or legitimate
time out to cause a cascade of timeouts in any processes waiting on the
same lock (even if they only just showed up).
llvm-svn: 229881
Summary:
this also gets rid of a compiler warning in release builds by using a dynamically allocated
buffer. Therefore, a size assertion is not necessary (and probably should have been an error in
the first place).
Reviewers: tberghammer
Subscribers: lldb-commits
Differential Revision: http://reviews.llvm.org/D7751
llvm-svn: 229878
X86 load folding is fragile; eg, the tests here
don't work without AVX even though they should. This
is because we have a mix of tablegen patterns that have
been added over time, and we have a load folding table
used by the peephole optimizer that has to be kept in
sync with the ever-changing ISA and tablegen defs.
llvm-svn: 229870
systematic lowering of v8i16.
This required a slight strategy shift to prefer unpack lowerings in more
places. While this isn't a cut-and-dry win in every case, it is in the
overwhelming majority. There are only a few places where the old
lowering would probably be a touch faster, and then only by a small
margin.
In some cases, this is yet another significant improvement.
llvm-svn: 229859
addition to lowering to trees rooted in an unpack.
This saves shuffles and or registers in many various ways, lets us
handle another class of v4i32 shuffles pre SSE4.1 without domain
crosses, etc.
llvm-svn: 229856
terribly complex partial blend logic.
This code path was one of the more complex and bug prone when it first
went in and it hasn't faired much better. Ultimately, with the simpler
basis for unpack lowering and support bit-math blending, this is
completely obsolete. In the worst case without this we generate
different but equivalent instructions. However, in many cases we
generate much better code. This is especially true when blends or pshufb
is available.
This does expose one (minor) weakness of the unpack lowering that I'll
try to address.
In case you were wondering, this is actually a big part of what I've
been trying to pull off in the recent string of commits.
llvm-svn: 229853
needed, and significantly improve the SSSE3 path.
This makes the new strategy much more clear. If we can blend, we just go
with that. If we can't blend, we try to permute into an unpack so
that we handle cases where the unpack doing the blend also simplifies
the shuffle. If that fails and we've got SSSE3, we now call into
factored-out pshufb lowering code so that we leverage the fact that
pshufb can set up a blend for us while shuffling. This generates great
code, especially because we *know* we don't have a fast blend at this
point. Finally, we fall back on decomposing into permutes and blends
because we do at least have a bit-math-based blend if we need to use
that.
This pretty significantly improves some of the v8i16 code paths. We
never need to form pshufb for the single-input shuffles because we have
effective target-specific combines to form it there, but we were missing
its effectiveness in the blends.
llvm-svn: 229851
+ separate bug report for "Free alloca()" error to be able to customize checkers responsible for this error.
+ Muted "Free alloca()" error for NewDelete checker that is not responsible for c-allocated memory, turned on for unix.MismatchedDeallocator checker.
+ RefState for alloca() - to be able to detect usage of zero-allocated memory by upcoming ZeroAllocDereference checker.
+ AF_Alloca family to handle alloca() consistently - keep proper family in RefState, handle 'alloca' by getCheckIfTracked() facility, etc.
+ extra tests.
llvm-svn: 229850
them into permutes and a blend with the generic decomposition logic.
This works really well in almost every case and lets the code only
manage the expansion of a single input into two v8i16 vectors to perform
the actual shuffle. The blend-based merging is often much nicer than the
pack based merging that this replaces. The only place where it isn't we
end up blending between two packs when we could do a single pack. To
handle that case, just teach the v2i64 lowering to handle these blends
by digging out the operands.
With this we're down to only really random permutations that cause an
explosion of instructions.
llvm-svn: 229849
v16i8 shuffles, and replace it with new facilities.
This uses precise patterns to match exact unpacks, and the new
generalized unpack lowering only when we detect a case where we will
have to shuffle both inputs anyways and they terminate in exactly
a blend.
This fixes all of the blend horrors that I uncovered by always lowering
blends through the vector shuffle lowering. It also removes *sooooo*
much of the crazy instruction sequences required for v16i8 lowering
previously. Much cleaner now.
The only "meh" aspect is that we sometimes use pshufb+pshufb+unpck when
it would be marginally nicer to use pshufb+pshufb+por. However, the
difference there is *tiny*. In many cases its a win because we re-use
the pshufb mask. In others, we get to avoid the pshufb entirely. I've
left a FIXME, but I'm dubious we can really do better than this. I'm
actually pretty happy with this lowering now.
For SSE2 this exposes some horrors that were really already there. Those
will have to fixed by changing a different path through the v16i8
lowering.
llvm-svn: 229846
on things not being marked as either custom or legal, but we now do
custom lowering of more VSELECT nodes. To cope with this, manually
replicate the legality tests here. These have to stay in sync with the
set of tests used in the custom lowering of VSELECT.
Ideally, we wouldn't do any of this combine-based-legalization when we
have an actual custom legalization step for VSELECT, but I'm not going
to be able to rewrite all of that today.
I don't have a test case for this currently, but it was found when
compiling a number of the test-suite benchmarks. I'll try to reduce
a test case and add it.
This should at least fix the test-suite fallout on build bots.
llvm-svn: 229844
lowering paths. I'm going to be leveraging this to simplify a lot of the
overly complex lowering of v8 and v16 shuffles in pre-SSSE3 modes.
Sadly, this isn't profitable on v4i32 and v2i64. There, the float and
double blending instructions for pre-SSE4.1 are actually pretty good,
and we can't beat them with bit math. And once SSE4.1 comes around we
have direct blending support and this ceases to be relevant.
Also, some of the test cases look odd because the domain fixer
canonicalizes these to floating point domain. That's OK, it'll use the
integer domain when it matters and some day I may be able to update
enough of LLVM to canonicalize the other way.
This restores almost all of the regressions from teaching x86's vselect
lowering to always use vector shuffle lowering for blends. The remaining
problems are because the v16 lowering path is still doing crazy things.
I'll be re-arranging that strategy in more detail in subsequent commits
to finish recovering the performance here.
llvm-svn: 229836