Also, add a scalar test to demonstrate one of the intermediate folds that
is necessary to accomplish the existing, multi-step test. And simplify
the vector tests to only check the final piece of that multi-step transform.
llvm-svn: 278995
I have audited all the callers of concatenate and none require duplicate
entries to service concatenation.
These duplicates serve no purpose but to needlessly embiggen the IR.
N.B. Layering getMostGenericAliasScope on top of concatenate makes it
O(n log n + m log m) instead of O(n * m).
llvm-svn: 278836
Hal pointed out that the semantics of our intrinsic and the libc
call are slightly different. Add a comment while I'm here to
explain why we can't emit an intrinsic. Thanks Hal!
llvm-svn: 278200
Summary:
Turn (select C, (sext A), B) into (sext (select C, A, B')) when A is i1 and
B is a compatible constant, also for zext instead of sext. This will then be
further folded into logical operations.
The transformation would be valid for non-i1 types as well, but other parts of
InstCombine prefer to have sext from non-i1 as an operand of select.
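For illustration, a minimal sketch (value names are hypothetical):
```
%sext = sext i1 %a to i32
%sel  = select i1 %c, i32 %sext, i32 0
```
can become
```
%sel.i1 = select i1 %c, i1 %a, i1 false
%sext   = sext i1 %sel.i1 to i32
```
since `sext i1 false to i32` is 0, i.e. the constant is compatible.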
Motivated by the shader compiler frontend in Mesa for AMDGPU, which emits i32
for boolean operations. With this change, the boolean logic is fully
recovered.
Reviewers: majnemer, spatel, tstellarAMD
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D22747
llvm-svn: 277801
Summary:
InstCombine unfolds expressions of the form `zext(or(icmp, icmp))` to `or(zext(icmp), zext(icmp))` so that, in a later iteration of InstCombine, the exposed `zext(icmp)` instructions can be optimized. We now combine this unfolding with the subsequent `zext(icmp)` optimization so that both are performed together. Since the unfolding no longer happens separately, we also re-enable the folding of `logic(cast(icmp), cast(icmp))` expressions to `cast(logic(icmp, icmp))`, which had been disabled due to its interference with the unfolding transformation.
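As a schematic illustration (names are hypothetical), the unfolding turns
```
%cmp1 = icmp eq i32 %a, 0
%cmp2 = icmp eq i32 %b, 0
%or   = or i1 %cmp1, %cmp2
%res  = zext i1 %or to i32
```
into
```
%cmp1 = icmp eq i32 %a, 0
%cmp2 = icmp eq i32 %b, 0
%z1   = zext i1 %cmp1 to i32
%z2   = zext i1 %cmp2 to i32
%res  = or i32 %z1, %z2
```
so that each exposed `zext(icmp)` can then be handled by `transformZExtICmp()`.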
Tested via `make check` and `lnt`.
Background
==========
To better understand how this change came about, we summarize its history here. In commit r275989 we had already tried to enable the folding of `logic(cast(icmp), cast(icmp))` to `cast(logic(icmp, icmp))`, which had to be reverted in r276106 because it could lead to an endless loop in InstCombine (also see http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160718/374347.html). The root of this problem is that `visitZExt()` in InstCombineCasts.cpp also contains a reverse of the above folding transformation, which unfolds `zext(or(icmp, icmp))` to `or(zext(icmp), zext(icmp))` in order to expose `zext(icmp)` operations that could then be eliminated by subsequent iterations of InstCombine. However, before these `zext(icmp)` were eliminated, the folding from r275989 could kick in and cause InstCombine to switch endlessly back and forth between the folding and the unfolding transformation. This is why we now combine the `zext` unfolding and the elimination of the exposed `zext(icmp)` into a single step, which lets us keep the cast folding of `logic(cast(icmp), cast(icmp))` without entering an endless loop again.
Details on the submitted changes
================================
- In `visitZExt()` we combine the unfolding and optimization of `zext` instructions.
- In `transformZExtICmp()` we have to use `Builder->CreateIntCast()` instead of `CastInst::CreateIntegerCast()` to make sure that the new `CastInst` is inserted into a `BasicBlock`. The new calls to `transformZExtICmp()` that we introduce in `visitZExt()` would otherwise trigger the corresponding assertions (in our case this happened, for example, with lnt for the MultiSource/Applications/sqlite3 and SingleSource/Regression/C++/EH/recursive-throw tests). The subsequent use of `replaceInstUsesWith()` is necessary to ensure that the new `CastInst` replaces the `ZExtInst` accordingly.
- In InstCombineAndOrXor.cpp we again allow the folding of casts on `icmp` instructions.
- The instruction order in the optimized IR for the zext-or-icmp.ll test case is different with the introduced changes.
- The test cases in zext.ll have been adopted from the reverted commits r275989 and r276105.
Reviewers: grosser, majnemer, spatel
Subscribers: eli.friedman, majnemer, llvm-commits
Differential Revision: https://reviews.llvm.org/D22864
Contributed-by: Matthias Reisinger <d412vv1n@gmail.com>
llvm-svn: 277635
This removes the restriction for the icmp constant, but as noted by the FIXME comments,
we still need to change individual checks for binop operand constants.
llvm-svn: 277629
An undef vector element can be treated as if it had any value. Folding
such a vector element to 0 in a bitcast can open up further folding
opportunities.
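For instance (a minimal sketch):
```
%b = bitcast <2 x i32> <i32 0, i32 undef> to i64
```
Treating the undef lane as 0 lets the whole bitcast fold to `i64 0`, which can enable further folds on the users of `%b`.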
llvm-svn: 277104
ConstantExpr::getWithOperands does much of the hard work that
ConstantFoldInstOperandsImpl tries to do but more completely.
This lets us fold ExtractValue/InsertValue expressions.
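For example (illustrative only), an instruction with constant operands such as
```
%x = extractvalue { i32, i32 } { i32 3, i32 4 }, 1
```
can now be constant folded, here to `i32 4`.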
llvm-svn: 277100
A ConstantVector can have ConstantExpr operands and vice versa.
However, the folder had no ability to fold ConstantVectors which, in
some cases, was an optimization barrier.
Instead, rephrase the folder in terms of Constants rather than
ConstantExprs and teach callers how to deal with failure.
llvm-svn: 277099
Summary:
Asan stack-use-after-scope check should poison the alloca even if there is
no access between start and end.
This is possible for code like this:
int *p;
for (int i = 0; i < 3; i++) {
  int x;
  p = &x;
}
"Loop Invariant Code Motion" will move "p = &x;" out of the loop, making
the start/end range empty.
PR27453
Reviewers: eugenis
Differential Revision: https://reviews.llvm.org/D22842
llvm-svn: 277072
Summary:
Asan stack-use-after-scope check should poison the alloca even if there is
no access between start and end.
This is possible for code like this:
int *p;
for (int i = 0; i < 3; i++) {
  int x;
  p = &x;
}
"Loop Invariant Code Motion" will move "p = &x;" out of the loop, making
the start/end range empty.
PR27453
Reviewers: eugenis
Differential Revision: https://reviews.llvm.org/D22842
llvm-svn: 277068
Summary:
The llvm.invariant.start and llvm.invariant.end intrinsics currently
support specifying invariant memory objects only in the default address
space.
With this change, these intrinsics are overloaded for any address space
for memory objects, and we can use these llvm invariant intrinsics in
non-default address spaces.
Example: llvm.invariant.start.p1i8(i64 4, i8 addrspace(1)* %ptr)
This overloaded intrinsic is needed for representing final or invariant
memory in managed languages.
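A corresponding declaration and use might look like this (a sketch; the exact attributes may differ):
```
declare {}* @llvm.invariant.start.p1i8(i64, i8 addrspace(1)* nocapture)

%inv = call {}* @llvm.invariant.start.p1i8(i64 4, i8 addrspace(1)* %ptr)
```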
Reviewers: apilipenko, reames
Subscribers: llvm-commits
llvm-svn: 276447
Just because we can constant fold the result of an instruction does not
imply that we can delete the instruction. It may have side effects.
This fixes PR28655.
llvm-svn: 276389
Summary:
The llvm.invariant.start and llvm.invariant.end intrinsics currently
support specifying invariant memory objects only in the default address space.
With this change, these intrinsics are overloaded for any address space for memory objects
and we can use these llvm invariant intrinsics in non-default address spaces.
Example: llvm.invariant.start.p1i8(i64 4, i8 addrspace(1)* %ptr)
This overloaded intrinsic is needed for representing final or invariant memory in managed languages.
Reviewers: tstellarAMD, reames, apilipenko
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D22519
llvm-svn: 276316
Summary: In r275989 we enabled the folding of `logic(cast(icmp), cast(icmp))` to `cast(logic(icmp, icmp))`. Here we add more test cases to assure this folding works for all logical operations `and`/`or`/`xor`.
Reviewers: grosser
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D22561
Contributed-by: Matthias Reisinger
llvm-svn: 276105
The pattern may look more obviously like a sext if written as:
define i32 @g(i16 %x) {
  %zext = zext i16 %x to i32
  %xor = xor i32 %zext, 32768
  %add = add i32 %xor, -32768
  ret i32 %add
}
We already have that fold in visitAdd().
Differential Revision: https://reviews.llvm.org/D22477
llvm-svn: 276035
Summary:
Currently, InstCombine is already able to fold expressions of the form `logic(cast(A), cast(B))` to the simpler form `cast(logic(A, B))`, where logic designates one of `and`/`or`/`xor`. This transformation is implemented in `foldCastedBitwiseLogic()` in InstCombineAndOrXor.cpp. However, this optimization will not be performed if both `A` and `B` are `icmp` instructions. The decision to preclude casts of `icmp` instructions originates in r48715 in combination with r261707, and can be best understood by the title of the former one:
> Transform (zext (or (icmp), (icmp))) to (or (zext (cimp), (zext icmp))) if at least one of the (zext icmp) can be transformed to eliminate an icmp.
Apparently, it introduced a transformation that is a reverse of the transformation that is done in `foldCastedBitwiseLogic()`. Its purpose is to expose pairs of `zext icmp` that would subsequently be optimized by `transformZExtICmp()` in InstCombineCasts.cpp. Therefore, in order to avoid an endless loop of switching back and forth between these two transformations, the one in `foldCastedBitwiseLogic()` has been restricted to exclude `icmp` instructions which is mirrored in the responsible check:
`if ((!isa<ICmpInst>(Cast0Src) || !isa<ICmpInst>(Cast1Src)) && ...`
This check seems to sort out more cases than necessary because:
- the reverse transformation is obviously done for `or` instructions only
- and also not every `zext icmp` pair is necessarily the result of this reverse transformation
Therefore we now remove this check and replace it with a more fine-grained one in `shouldOptimizeCast()` that rejects only those `logic(zext(icmp), zext(icmp))` expressions that could be optimized by `transformZExtICmp()`, which also avoids the mentioned endless loop. That means we are now also able to simplify expressions of the form `logic(cast(icmp), cast(icmp))` to `cast(logic(icmp, icmp))` (`cast` being an arbitrary `CastInst`).
As an example, consider the following IR snippet
```
%1 = icmp sgt i64 %a, %b
%2 = zext i1 %1 to i8
%3 = icmp slt i64 %a, %c
%4 = zext i1 %3 to i8
%5 = and i8 %2, %4
```
which would now be transformed to
```
%1 = icmp sgt i64 %a, %b
%2 = icmp slt i64 %a, %c
%3 = and i1 %1, %2
%4 = zext i1 %3 to i8
```
This issue became apparent when experimenting with the programming language Julia, which makes use of LLVM. Currently, Julia lowers its `Bool` datatype to LLVM's `i8` (also see https://github.com/JuliaLang/julia/pull/17225). In fact, the above IR example is the lowered form of the Julia snippet `(a > b) & (a < c)`. Like shown above, this may introduce `zext` operations, casting between `i1` and `i8`, which could for example hinder ScalarEvolution and Polly on certain code.
Reviewers: grosser, vtjnash, majnemer
Subscribers: majnemer, llvm-commits
Differential Revision: https://reviews.llvm.org/D22511
Contributed-by: Matthias Reisinger
llvm-svn: 275989
This is a partial implementation of a general fold for associative+commutative operators:
(op (cast (op X, C2)), C1) --> (cast (op X, op (C1, C2)))
(op (cast (op X, C2)), C1) --> (op (cast X), op (C1, C2))
There are 7 associative operators and 13 cast types, so this could potentially go a lot further.
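As an illustration of the shape of the fold (a sketch; not necessarily one of the cases covered by this partial implementation), using `and` and `trunc`, where the operation distributes over the cast:
```
%i = and i32 %x, 7
%t = trunc i32 %i to i8
%r = and i8 %t, 42
```
-->
```
%t = trunc i32 %x to i8
%r = and i8 %t, 2        ; 7 & 42 == 2
```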
Differential Revision: https://reviews.llvm.org/D22421
llvm-svn: 275684
We were able to fold masked loads with an all-ones mask to a normal
load. However, we couldn't turn a masked load with a mask with mixed
ones and undefs into a normal load.
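For instance (a sketch; the intrinsic mangling is shown in the overloaded pointer-type form):
```
%v = call <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>* %p, i32 8,
                                                    <2 x i1> <i1 true, i1 undef>,
                                                    <2 x i64> undef)
```
The undef lane may be treated as true, so this can become a plain `load <2 x i64>, <2 x i64>* %p, align 8`.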
llvm-svn: 275380
In D21740, we discussed trying to make this a more general matcher. However, I didn't see a clean
way to handle the regular m_Not cases and these non-splat vector patterns, so I've opted for the
direct approach here. If there are other potential uses of areInverseVectorBitmasks(), we could
move that helper function to a higher level.
There is an open question as to which of these forms should be considered the canonical IR:
%sel = select <4 x i1> <i1 true, i1 false, i1 false, i1 true>, <4 x i32> %a, <4 x i32> %b
%shuf = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
Differential Revision: http://reviews.llvm.org/D22114
llvm-svn: 275289
If a function is known to return one of its arguments, we can use that in order
to compute known bits of the return value.
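A sketch of the idea (function and value names are hypothetical):
```
declare i32 @passthrough(i32 returned %x)

%masked = and i32 %v, 1
%ret    = call i32 @passthrough(i32 %masked)
; %ret must equal %masked, so all but the lowest bit of %ret are known zero
```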
Differential Revision: http://reviews.llvm.org/D9397
llvm-svn: 275036
This isn't a sure thing (are 2 extra bitcasts less expensive than a logic op?),
but we'll try to err on the conservative side by going with the case that has
fewer IR instructions.
Note: This question came up in http://reviews.llvm.org/D22114 , but this part is
independent of that patch proposal, so I'm making this small change ahead of that
one.
See also:
http://reviews.llvm.org/rL274926
llvm-svn: 274932
We can fold truncs whose operand feeds from a load, if the trunc value
is available through a prior load/store.
This change follows up on http://reviews.llvm.org/D21246, which folded the
trunc but missed the bitcast or ptrtoint/inttoptr required in the RAUW
call when the load type didn't match the prior load/store type.
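A sketch of the kind of case involved (little-endian; names are hypothetical, and the wide load is assumed to have been introduced safely, e.g. by GVN load widening):
```
%pf = bitcast i64* %p to float*
store float %f, float* %pf
%wide = load i64, i64* %p
%low  = trunc i64 %wide to i32
```
The available value `%f` is a float, so replacing `%low` requires inserting a `bitcast float %f to i32` for the RAUW call.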
Differential Revision: http://reviews.llvm.org/D21791
llvm-svn: 274853
By replacing dyn_cast of ConstantInt with m_Zero/m_One/m_AllOnes, we
allow these transforms for splat vectors.
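The individual transforms are not repeated here; these lines merely show the splat forms that the matchers now accept (illustrative):
```
%z   = sub <2 x i32> zeroinitializer, %x                     ; zeroinitializer matches m_Zero()
%not = xor <4 x i32> %y, <i32 -1, i32 -1, i32 -1, i32 -1>    ; splat of -1 matches m_AllOnes()
```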
Differential Revision: http://reviews.llvm.org/D21899
llvm-svn: 274696
Summary: This adds tests for covering all cases that FoldAndOfFCmps and FoldOrOfFCmps handle.
Reviewers: spatel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D21844
llvm-svn: 274144
This is breaking an optimization remark test in clang. I've identified a couple of fixes for that, but want to understand it better before I commit to anything.
llvm-svn: 274102
If an operation for a recurrence is an addition with no signed wrap and both input sign bits are 0, then the result sign bit must also be 0. The same holds for the negative case.
I found this deficiency while playing around with a loop in the x86 backend that contained a signed division that could be optimized into an unsigned division if we could prove both inputs were positive, one of them being the loop induction variable. With this patch we can perform the conversion for this case. One of the test cases here is a contrived variation of the loop I was looking at.
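A sketch of the pattern (hypothetical IR):
```
loop:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
  %q  = sdiv i32 %iv, %n              ; %n known to be positive
  %iv.next = add nsw i32 %iv, 2
  br i1 %cond, label %loop, label %exit
```
The start value (0) and the step (2) both have a clear sign bit and the add is `nsw`, so `%iv` is known non-negative and the `sdiv` can become a `udiv`.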
Differential revision: http://reviews.llvm.org/D21493
llvm-svn: 274098
Revert "[InstCombine] Combine A->B->A BitCast"
as this appears to cause PR27996 and as discussed in http://reviews.llvm.org/D20847
This reverts commits r270135 and r263734.
llvm-svn: 274094
This is a resubmission of the r263158 change after fixing the existing problem with intrinsics mangling (see the "LTO and intrinsics mangling" llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
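With the pointer type added as an overloaded type, the mangled names carry the address space, e.g. (a sketch):
```
declare <4 x float> @llvm.masked.load.v4f32.p1v4f32(<4 x float> addrspace(1)*, i32, <4 x i1>, <4 x float>)
```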
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 274043
This is a resubmission of the r263158 change after fixing the existing problem with intrinsics mangling (see the "LTO and intrinsics mangling" llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 273892
By putting all the possible commutations together, we simplify the code.
Note that this is NFCI, but I'm adding tests that actually exercise each
commutation pattern because we don't have this anywhere else.
llvm-svn: 273702
Summary:
This InstCombine rule folds away trunc operations whose value is available from a prior load or store.
This kind of code can be generated as a result of GVN widening the load or from source code as well.
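A sketch of the basic pattern (little-endian; names are hypothetical, and the wide load is assumed to come from GVN load widening):
```
store i16 %v, i16* %p
%pw   = bitcast i16* %p to i32*
%wide = load i32, i32* %pw
%low  = trunc i32 %wide to i16
```
Here `%low` is exactly the previously stored `%v`, so the trunc can be replaced by `%v`.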
Reviewers: reames, majnemer, sanjoy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D21246
llvm-svn: 273608
By moving this transform to InstSimplify from InstCombine, we sidestep the problem/question
raised by PR27869:
https://llvm.org/bugs/show_bug.cgi?id=27869
...where InstCombine turns an icmp+zext into a shift causing us to miss the fold.
Credit to David Majnemer for a draft patch of the changes to InstructionSimplify.cpp.
Differential Revision: http://reviews.llvm.org/D21512
llvm-svn: 273200
The motivating example for this transform is similar to D20774 where bitcasts interfere
with a single cmp/select sequence, but in this case we have 2 uses of each bitcast to
produce min and max ops:
define void @minmax_bc_store(<4 x float> %a, <4 x float> %b, <4 x float>* %ptr1, <4 x float>* %ptr2) {
  %cmp = fcmp olt <4 x float> %a, %b
  %bc1 = bitcast <4 x float> %a to <4 x i32>
  %bc2 = bitcast <4 x float> %b to <4 x i32>
  %sel1 = select <4 x i1> %cmp, <4 x i32> %bc1, <4 x i32> %bc2
  %sel2 = select <4 x i1> %cmp, <4 x i32> %bc2, <4 x i32> %bc1
  %bc3 = bitcast <4 x float>* %ptr1 to <4 x i32>*
  store <4 x i32> %sel1, <4 x i32>* %bc3
  %bc4 = bitcast <4 x float>* %ptr2 to <4 x i32>*
  store <4 x i32> %sel2, <4 x i32>* %bc4
  ret void
}
With this patch, we move the selects up to use the input args which allows getting rid of
all of the bitcasts:
define void @minmax_bc_store(<4 x float> %a, <4 x float> %b, <4 x float>* %ptr1, <4 x float>* %ptr2) {
  %cmp = fcmp olt <4 x float> %a, %b
  %sel1.v = select <4 x i1> %cmp, <4 x float> %a, <4 x float> %b
  %sel2.v = select <4 x i1> %cmp, <4 x float> %b, <4 x float> %a
  store <4 x float> %sel1.v, <4 x float>* %ptr1, align 16
  store <4 x float> %sel2.v, <4 x float>* %ptr2, align 16
  ret void
}
The asm for x86 SSE then improves from:
movaps %xmm0, %xmm2
cmpltps %xmm1, %xmm2
movaps %xmm2, %xmm3
andnps %xmm1, %xmm3
movaps %xmm2, %xmm4
andnps %xmm0, %xmm4
andps %xmm2, %xmm0
orps %xmm3, %xmm0
andps %xmm1, %xmm2
orps %xmm4, %xmm2
movaps %xmm0, (%rdi)
movaps %xmm2, (%rsi)
To:
movaps %xmm0, %xmm2
minps %xmm1, %xmm2
maxps %xmm0, %xmm1
movaps %xmm2, (%rdi)
movaps %xmm1, (%rsi)
The TODO comments show that we're limiting this transform only to vectors and only to bitcasts
because we need to improve other transforms or risk creating worse codegen.
Differential Revision: http://reviews.llvm.org/D21190
llvm-svn: 273011
We would fail to validate the type of the tan function, which would cause
downstream users of isValidProtoForLibFunc to assert.
This fixes PR28143.
llvm-svn: 272802
Nearly all the changes to this pass have been done while maintaining and
updating other parts of LLVM. LLVM has had another pass, SROA, which
has superseded ScalarReplAggregates for quite some time.
Differential Revision: http://reviews.llvm.org/D21316
llvm-svn: 272737
Unlike native shifts, the AVX2 per-element shift instructions VPSRAV/VPSRLV/VPSLLV handle out of range shift values (logical shifts set the result to zero, arithmetic shifts splat the sign bit).
If the shift amount is constant we can sometimes convert these instructions to native shifts:
1 - if all shift amounts are in range then the conversion is trivial.
2 - out of range arithmetic shifts can be clamped to (bitwidth - 1), a legal shift amount, before conversion.
3 - logical shifts just return zero if all elements have out of range shift amounts.
In addition, UNDEF shift amounts are handled - either as an UNDEF shift amount in a native shift, or as an UNDEF in the 'all out of range' zero-constant special case for logical shifts.
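For example, case 1 above (a sketch):
```
%r = call <4 x i32> @llvm.x86.avx2.psllv.d(<4 x i32> %x, <4 x i32> <i32 0, i32 1, i32 2, i32 3>)
; all shift amounts are in range, so this can become:
; %r = shl <4 x i32> %x, <i32 0, i32 1, i32 2, i32 3>
```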
Differential Revision: http://reviews.llvm.org/D19675
llvm-svn: 271996
This patch adds support for folding undef/zero/constant inputs to MOVMSK instructions.
The SSE/AVX versions can be fully folded, but the MMX version can only handle undef inputs.
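For instance (a sketch):
```
%m = call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> zeroinitializer)
; a constant all-zero input lets the call fold to i32 0
```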
Differential Revision: http://reviews.llvm.org/D20998
llvm-svn: 271990
scalarizePHI only looked for phis that have exactly two uses - the "latch"
use, and an extract. Unfortunately, we cannot assume all equivalent extracts
are CSE'd, since InstCombine itself may create an extract which is a duplicate
of an existing one. This extends it to handle several distinct extracts from
the same index.
This should fix at least some of the performance regressions from PR27988.
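Schematically (hypothetical IR, not taken from the patch's tests), the phi may now feed more than one extract of the same index:
```
loop:
  %vphi = phi <4 x float> [ %init, %entry ], [ %vnext, %loop ]
  %e0   = extractelement <4 x float> %vphi, i32 0
  %e0b  = extractelement <4 x float> %vphi, i32 0     ; duplicate that InstCombine itself may create
  %f    = fadd float %e0, %e0b
  %vnext = insertelement <4 x float> %vphi, float %f, i32 0
  br i1 %c, label %loop, label %exit
```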
Differential Revision: http://reviews.llvm.org/D20983
llvm-svn: 271961
In r271810 ( http://reviews.llvm.org/rL271810 ), I loosened the check
above this to work for any Constant rather than ConstantInt. AFAICT,
that part makes sense if we can determine that the shrunken/extended
constant remained equal. But it doesn't make sense for this later
transform where we assume that the constant DID change.
This could assert for a ConstantExpr:
https://llvm.org/bugs/show_bug.cgi?id=28011
And it could be wrong for a vector as shown in the added regression test.
llvm-svn: 271908
Since FoldOpIntoPhi speculates the binary operation to potentially each
of the predecessors of the PHI node (pulling it out of arbitrary control
dependence in the process), we can FoldOpIntoPhi only if we know the
operation doesn't have UB.
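A schematic example of the hazard (not the exact PR27968 reproducer; names are hypothetical):
```
join:
  %divisor = phi i32 [ %d, %safe ], [ 0, %unsafe ]
  %is.zero = icmp eq i32 %divisor, 0
  br i1 %is.zero, label %exit, label %do.rem

do.rem:
  %rem = srem i32 %n, %divisor      ; guarded, never executes with a zero divisor
```
Speculating the `srem` into the predecessors of `%join` would create an `srem i32 %n, 0` on the `%unsafe` path, introducing UB the original program did not have.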
This also brings up an interesting profitability question -- the way it
is written today, commonIRemTransforms will hoist out work from
dynamically dead code into code that will execute at runtime. Perhaps
that isn't the best canonicalization?
Fixes PR27968.
llvm-svn: 271857
Add the MMX implementation to the SimplifyDemandedUseBits SSE/AVX MOVMSK support added in D19614
Requires a minor tweak as llvm.x86.mmx.pmovmskb takes an x86_mmx argument - so we have to be explicit about the implied v8i8 vector type.
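For illustration (a sketch):
```
%m = call i32 @llvm.x86.mmx.pmovmskb(x86_mmx %v)
; one result bit per implied v8i8 element, so only the low 8 bits of %m can be nonzero
```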
llvm-svn: 271789
There was concern that creating bitcasts for the simpler potential select pattern:
define <2 x i64> @vecBitcastOp1(<4 x i1> %cmp, <2 x i64> %a) {
  %a2 = add <2 x i64> %a, %a
  %sext = sext <4 x i1> %cmp to <4 x i32>
  %bc = bitcast <4 x i32> %sext to <2 x i64>
  %and = and <2 x i64> %a2, %bc
  ret <2 x i64> %and
}
might lead to worse code for some targets, so this patch is matching the larger
patterns seen in the test cases.
The motivating example for this patch is this IR produced via SSE intrinsics in C:
define <2 x i64> @gibson(<2 x i64> %a, <2 x i64> %b) {
  %t0 = bitcast <2 x i64> %a to <4 x i32>
  %t1 = bitcast <2 x i64> %b to <4 x i32>
  %cmp = icmp sgt <4 x i32> %t0, %t1
  %sext = sext <4 x i1> %cmp to <4 x i32>
  %t2 = bitcast <4 x i32> %sext to <2 x i64>
  %and = and <2 x i64> %t2, %a
  %neg = xor <4 x i32> %sext, <i32 -1, i32 -1, i32 -1, i32 -1>
  %neg2 = bitcast <4 x i32> %neg to <2 x i64>
  %and2 = and <2 x i64> %neg2, %b
  %or = or <2 x i64> %and, %and2
  ret <2 x i64> %or
}
For an AVX target, this is currently:
vpcmpgtd %xmm1, %xmm0, %xmm2
vpand %xmm0, %xmm2, %xmm0
vpandn %xmm1, %xmm2, %xmm1
vpor %xmm1, %xmm0, %xmm0
retq
With this patch, it becomes:
vpmaxsd %xmm1, %xmm0, %xmm0
Differential Revision: http://reviews.llvm.org/D20774
llvm-svn: 271676
The original tests were intended to show a missing transform that would
be solved by D20774:
http://reviews.llvm.org/D20774
But it's not clear that the transform for the simpler tests is a win for
all targets. Make the tests show a larger pattern that should be a win
regardless of the cost of bitcast instructions.
llvm-svn: 271603