Checking whether a number has a certain number of trailing / leading
zeros means checking whether it is of the form XXXX1000 / 0001XXXX,
which can be done with an and+icmp.
Related to https://bugs.llvm.org/show_bug.cgi?id=28668. As a next
step, this can be extended to non-equality predicates.
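For example, "has exactly 3 trailing zeros" for an i32 value %x (i.e. the low
bits look like ...1000) can be checked with a single and+icmp; the constants
here are worked out by hand for illustration and may not match the exact output:
%masked = and i32 %x, 15
%cmp = icmp eq i32 %masked, 8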
Differential Revision: https://reviews.llvm.org/D55745
llvm-svn: 349530
As the FIXME indicates, this has the potential to go
overboard. So I'm not sure if it's even worth keeping
this vs. iteratively doing simple matches, but we might
as well clean it up.
llvm-svn: 349523
The problem is shown specifically for a case with vector multiply here:
https://bugs.llvm.org/show_bug.cgi?id=40032
...and this might mask the original backend bug for ARM shown in:
https://bugs.llvm.org/show_bug.cgi?id=39967
As the test diffs here show, we weren't (and probably still aren't) doing
these kinds of transforms in a principled way. In some cases we produce as
many or more wide instructions than we started with, so we still need to
restrict/correct other transforms from overstepping.
If there are perf regressions from this change, we can either carve out
exceptions to the general IR rules, or improve the backend to do these
transforms when we know the transform is profitable. That's probably
similar to a change like D55448.
Differential Revision: https://reviews.llvm.org/D55744
llvm-svn: 349389
This fixes https://bugs.llvm.org/show_bug.cgi?id=39908.
The evaluateGEPOffsetExpression() function simplifies GEP offsets for
use in comparisons against zero, basically by converting X*Scale+Offset==0
to X+Offset/Scale==0 if Scale divides Offset. However, before this is done,
Offset is masked down to the pointer size. This gives incorrect results for
negative Offsets, because we basically end up dividing the 32-bit offset
*zero* extended to 64 bits (rather than sign extended).
Fix this by explicitly sign extending the truncated value.
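For a concrete illustration (numbers chosen here for exposition, not taken from
the PR): with Scale = 4 and Offset = -4 the intended result is X + (-1) == 0.
Masking -4 down to 32 bits and zero extending it back gives 4294967292, which 4
still divides evenly, so we would instead produce the bogus X + 1073741823 == 0.
Sign extending the truncated value preserves the -4 and yields the correct -1.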
Differential Revision: https://reviews.llvm.org/D55449
llvm-svn: 348987
call iM movmsk(sext <N x i1> X) --> zext (bitcast <N x i1> X to iN) to iM
This has the potential to create less-than-8-bit scalar types as shown in
some of the test diffs, but it looks like the backend knows how to deal
with that in these patterns. This is the simple part of the fix suggested in:
https://bugs.llvm.org/show_bug.cgi?id=39927
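For instance, with the SSE2 byte-mask flavor (one possible instance of the
general fold above; this sketch is illustrative, see the test diffs for the
real cases):
%s = sext <16 x i1> %b to <16 x i8>
%m = call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %s)
=>
%c = bitcast <16 x i1> %b to i16
%m = zext i16 %c to i32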
Differential Revision: https://reviews.llvm.org/D55529
llvm-svn: 348862
I was finally able to quantify what I thought was missing in the fix:
vector constants. If we have a scalar (and %x, -1),
it will be instsimplified before we reach this code,
but if it is a vector, we may still have a -1 element.
Thus, we want to avoid the fold if *at least one* element is -1.
Or in other words, ignoring the undef elements, no sign bits
should be set. Thus, m_NonNegative().
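A minimal illustration of the constant that now blocks the fold (the
surrounding pattern is elided; this is only about the mask):
%masked = and <2 x i8> %x, <i8 -1, i8 15>
The scalar form 'and %x, -1' would already have been instsimplified away, but
this vector survives with a -1 lane, so the mask is not non-negative and the
fold must be skipped.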
A follow-up for rL348181
https://bugs.llvm.org/show_bug.cgi?id=39861
llvm-svn: 348462
Extracting from a splat constant is always handled by InstSimplify.
Move the test for this from InstCombine to InstSimplify to make
sure that stays true.
llvm-svn: 348423
The tests here are based on the motivating cases from D54827.
More background:
1. We don't get these cases in general with SimplifyCFG because the root
of the pattern match is an icmp, not a branch. I'm not sure how often
we encounter this pattern vs. the seemingly more likely case with
branches, but I don't see evidence to leave the minimal pattern
unoptimized.
2. This has a chance of increasing compile-time because we're using a
ValueTracking call to handle the match. The motivating cases could be
handled with a simpler pair of calls to isImpliedTrueByMatchingCmp/
isImpliedFalseByMatchingCmp, but I saw that we have a more
comprehensive wrapper around those, so we might as well use it here
unless there's evidence that it's significantly slower.
3. Ideally, we'd handle the fold to constants in InstSimplify, but as
with the existing code here, we could extend this to handle cases
where the result is not a constant, but a new combined predicate.
That would mean splitting the logic across the 2 passes and possibly
duplicating the pattern-matching cost.
4. As mentioned in D54827, this seems like the kind of thing that should
be handled in Correlated Value Propagation, but that pass is currently
limited to dealing with instructions with constant operands, so extending
this bit of InstCombine is the smallest/easiest way to get these patterns
optimized.
llvm-svn: 348367
Move it out from under the constant check, reorder
predicates, add comments. This makes it easier to
extend to handle the non-constant case.
llvm-svn: 348284
There's a potential small enhancement to this code that could
solve the cases currently under proposal in D54827 via SimplifyCFG.
Whether instcombine should be doing this kind of semi-non-local
analysis in the first place is an open question, but separating
the logic out can only help if/when we decide to move it to a
different pass.
AFAICT, any proposal to do this in SimplifyCFG could also be seen
as an overreach + it would be incomplete to start the fold from a
branch rather than an icmp.
There's another question here about the code for processUGT_ADDCST_ADD().
That part may be completely dead after rL234638?
llvm-svn: 348273
When we have a shuffle that extends a source vector with undefs
and then do some binop on that, we must make sure that the extra
elements remain undef with that binop if we reverse the order of
the binop and shuffle.
'or' is probably the easiest example to show the bug because
'or C, undef --> -1' (not undef). But there are other
opcode/constant combinations where this is true as shown by
the 'shl' test.
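A sketch of the hazard with 'or' (types and constants are illustrative):
%widened = shufflevector <2 x i8> %x, <2 x i8> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%r = or <4 x i8> %widened, <i8 1, i8 1, i8 1, i8 1>
The padded lanes of %r are 'or C, undef', which folds to -1 rather than undef,
so hoisting the 'or' ahead of the shuffle must not leave those lanes as plain
undef.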
llvm-svn: 348191
There are potential improvements to the structure of this API
raised by D54994, but remove some cosmetic blemishes before
making any functional changes.
llvm-svn: 348149
Extend ssub.sat(X, C) -> sadd.sat(X, -C) canonicalization to also
support non-splat vector constants. This is done by generalizing
the implementation of the isNotMinSignedValue() helper to return
true for constants that are non-splat, but don't contain any
signed min elements.
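For example (an arbitrary non-splat constant with no signed-min element):
%r = call <2 x i8> @llvm.ssub.sat.v2i8(<2 x i8> %x, <2 x i8> <i8 1, i8 2>)
=>
%r = call <2 x i8> @llvm.sadd.sat.v2i8(<2 x i8> %x, <2 x i8> <i8 -1, i8 -2>)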
Differential Revision: https://reviews.llvm.org/D55011
llvm-svn: 348072
Also revert fix r347876
One of the buildbots was reporting a failure in some relevant tests that I can't
repro or explain at present, so reverting until I can isolate.
llvm-svn: 347911
This is an almost direct move of the functionality from InstCombine to
InstSimplify. There's no reason not to do this in InstSimplify because
we never create a new value with this transform.
(There's a question of whether any dominance-based transform belongs in
either of these passes, but that's a separate issue.)
I've changed 1 of the conditions for the fold (1 of the blocks for the
branch must be the block we started with) into an assert because I'm not
sure how that could ever be false.
We need 1 extra check to make sure that the instruction itself is in a
basic block because passes other than InstCombine may be using InstSimplify
as an analysis on values that are not wired up yet.
The 3-way compare changes show that InstCombine has some kind of
phase-ordering hole. Otherwise, we would have already gotten the intended
final result that we now show here.
llvm-svn: 347896
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (TODO:
re-enable and handle this correctly later).
There's an additional fix now to avoid dmask=0.
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
llvm-svn: 347871
Combine
sat(sat(X + C1) + C2) -> sat(X + (C1+C2))
and
sat(sat(X - C1) - C2) -> sat(X - (C1+C2))
if the sign of C1 and C2 matches.
In the unsigned case we can compute C1+C2 with saturating arithmetic,
and InstSimplify will reduce this just to the saturation value. For
the signed case, we cannot perform the simplification if the result
of the addition overflows.
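For example, for the unsigned flavor (constants chosen for illustration):
%t = call i8 @llvm.uadd.sat.i8(i8 %x, i8 10)
%r = call i8 @llvm.uadd.sat.i8(i8 %t, i8 20)
=>
%r = call i8 @llvm.uadd.sat.i8(i8 %x, i8 30)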
This change is part of https://reviews.llvm.org/D54534.
llvm-svn: 347773
Canonicalize ssub.sat(X, C) to sadd.sat(X, -C) if C is constant and
not signed minimum. This will help further optimizations to apply.
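For example (10 is arbitrary; any constant other than the signed minimum
qualifies):
%r = call i8 @llvm.ssub.sat.i8(i8 %x, i8 10)
=>
%r = call i8 @llvm.sadd.sat.i8(i8 %x, i8 -10)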
This change is part of https://reviews.llvm.org/D54534.
llvm-svn: 347772
If ValueTracking can determine that the add/sub can never overflow,
replace it with the corresponding nuw/nsw add/sub.
Additionally, for the unsigned case, if ValueTracking determines
that the add/sub always overflows, replace the result with the
saturation value.
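A sketch of both outcomes for the unsigned add flavor, assuming ValueTracking
proves the stated facts about %x and %y:
%r = call i8 @llvm.uadd.sat.i8(i8 %x, i8 %y)
; never overflows  ->  %r = add nuw i8 %x, %y
; always overflows ->  %r is replaced by i8 -1 (the saturation value)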
This change is part of https://reviews.llvm.org/D54534.
llvm-svn: 347770
If a saturating add intrinsic has one constant argument, make sure
it is on the RHS. This will simplify further transformations.
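For example:
%r = call i8 @llvm.uadd.sat.i8(i8 42, i8 %x)
=>
%r = call i8 @llvm.uadd.sat.i8(i8 %x, i8 42)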
This change is part of https://reviews.llvm.org/D54534.
llvm-svn: 347769
I tried to change this, not quite realising the logic behind what we
were doing. Hopefully this comment will help the next person to come
along.
llvm-svn: 347653
Support funnel shifts in InstCombine demanded bits simplification.
If the shift amount is constant, we can determine both the demanded
bits of the operands, as well as the known bits of the result.
If one of the operands has no demanded bits, it will be replaced
by undef and the funnel shift will be simplified into a simple shift
due to the simplifications added in D54778.
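A sketch of that consequence (the shift amount and the mask are illustrative):
%f = call i8 @llvm.fshl.i8(i8 %x, i8 %y, i8 3)
%r = and i8 %f, -8
Only bits [7:3] of %f are demanded and they come solely from %x, so %y has no
demanded bits; it becomes undef and the funnel shift reduces to 'shl i8 %x, 3'
per D54778.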
Differential Revision: https://reviews.llvm.org/D54869
llvm-svn: 347515
The following simplifications are implemented:
* `fshl(X, 0, C) -> shl X, C%BW`
* `fshl(X, undef, C) -> shl X, C%BW` (assuming undef = 0)
* `fshl(0, X, C) -> lshr X, BW-C%BW`
* `fshl(undef, X, C) -> lshr X, BW-C%BW` (assuming undef = 0)
* `fshr(X, 0, C) -> shl X, BW-C%BW`
* `fshr(X, undef, C) -> shl X, BW-C%BW` (assuming undef = 0)
* `fshr(0, X, C) -> lshr X, C%BW`
* `fshr(undef, X, C) -> lshr X, C%BW` (assuming undef = 0)
The simplification is only performed if the shift amount C is constant,
because we can explicitly compute C%BW and BW-C%BW in this case.
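Concretely, one instantiation of the rules above for i8 (BW = 8):
%a = call i8 @llvm.fshr.i8(i8 0, i8 %x, i8 3)    ; -> lshr i8 %x, 3
%b = call i8 @llvm.fshl.i8(i8 %x, i8 0, i8 11)   ; -> shl i8 %x, 3  (11 % 8 == 3)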
Differential Revision: https://reviews.llvm.org/D54778
llvm-svn: 347505
Add methods to BasicBlock which make it easier to efficiently check
whether a block has N (or more) predecessors.
This can be more efficient than using pred_size(), which is a linear
time operation.
We might consider adding similar methods for successors. I haven't done
so in this patch because succ_size() is already O(1).
With this patch applied, I measured a 0.065% compile-time reduction in
user time for running `opt -O3` on the sqlite3 amalgamation (30 trials).
The change in mergeStoreIntoSuccessor alone saves 45 million linked list
iterations in a stage2 Release build of llc.
See llvm.org/PR39702 for a harder but more general way of achieving
similar results.
Differential Revision: https://reviews.llvm.org/D54686
llvm-svn: 347256
Summary:
These asserts are based on the assumption that the order of true/false operands in a select and those in the compare would always be the same.
This fixes PR39595.
Reviewers: craig.topper, spatel, dmgreen
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D54359
llvm-svn: 346874
The shift amount of a funnel shift is modulo the scalar bitwidth:
http://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
...so we can use demanded bits analysis on that operand to simplify it
when we have a power-of-2 bitwidth.
This is another step towards canonicalizing {shift/shift/or} to the
intrinsics in IR.
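For instance, a redundant mask on the shift amount can now be removed (a
sketch; the mask value follows from the power-of-2 bitwidth):
%m = and i32 %sh, 31
%r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 %m)
=>
%r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 %sh)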
Differential Revision: https://reviews.llvm.org/D54478
llvm-svn: 346814
The cmp+branch variant of this pattern is shown in:
https://bugs.llvm.org/show_bug.cgi?id=34924
...and as discussed there, we probably can't transform
that without a rotate intrinsic. We do have that now
via funnel shift, but we're not quite ready to
canonicalize IR to that form yet. The case with 'select'
should already be transformed though, so that's this patch.
The sequence with negation followed by masking is what we
use in the backend and partly in clang (though that part
should be updated).
https://rise4fun.com/Alive/TplC
%cmp = icmp eq i32 %shamt, 0
%sub = sub i32 32, %shamt
%shr = lshr i32 %x, %shamt
%shl = shl i32 %x, %sub
%or = or i32 %shr, %shl
%r = select i1 %cmp, i32 %x, i32 %or
=>
%neg = sub i32 0, %shamt
%masked = and i32 %shamt, 31
%maskedneg = and i32 %neg, 31
%shl2 = lshr i32 %x, %masked
%shr2 = shl i32 %x, %maskedneg
%r = or i32 %shl2, %shr2
llvm-svn: 346807
This is a longer variant of the pattern handled in rL346713.
This one includes zexts.
Eventually, we should canonicalize all rotate patterns
to the funnel shift intrinsics, but we need a bit more
infrastructure to make sure the vectorizers handle those
intrinsics as well as the shift+logic ops.
https://rise4fun.com/Alive/FMn
Name: narrow rotateright
%neg = sub i8 0, %shamt
%rshamt = and i8 %shamt, 7
%rshamtconv = zext i8 %rshamt to i32
%lshamt = and i8 %neg, 7
%lshamtconv = zext i8 %lshamt to i32
%conv = zext i8 %x to i32
%shr = lshr i32 %conv, %rshamtconv
%shl = shl i32 %conv, %lshamtconv
%or = or i32 %shl, %shr
%r = trunc i32 %or to i8
=>
%maskedShAmt2 = and i8 %shamt, 7
%negShAmt2 = sub i8 0, %shamt
%maskedNegShAmt2 = and i8 %negShAmt2, 7
%shl2 = lshr i8 %x, %maskedShAmt2
%shr2 = shl i8 %x, %maskedNegShAmt2
%r = or i8 %shl2, %shr2
llvm-svn: 346716
The sub-pattern for the shift amount in a rotate can take on
several different forms, and there's apparently no way to
canonicalize those without seeing the entire rotate sequence.
This is the form noted in:
https://bugs.llvm.org/show_bug.cgi?id=39624
https://rise4fun.com/Alive/qnT
%zx = zext i8 %x to i32
%maskedShAmt = and i32 %shAmt, 7
%shl = shl i32 %zx, %maskedShAmt
%negShAmt = sub i32 0, %shAmt
%maskedNegShAmt = and i32 %negShAmt, 7
%shr = lshr i32 %zx, %maskedNegShAmt
%rot = or i32 %shl, %shr
%r = trunc i32 %rot to i8
=>
%truncShAmt = trunc i32 %shAmt to i8
%maskedShAmt2 = and i8 %truncShAmt, 7
%shl2 = shl i8 %x, %maskedShAmt2
%negShAmt2 = sub i8 0, %truncShAmt
%maskedNegShAmt2 = and i8 %negShAmt2, 7
%shr2 = lshr i8 %x, %maskedNegShAmt2
%r = or i8 %shl2, %shr2
llvm-svn: 346713
Noticed via inspection. This appears to be largely innocuous in practice, but a
slight code change could have resulted in either visit-order-dependent missed
optimizations or infinite loops. It may be a minor compile-time problem today.
llvm-svn: 346698
Summary:
When the 3rd argument to these intrinsics is zero, lowering them
to shift instructions produces poison values, since we end up with
shift amounts equal to the number of bits in the shifted value. This
means we can only lower these intrinsics if we can prove that the
3rd argument is not zero.
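For reference, the underlying hazard, independent of which intrinsic is being
lowered:
%r = lshr i32 %x, 32   ; shift amount equals the bit width, so %r is poison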
Reviewers: arsenm
Reviewed By: arsenm
Subscribers: bnieuwenhuizen, jvesely, wdng, nhaehnle, llvm-commits
Differential Revision: https://reviews.llvm.org/D53739
llvm-svn: 346422
By morphing the instruction rather than deleting and creating a new one,
we retain fast-math-flags and potentially other metadata (profile info?).
llvm-svn: 346331
The sibling fold for 'oge' --> 'ord' was already here,
but this half was missing.
The result of fabs() must be positive or nan, so asking
if the result is negative or nan is the same as asking
if the result is nan.
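A sketch of what the new half looks like, per the reasoning above (see the
tests for the exact predicates and types covered):
%f = call double @llvm.fabs.f64(double %x)
%r = fcmp ult double %f, 0.0
=>
%r = fcmp uno double %x, 0.0
Checking nan on %x or on the fabs result is equivalent here.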
This is another step towards fixing:
https://bugs.llvm.org/show_bug.cgi?id=39475
llvm-svn: 346321
As shown, this is used to eliminate redundant code in InstCombine,
and there are more cases where we should be using this pattern, but
we're currently unintentionally dropping flags.
llvm-svn: 346282