The purpose of patch is to learn Loop idiom recognition pass how to recognize simple memmove patterns
in similar way like GCC: https://godbolt.org/z/fh95e83od
LoopIdiomRecognize already has machinery for memset and memcpy recognition, patch tries to extend exisiting capabilities with minimal effort.
Differential Revision: https://reviews.llvm.org/D104464
Nowadays LLVM does not assume that all loops are finite,
so if we want to produce a finite loop from a potentially-infinite one,
we must ensure that the original loop is known to be a finite one.
For this transform, it only matters for arithmetic right-shifts.
For them, either the function or the loop must be known to
be `mustprogress`, or the original value being shifted must be known
to be non-negative (because iff the sign bit was set,
it will never become zero, but will become `-1` in the "end").
It would be really good for alive2 to actually complain about this,
but it currently does not: https://github.com/AliveToolkit/alive2/issues/726
This adds support for the "count active bits" pattern, i.e.:
```
int countBits(unsigned val) {
int cnt = 0;
for( ; (val << cnt) != 0; ++cnt)
;
return cnt;
}
```
but a somewhat more general one:
```
int countBits(unsigned val, int start, int off) {
int cnt;
for (cnt = start; val << (cnt + off); cnt++)
;
return cnt;
}
```
alive2 is happy with all the tests there.
Note that, again, much like with the right-shift cases,
we don't require the `val != 0` guard.
This is the last pattern that was supported by
`detectShiftUntilZeroIdiom()`, which now becomes obsolete.
This adds support for the "count active bits" pattern, i.e.:
```
int countActiveBits(signed val) {
int cnt = 0;
for( ; (val >> cnt) != 0; ++cnt)
;
return cnt;
}
```
but a somewhat more general one:
```
int countActiveBits(signed val, int start, int off) {
int cnt;
for (cnt = start; val >> (cnt + off); cnt++)
;
return cnt;
}
```
This directly matches the existing 'logical right-shift until zero' idiom.
alive2 is happy with all the tests there.
Note that, again, much like with the original unsigned case,
we don't require the `val != 0` guard.
The old `detectShiftUntilZeroIdiom()` already supports this pattern,
the idea here is that the `val` must be positive (have at least one
leading zero), because otherwise the loop is non-terminating,
but since it is not `while(1)`, that would have been UB.
I think i've added exhaustive test coverage, and i have verified that alive2 is happy with all the tests,
so in principle i'm fine with landing this without review, but just in case..
This adds support for the "count active bits" pattern, i.e.:
```
int countActiveBits(unsigned val) {
int cnt = 0;
for( ; (val >> cnt) != 0; ++cnt)
;
return cnt;
}
```
but a somewhat more general one, since that is what i need:
```
int countActiveBits(unsigned val, int start, int off) {
int cnt;
for (cnt = start; val >> (cnt + off); cnt++)
;
return cnt;
}
```
I've followed in footstep of 'left-shift until bittest' idiom (D91038),
in the sense that iff the `ctlz` intrinsic is cheap, we'll transform,
regardless of all other factors.
This can have a shocking effect on certain benchmarks:
```
raw.pixls.us-unique/Olympus/XZ-1$ /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmp49_28zcm
2021-05-09T01:06:05+03:00
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (32 X 3600.24 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 5.26, 6.29, 3.49
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations CPUTime,s CPUTime/WallTime Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_mean 145 ms 145 ms 128 0.145319 0.999981 10.1568M 69.8949M 69.8936M 6.88159 6.88146 0.145322
p1319978.orf/threads:32/process_time/real_time_median 145 ms 145 ms 128 0.145317 0.999986 10.1568M 69.8941M 69.8931M 6.88151 6.88141 0.145319
p1319978.orf/threads:32/process_time/real_time_stddev 0.766 ms 0.766 ms 128 766.586u 15.1302u 0 354.167k 354.098k 0.0348699 0.0348631 766.469u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpwb9sw2x0
2021-05-09T01:06:24+03:00
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (32 X 3599.95 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 512 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 4.05, 5.95, 3.43
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations CPUTime,s CPUTime/WallTime Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_mean 99.8 ms 99.8 ms 128 0.0997758 0.999972 10.1568M 101.797M 101.794M 10.0225 10.0222 0.0997786
p1319978.orf/threads:32/process_time/real_time_median 99.7 ms 99.7 ms 128 0.0997165 0.999985 10.1568M 101.857M 101.854M 10.0284 10.0281 0.0997195
p1319978.orf/threads:32/process_time/real_time_stddev 0.224 ms 0.224 ms 128 224.166u 34.345u 0 226.81k 227.231k 0.0223309 0.0223723 224.586u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_pvalue 0.0000 0.0000 U Test, Repetitions: 128 vs 128
p1319978.orf/threads:32/process_time/real_time_mean -0.3134 -0.3134 145 100 145 100
p1319978.orf/threads:32/process_time/real_time_median -0.3138 -0.3138 145 100 145 100
p1319978.orf/threads:32/process_time/real_time_stddev -0.7073 -0.7078 1 0 1 0
```
Reviewed By: craig.topper, zhuhan0
Differential Revision: https://reviews.llvm.org/D102116
While `%x.curr` is always safe to compute, because `LoopBackedgeTakenCount`
will always be smaller than `bitwidth(X)`, i.e. we never get poison,
rewriting `%x.next` is more complicated, however, because `X << LoopTripCount`
will be poison iff `LoopTripCount == bitwidth(X)` (which will happen
iff `BitPos` is `bitwidth(x) - 1` and `X` is `1`).
So unless we know that isn't the case (as alive2 notes, we know it's safe
to do iff shift had no-wrap flags, or bitpos does not indicate signbit,
or we know that %x is never `1`), we'll need to emit an alternative,
safe IR, by either just shifting the `%x.curr`, or conditionally selecting
between the computed `%x.next` and `0`..
Former IR looks better so let's do that.
While there, ensure that we don't drop no-wrap flags from said shift.
In particular, add tests with no-wrap flags on shift,
a test where %x is not `1`, and ensure that tests where %bit
is a constant bitwidth-1, or is not a constant bitwidth-1
test both liveout values.
The current state of the transform is still not enough to support
my motivational pattern, because it has one more "induction variable".
I have delayed posting this patch, because originally even just rewriting
the loop as countable wasn't enough to nicely transform my motivational pattern,
because i expected that extra IV to be rewritten afterwards,
but it wasn't happening until i fixed that in D91800.
So, this patch allows the 'left-shift until bittest' loop idiom
as long as the inserted ops are cheap,
and lifts any and all extra use checks on the instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D92754
If the bitmask is for sign bit, instcombine would have canonicalized
the pattern into a proper sign bit check. Supporting that is still
simple, but requires a bit of a roundtrip - we first have to use
`decomposeBitTestICmp()`, and the rest again just works.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91726
The handing of the case where the mask is a constant is trivial,
if said constant is a power of two, the bit in question is log2(mask),
rest just works.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91725
The motivation here is the following inner loop in fp16/fp24 -> fp32 expander,
that runs as part of the floating-point DNG decompression in RawSpeed library:
cd380bb9a2/src/librawspeed/decompressors/DeflateDecompressor.cpp (L112-L115)
```
while (!(fp32_fraction & (1 << 23))) {
fp32_exponent -= 1;
fp32_fraction <<= 1;
}
```
(https://godbolt.org/z/r13YMh)
As one might notice, that loop is currently uncountable, and that whole code stays scalar.
Yet, it is rather trivial to make that loop countable:
https://godbolt.org/z/do8WMz
and we can prove that via alive2:
https://alive2.llvm.org/ce/z/7vQnji (ha nice, isn't it?)
... and that allow for the whole fp16->fp32 code to vectorize:
https://godbolt.org/z/7hYr13
Now, while i'd love to get there, i feel like i should take it in steps.
For now, this introduces support for the most basic case,
where the bit position is known as a variable,
and the loop *will* go away (has no live-outs other than the recurrence,
no extra instructions in the loop).
I have added sufficient (i believe) test coverage,
and alive2 is happy with those transforms.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D91038
The LCSSA pass makes use of a function insertDebugValuesForPHIs() to
propogate dbg.value() intrinsics to newly inserted PHI instructions. Faulty
behaviour occurs when the parent PHI of a newly inserted PHI is not the
most recent assignment to a source variable. insertDebugValuesForPHIs ends
up propagating a value that isn't the most recent assignemnt.
This change removes the call to insertDebugValuesForPHIs() from LCSSA,
preventing incorrect dbg.value intrinsics from being propagated.
Propagating variable locations between blocks will occur later, during
LiveDebugValues.
Differential Revision: https://reviews.llvm.org/D92576
This adds support for loops like
unsigned clz(unsigned x) {
unsigned w = sizeof (x) * CHAR_BIT;
while (x) {
w--;
x >>= 1;
}
return w;
}
and
unsigned clz(unsigned x) {
unsigned w = sizeof (x) * CHAR_BIT - 1;
while (x >>= 1) {
w--;
}
return w;
}
To support these we look for add x, -1 as well as add x, 1 that
we already matched. If the value was -1 we need to subtract from
the initial counter value instead of adding to it.
Fixes PR48404.
Differential Revision: https://reviews.llvm.org/D92745
Explicitly opt-out llvm/test/Transforms/Attributor.
Verified by flipping the default value of allow-unused-prefixes and
observing that none of the failures were under llvm/test/Transforms.
Differential Revision: https://reviews.llvm.org/D92404
This is D77454, except for stores. All the infrastructure work was done
for loads, so the remaining changes necessary are relatively small.
Differential Revision: https://reviews.llvm.org/D79968
For IR generated by a compiler, this is really simple: you just take the
datalayout from the beginning of the file, and apply it to all the IR
later in the file. For optimization testcases that don't care about the
datalayout, this is also really simple: we just use the default
datalayout.
The complexity here comes from the fact that some LLVM tools allow
overriding the datalayout: some tools have an explicit flag for this,
some tools will infer a datalayout based on the code generation target.
Supporting this properly required plumbing through a bunch of new
machinery: we want to allow overriding the datalayout after the
datalayout is parsed from the file, but before we use any information
from it. Therefore, IR/bitcode parsing now has a callback to allow tools
to compute the datalayout at the appropriate time.
Not sure if I covered all the LLVM tools that want to use the callback.
(clang? lli? Misc IR manipulation tools like llvm-link?). But this is at
least enough for all the LLVM regression tests, and IR without a
datalayout is not something frontends should generate.
This change had some sort of weird effects for certain CodeGen
regression tests: if the datalayout is overridden with a datalayout with
a different program or stack address space, we now parse IR based on the
overridden datalayout, instead of the one written in the file (or the
default one, if none is specified). This broke a few AVR tests, and one
AMDGPU test.
Outside the CodeGen tests I mentioned, the test changes are all just
fixing CHECK lines and moving around datalayout lines in weird places.
Differential Revision: https://reviews.llvm.org/D78403
This will allow us to use the datalayout to disambiguate other
constructs in IR, like load alignment. Split off from D78403.
Differential Revision: https://reviews.llvm.org/D78413
Second functional change following on from rL362687. Pass the
NoWrapFlags from the MulExpr to InsertBinop when we're generating a
shl or mul.
Differential Revision: https://reviews.llvm.org/D61934
llvm-svn: 363540
If the given SCEVExpr has no (un)signed flags attached to it, transfer
these to the resulting instruction or use them to find an existing
instruction.
Differential Revision: https://reviews.llvm.org/D61934
llvm-svn: 362687
As it's causing some bot failures (and per request from kbarton).
This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.
llvm-svn: 358546
Summary:
Existing LIR recognizes CTLZ where shifting input variable right until it is zero. (Shift-Until-Zero idiom)
This commit:
1. Augments Shift-Until-Zero idiom to recognize CTTZ where input variable is shifted left.
2. Prepare for BitScan idiom recognition.
Patch by Yuanfang Chen (tabloid.adroit)
Reviewers: craig.topper, evstupac
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55876
llvm-svn: 350074
This commit suppresses turning loops like this into "(bitwidth - ctlz(input))".
unsigned foo(unsigned input) {
unsigned num = 0;
do {
++num;
input >>= 1;
} while (input != 0);
return num;
}
The loop version returns a value of 1 for both an input of 0 and an input of 1. Converting to a naive ctlz does not preserve that.
Theoretically we could do better if we checked isKnownNonZero or we could insert a select to handle the divergence. But until we have motivating cases for that, this is the easiest solution.
llvm-svn: 336864
This loop executes one iteration without checking the input value. This produces a count of 1 for an input of 0 and 1. We are turning this into 32 - ctlz(n), but that returns 0 if n is 0.
llvm-svn: 336862
In the 'detectCTLZIdiom' function support for loops that use LSHR instruction instead of ASHR has been added.
This supports creating ctlz from the following code.
int lzcnt(int x) {
int count = 0;
while (x > 0) {
count++;
x = x >> 1;
}
return count;
}
Patch by Olga Moldovanova
Differential Revision: https://reviews.llvm.org/D48354
llvm-svn: 336509
When checking a select to see if it matches an abs, allow the true/false values
to be a sign-extension of the comparison value instead of requiring that they're
directly the comparison value, as all the comparison cares about is the sign of
the value.
This fixes a regression due to r333702, where we were no longer generating ctlz
due to isKnownNonNegative failing to match such a pattern.
Differential Revision: https://reviews.llvm.org/D47631
llvm-svn: 333927
Summary:
Loop idiom recognize tries to convert loops like
```
int foo(int x) {
int cnt = 0;
while (x) {
x >>= 1;
++cnt;
}
return cnt;
}
```
into calls to ctlz, but if x is initially negative this loop should be infinite.
It happens that the cases that motivated this change have an absolute value of x before the loop. So this patch restricts the transform to cases where we know x is positive. Note: We are relying on the absolute value of INT_MIN to be undefined so we can assume that the result is always positive.
Fixes PR37479
Reviewers: spatel, hfinkel, efriedma, javed.absar
Reviewed By: efriedma
Subscribers: dmgreen, llvm-commits
Differential Revision: https://reviews.llvm.org/D47348
llvm-svn: 333702
We currently recognize this idiom where x is signed and thus the shift in an ashr.
int cnt = 0;
while (x) {
x >>= 1; // arithmetic shift right
++cnt;
}
and turn it into (bitwidth - ctlz(x)). And if there is anything else in the loop we will create a new loop that runs that many times.
If x is initially negative, the shift result will never be 0 and thus the loop is infinite. If you put something with side effects in the loop, that side effect will now only happen bitwidth times instead of an infinite number of times.
So this transform is only safe for logical shift right (which we don't currently recognize) or if we can prove that x cannot be negative before the loop.
llvm-svn: 331493
The code fails to check that the same value is used twice. We only make sure the left hand side of the and is part of the loop recurrence. The 'x' in the subtract can be any value.
llvm-svn: 331436
Summary:
Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html
This change is to alter the prototype for the atomic memcpy intrinsic. The prototype itself is being changed to more closely resemble the semantics and parameters of the llvm.memcpy intrinsic -- to ease later combination of the llvm.memcpy and atomic memcpy intrinsics. Furthermore, the name of the atomic memcpy intrinsic is being changed to make it clear that it is not a generic atomic memcpy, but specifically a memcpy is unordered atomic.
Reviewers: reames, sanjoy, efriedma
Reviewed By: reames
Subscribers: mzolotukhin, anna, llvm-commits, skatkov
Differential Revision: https://reviews.llvm.org/D33240
llvm-svn: 305558
Patch https://reviews.llvm.org/rL304806 was causing failures in Aarch64
and multiple other targets since the test should be run on X86 only.
Specifying the target triple is not enough. Moving the testcase to the
X86 target directory in LoopIdiom.
llvm-svn: 304809