This adds a loop rerolling pass: the opposite of (partial) loop unrolling. The
transformation aims to take loops like this:
for (int i = 0; i < 3200; i += 5) {
a[i] += alpha * b[i];
a[i + 1] += alpha * b[i + 1];
a[i + 2] += alpha * b[i + 2];
a[i + 3] += alpha * b[i + 3];
a[i + 4] += alpha * b[i + 4];
}
and turn them into this:
for (int i = 0; i < 3200; ++i) {
a[i] += alpha * b[i];
}
and loops like this:
for (int i = 0; i < 500; ++i) {
x[3*i] = foo(0);
x[3*i+1] = foo(0);
x[3*i+2] = foo(0);
}
and turn them into this:
for (int i = 0; i < 1500; ++i) {
x[i] = foo(0);
}
There are two motivations for this transformation:
1. Code-size reduction (especially relevant, obviously, when compiling for
code size).
2. Providing greater choice to the loop vectorizer (and generic unroller) to
choose the unrolling factor (and a better ability to vectorize). The loop
vectorizer can take vector lengths and register pressure into account when
choosing an unrolling factor, for example, and a pre-unrolled loop limits that
choice. This is especially problematic if the manual unrolling was optimized
for a machine different from the current target.
The current implementation is limited to single basic-block loops only. The
rerolling recognition should work regardless of how the loop iterations are
intermixed within the loop body (subject to dependency and side-effect
constraints), but the significant restriction is that the order of the
instructions in each iteration must be identical. This seems sufficient to
capture all current use cases.
This pass is not currently enabled by default at any optimization level.
llvm-svn: 194939
InstCombine, in visitFPTrunc, applies the following optimization to sqrt calls:
(fptrunc (sqrt (fpext x))) -> (sqrtf x)
but does not apply the same optimization to llvm.sqrt. This is a problem
because, to enable vectorization, Clang generates llvm.sqrt instead of sqrt in
fast-math mode, and because this optimization is being applied to sqrt and not
applied to llvm.sqrt, sometimes the fast-math code is slower.
This change makes InstCombine apply this optimization to llvm.sqrt as well.
This fixes the specific problem in PR17758, although the same underlying issue
(optimizations applied to libcalls are not applied to intrinsics) exists for
other optimizations in SimplifyLibCalls.
llvm-svn: 194935
clang -cc1 skips the driver so it never made sense to include these with the
Driver tests.
Basic type tests and flag tests generally both go in Frontend.
Now that the final -cc1 tests have been moved out of test/Driver, add a
local substitution to enforce and detect future mistakes.
These miscategorized tests were probably the source of confusion in r194817.
llvm-svn: 194919
(and same thing to Thread base class) which can be used when looking
at an ExtendedBacktrace thread; it will try to find the IndexID() of
the original thread that was executing this backtrace when it was
recorded. If lldb can't find a record of that thread, it will return
the same value as IndexID() for the ExtendedBacktrace thread.
llvm-svn: 194912
Teach the '-arch' command line option to enable the compiler-friendly
features of core-avx2 CPUs on Darwin. Pass the information along in the
target triple like Darwin+ARM does.
llvm-svn: 194907
The tests just hit this with a different sized
address space since I haven't figured out how
to use this to break it.
I thought I committed this a long time ago,
and I'm not sure why missing this hasn't caused
any problems.
llvm-svn: 194903
Earlier versions discarded the state too soon, and did not track state changes,
e.g. when passing a temporary to a move constructor. Patch by
chris.wailes@gmail.com; review and minor fixes by delesley.
llvm-svn: 194900
and update test cases accordingly.
This doesn't affect the output dumped using llvm-dwarfdump, but
readelf does now dump the debug_loc section.
llvm-svn: 194898