This patch ensures that SHL/SRL/SRA shifts for i8 and i16 vectors avoid scalarization. It builds on the existing vectorized i8 SHL implementation (moving the shift bits up to the sign bit position and separating the 4, 2 & 1 bit shifts), with several improvements:
1 - SSE41 targets can use (v)pblendvb directly with the sign bit instead of performing a comparison to feed into a VSELECT node.
2 - pre-SSE41 targets were masking + comparing with a 0x80 constant - we avoid this by using the fact that a set sign bit means a negative integer, which can be compared against zero to feed the VSELECT, avoiding the need for a constant mask (zero generation is much cheaper).
3 - SRA i8 needs to be unpacked to the upper byte of an i16 so that the i16 psraw instruction can be correctly used for sign extension - we have to do more work than for SHL/SRL, but perf tests indicate that this is still beneficial.
The i16 implementation is similar to but simpler than the i8 one - we have to do 8, 4, 2 & 1 bit shifts but less shift masking is involved. However, SSE41 use of (v)pblendvb requires that the i16 shift amount be splatted to both bytes.
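A minimal illustrative kernel (assumed shape, not taken from the patch) of the kind of code that benefits - per-element variable shifts on i8 data that previously had to be scalarized:
signed char A[1024], S[1024];
void sra_bytes(void) {
  for (int i = 0; i < 1024; ++i)
    A[i] = (signed char)(A[i] >> S[i]);   // SRA by a non-uniform, variable amount
}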
Tested on SSE2, SSE41 and AVX machines.
Differential Revision: http://reviews.llvm.org/D9474
llvm-svn: 239509
Part of D9474, this patch extends AVX2 v16i16 shift support by extending the vectors to 2 x v8i32 and using variable i32 shifts before packing back to i16.
Adds AVX2 tests for v8i16 and v16i16
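Per lane, the widening approach amounts to this scalar sketch (illustrative only, assuming the shift amount is already reduced below 16):
static inline unsigned short shl16_via_i32(unsigned short X, unsigned int Amt) {
  unsigned int Wide = X;           // extend the i16 lane to i32
  Wide <<= Amt;                    // AVX2 variable i32 shift, one lane of vpsllvd
  return (unsigned short)Wide;     // pack the result back down to i16
}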
llvm-svn: 238149
This patch disables unrolling (interleaving) in the loop vectorization pass when VF==1 on the x86 architecture,
by setting MaxInterleaveFactor to 1. Interleaving in the loop vectorization pass may introduce
the cost of an overflow check, memory boundary checks and extra prologue/epilogue code, on top of which the
regular unroller may unroll the loop another time. Disabling it when VF==1 removes this
unnecessary cost on x86. The same can be done for other platforms after verifying that the
interleaving/memory bound checking is not performance critical on those platforms.
Differential Revision: http://reviews.llvm.org/D9515
llvm-svn: 236613
base which it adds a single analysis pass to, to instead return the type
erased TargetTransformInfo object constructed for that TargetMachine.
This removes all of the pass variants for TTI. There is now a single TTI
*pass* in the Analysis layer. All of the Analysis <-> Target
communication is through the TTI's type erased interface itself. While
the diff is large here, it is nothing more than code motion to make
types available in a header file for use in a different source file
within each target.
I've tried to keep all the doxygen comments and file boilerplate in line
with this move, but let me know if I missed anything.
With this in place, the next step to making TTI work with the new pass
manager is to introduce a really simple new-style analysis that produces
a TTI object via a callback into this routine on the target machine.
Once we have that, we'll have the building blocks necessary to accept
a function argument as well.
llvm-svn: 227685
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting it into the right form.
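For reference, a generic sketch of the type-erasure idiom described above, with hypothetical names (this is not the actual TTI code): any concrete implementation is stored behind a small internal "concept" interface, and the wrapper forwards calls through it.
#include <memory>
#include <utility>

class AnyTTI {
  struct Concept {
    virtual ~Concept() = default;
    virtual unsigned getArithmeticInstrCost(unsigned Opcode) const = 0;
  };
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T I) : Impl(std::move(I)) {}
    unsigned getArithmeticInstrCost(unsigned Opcode) const override {
      return Impl.getArithmeticInstrCost(Opcode);  // delegate to the concrete impl
    }
  };
  std::unique_ptr<Concept> Impl;

public:
  template <typename T>
  explicit AnyTTI(T I) : Impl(new Model<T>(std::move(I))) {}
  unsigned getArithmeticInstrCost(unsigned Opcode) const {
    return Impl->getArithmeticInstrCost(Opcode);
  }
};
The real interface is of course far wider than this one method, which is where the verbosity discussed below comes from.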
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because the chaining-based delegation meant there was no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is a *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
I'm recommitting the codegen part of the patch.
The vectorizer part will be sent for review again.
Masked Vector Load and Store Intrinsics.
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for targets that support them (currently AVX2 and AVX-512). The vectorizer asks the target about the availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
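For illustration, a loop of roughly this shape (assumed example, not from the patch) is what the vectorizer can now handle with a masked load and a masked store instead of scalarizing the conditional access:
void foo(int *A, const int *B, const int *Trigger, int N) {
  for (int i = 0; i < N; ++i)
    if (Trigger[i] < 100)
      A[i] = B[i] + Trigger[i];   // conditional access becomes masked load/store
}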
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
llvm-svn: 223348
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot. I'll respond to the commit on the
list with a reproduction of one of the failures.
Conflicts:
lib/Target/X86/X86TargetTransformInfo.cpp
llvm-svn: 222936
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for targets that support them (currently AVX2 and AVX-512). The vectorizer asks the target about the availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
llvm-svn: 222632
AVX2 is available.
According to IACA, the new lowering has a throughput of 8 cycles instead of 13
with the previous one.
Although this lowering kicks in on some SPEC benchmarks, the performance
improvement was within the noise.
Correctness testing has been done for the whole range of uint32_t with the
following program:
uint4 v = (uint4) {0,1,2,3};
uint32_t i;
//Check correctness over entire range for uint4 -> float4 conversion
for( i = 0; i < 1U << (32-2); i++ )
{
float4 t = test(v);
float4 c = correct(v);
if( 0xf != _mm_movemask_ps( t == c ))
{
printf( "Error @ %vx: %vf vs. %vf\n", v, c, t);
return -1;
}
v += 4;
}
Where "correct" is the old lowering and "test" the new one.
The patch adds a test case for the two custom-lowered instructions.
It also modifies the vector cost model, which is why cast.ll and uitofp.ll are
modified.
2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7
instructions instead of 4 (3 more constant loads).
<rdar://problem/18153096>
llvm-svn: 221657
"Unroll" is not the appropriate name for this variable. Clang already uses
the term "interleave" in pragmas and metadata for this.
Differential Revision: http://reviews.llvm.org/D5066
llvm-svn: 217528
This patch adds support to recognize division by a uniform power of 2 and modifies the cost table to vectorize division by a uniform power of 2 whenever possible.
Updates the cost model for the Loop and SLP Vectorizers. The cost table is currently only updated for the X86 backend.
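An illustrative kernel (assumed shape) of what this enables: every element is divided by the same power of two, which can be lowered as a cheap shift-based sequence rather than scalar divisions.
void div_by_pow2(int *A, const int *B, int N) {
  for (int i = 0; i < N; ++i)
    A[i] = B[i] / 16;   // uniform power-of-2 divisor
}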
Thanks to Hal, Andrea, Sanjay for the review. (http://reviews.llvm.org/D4971)
llvm-svn: 216371
This lets us experiment with 512-bit vectorization without passing
force-vector-width manually.
The code generated for a simple integer memset loop is properly vectorized.
Disassembly is still broken for it though :(.
llvm-svn: 212634
This patch:
1) Improves the cost model for x86 alternate shuffles (originally
added at revision 211339);
2) Teaches the Cost Model Analysis pass how to analyze alternate shuffles.
Alternate shuffles are a special kind of blend; on x86, we can often
lower an alternate shuffle into a single blend
instruction (depending on the subtarget features).
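A hypothetical example of such a shuffle, written with clang's __builtin_shufflevector for illustration - lanes are taken alternately from the two inputs, which maps to a single blend on suitable subtargets:
typedef int v4si __attribute__((vector_size(16)));
v4si alternate(v4si A, v4si B) {
  return __builtin_shufflevector(A, B, 0, 5, 2, 7);   // <A0, B1, A2, B3>
}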
The existing cost model didn't take into account subtarget features.
Also, it had a couple of "dead" entries for vector types that are never
legal (example: on x86 types v2i32 and v2f32 are not legal; those are
always either promoted or widened to 128-bit vector types).
The new x86 cost model takes into account what target features we have
before returning the shuffle cost (i.e. the number of instructions
after the blend is lowered/expanded).
This patch also teaches the Cost Model Analysis how to identify and analyze
alternate shuffles (i.e. 'SK_Alternate' shufflevector instructions):
- added function 'isAlternateVectorMask';
- added some logic to check if an instruction is an alternate shuffle and, if
so, call the target-specific TTI to get the corresponding shuffle cost;
- added a test to verify the cost model analysis on alternate shuffles.
llvm-svn: 212296
This patch adds support to recognize patterns such as fadd,fsub,fadd,fsub.../add,sub,add,sub... and
vectorizes them as vector shuffles if they are profitable.
These patterns of vector shuffle can later be converted to instructions such as addsubpd etc on X86.
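An illustrative pattern (assumed shape, not from the test cases) of the kind that can now be matched - alternating subtract/add across consecutive lanes:
void addsub(double *C, const double *A, const double *B) {
  C[0] = A[0] - B[0];
  C[1] = A[1] + B[1];
  C[2] = A[2] - B[2];
  C[3] = A[3] + B[3];   // after SLP vectorization this maps to addsubpd
}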
Thanks to Arnold and Hal for the reviews. http://reviews.llvm.org/D4015
llvm-svn: 211339
lib/Target/X86/X86TargetTransformInfo.cpp: In member function ‘virtual unsigned int {anonymous}::X86TTI::getIntImmCost(unsigned int, unsigned int, const llvm::APInt&, llvm::Type*) const’:
lib/Target/X86/X86TargetTransformInfo.cpp:920:60: warning: enumeral and non-enumeral type in conditional expression [enabled by default]
This seems like an unhelpful warning, but there doesn't seem to be a controlling
flag, so add an explicit cast to silence the warning.
llvm-svn: 210806
This improves the X86 cost model for small constants with large types. Before
this commit we would even hoist trivial constants such as i96 2.
This is related to <rdar://problem/17070936>
llvm-svn: 210504
Currently the X86 backend doesn't support types larger than i128 very well. For
example an i192 multiply will assert in codegen when the 2nd argument is a constant and the constant got hoisted.
This fix changes the cost model to never hoist constants for types larger than
i128. Once the codegen issues have been resolved, the cost model can be updated
to allow also larger types.
This is related to <rdar://problem/16954938>
llvm-svn: 209162
The old method used by X86TTI to determine partial-unrolling thresholds was
messy (because it worked by testing target features), and also would not
correctly identify the target CPU if certain target features were disabled.
After some discussions on IRC with Chandler et al., it was decided that the
processor scheduling models were the right containers for this information
(because it is often tied to special uop dispatch-buffer sizes).
This does represent a small functionality change:
- For generic x86-64 (which uses the SB model and, thus, will get some
unrolling).
- For AMD cores (because they still currently use the SB scheduling model)
- For Haswell (based on benchmarking by Louis Gerbarg, it was decided to bump
the default threshold to 50; we're working on a test case for this).
Otherwise, nothing has changed for any other targets. The logic, however, has
been moved into BasicTTI, so other targets may now also opt-in to this
functionality simply by setting LoopMicroOpBufferSize in their processor
model definitions.
llvm-svn: 208289
The loop stream detector (LSD) on modern Intel cores, which optimizes the
execution of small loops, has limits on the number of taken branches in
addition to uop-count limits (modern AMD cores have similar limits).
Unfortunately, at the IR level, estimating the number of branches that will be
taken is difficult. For one thing, it strongly depends on later passes (block
placement, etc.). The original implementation took a conservative approach and
limited the maximal BB DFS depth of the loop. However, fairly-extensive
benchmarking by several of us has revealed that this is the wrong approach. In
fact, there are zero known cases where the branch limit prevents a detrimental
unrolling (but plenty of cases where it does prevent beneficial unrolling).
While we could improve the current branch counting logic by incorporating
branch probabilities, this further complication seems unjustified without a
motivating regression. Instead, unless and until a regression appears, the
branch counting will be removed.
llvm-svn: 208255
There is no need to check if we want to hoist the immediate value of a
shift instruction. Simply return TCC_Free right away.
This change is like r206101, but for X86.
rdar://problem/16190769
llvm-svn: 207692
This provides an initial implementation of getUnrollingPreferences for x86.
getUnrollingPreferences is used by the generic (concatenation) unroller, which
is distinct from the unrolling done by the loop vectorizer. Many modern x86
cores have some kind of uop cache and loop-stream detector (LSD) used to
efficiently dispatch small loops, and taking full advantage of this requires
unrolling small loops (small here means 10s of uops).
These caches also have limits on the number of taken branches in the loop, and
so we also cap the loop unrolling factor based on the maximum "depth" of the
loop. This is currently calculated with a partial DFS traversal (partial
because it will stop early if the path length grows too much). This is still an
approximation, and one that is both conservative (because it does not account
for branches eliminated via block placement) and optimistic (because it is only
recording the maximum depth over minimum paths). Nevertheless, because the
loops that fit in these uop caches are so small, it is not clear how much the
details matter.
The original set of patches posted for review produced the following test-suite
performance results (from the TSVC benchmark) at that time:
ControlLoops-dbl - 13% speedup
ControlLoops-flt - 15% speedup
Reductions-dbl - 7.5% speedup
llvm-svn: 205348
There is no direct AVX instruction to convert to unsigned. I have some ideas
how we may be able to do this with three vector instructions but the current
backend just bails on this to get it scalarized.
See the comment why we need to adjust the cost returned by BasicTTI.
The test is a bit roundabout (and checks assembly rather than bit code) because
I'd like it to work even if at some point we could vectorize this conversion.
Fixes <rdar://problem/16371920>
llvm-svn: 205159
If getElementPtr uses a constant as base pointer, then make the constant opaque.
This prevents constant folding it with the offset. The offset can usually be
encoded in the load/store instruction itself and the base address doesn't have
to be rematerialized several times.
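An illustrative pattern (hypothetical address, not from the patch): several stores off one large constant base. With the base kept opaque it is materialized once, and each offset folds into the store's addressing mode instead of base+offset being constant-folded into a fresh large immediate for every access.
void poke(void) {
  volatile int *Base = (volatile int *)0x123456780000ULL;   // hypothetical constant base
  Base[1] = 0;
  Base[2] = 0;
  Base[3] = 0;
}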
llvm-svn: 204739
The cost for the first four stackmap operands was always TCC_Free.
This is only true for the first two operands. All other operands
are TCC_Free if they fit within 64 bits.
llvm-svn: 204738
Extend the target hook to also take the operand index into account when
calculating the cost of the constant materialization.
Related to <rdar://problem/16381500>
llvm-svn: 204435
This commit extends the coverage of the constant hoisting pass, adds additional
debug output and updates the function names according to the style guide.
Related to <rdar://problem/16381500>
llvm-svn: 204389
the stack of the analysis group because they are all immutable passes.
This is made clear by Craig's recent work to use override
systematically -- we weren't overriding anything for 'finalizePass'
because there is no such thing.
This is kind of a lame restriction on the API -- we can no longer push
and pop things, we just set up the stack and run. However, I'm not
invested in building some better solution on top of the existing
(terrifying) immutable pass and legacy pass manager.
llvm-svn: 203437
'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.
The cost model now allows returning 'OK_NonUniformConstValue'
for non splat operands that are instances of ConstantVector or
ConstantDataVector.
With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, on X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.
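An illustrative contrast (hypothetical example), using GCC/clang vector extensions:
typedef int v4si __attribute__((vector_size(16)));
v4si shift_splat(v4si V) {
  return V << 3;              // uniform (splat) constant shift amount
}
v4si shift_per_lane(v4si V) {
  v4si Amt = {1, 2, 3, 4};    // non-uniform constant vector of shift amounts
  return V << Amt;            // per-lane shift; costed differently on X86
}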
This patch applies the following changes:
- The cost model computation now takes into account non-uniform constants;
- The cost of vector shift instructions has been improved in
X86TargetTransformInfo analysis pass;
- BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
between non-uniform and uniform constant operands.
Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
llvm-svn: 201272
The most important part of this is probably adding any cost at all for
operations like zext <8 x i8> to <8 x i32>. Before they were being
recorded as extremely costly (24, I believe) which made LLVM fall back
on a 4-wide vectorisation of a loop.
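An illustrative widening loop (assumed shape): each i8 element is zero-extended to i32, i.e. a zext <N x i8> to <N x i32> once vectorized.
void widen(unsigned *Dst, const unsigned char *Src, int N) {
  for (int i = 0; i < N; ++i)
    Dst[i] = Src[i];   // zext i8 -> i32 per element
}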
It also rebalances the values for sext, zext and trunc. Lacking any
other sane metric that might work across CPU microarchitectures I went
for instructions. This seems to be in reasonable accord with the rest
of the table (sitofp, ...) though no doubt at least one value is
sub-optimal for some bizarre reason.
Finally, separate AVX and AVX2 values are provided where appropriate.
The CodeGen is quite different in many cases.
rdar://problem/15981990
llvm-svn: 200928
This commit caused -Woverloaded-virtual warnings. The two new
TargetTransformInfo::getIntImmCost functions were only added to the superclass,
and to the X86 subclass. The other targets were not updated, and the
warning highlighted this by pointing out that e.g. ARMTTI::getIntImmCost was
hiding the two new getIntImmCost variants.
We could pacify the warning by adding "using TargetTransformInfo::getIntImmCost"
to the various subclasses, or turning it off, but I suspect that it's wrong to
leave the functions unimplemented in those targets. The default implementations
return TCC_Free, which I don't think is right e.g. for ARM.
llvm-svn: 200058
Retry commit r200022 with a fix for the build bot errors. Constant expressions
have (unlike instructions) module scope use lists and therefore may have users
in different functions. The fix is to simply ignore these out-of-function uses.
llvm-svn: 200034
This pass identifies expensive constants to hoist and coalesces them to
better prepare the code for SelectionDAG-based code generation. This works around the
limitations of the basic-block-at-a-time approach.
First it scans all instructions for integer constants and calculates their
cost. If the constant can be folded into the instruction (the cost is
TCC_Free) or the cost is just a simple operation (TCC_BASIC), then we don't
consider it expensive and leave it alone. This is the default behavior and
the default implementation of getIntImmCost will always return TCC_Free.
If the cost is more than TCC_BASIC, then the integer constant can't be folded
into the instruction and it might be beneficial to hoist the constant.
Similar constants are coalesced to reduce register pressure and
materialization code.
When a constant is hoisted, it is also hidden behind a bitcast to force it to
be live-out of the basic block. Otherwise the constant would be just
duplicated and each basic block would have its own copy in the SelectionDAG.
The SelectionDAG recognizes such constants as opaque and doesn't perform
certain transformations on them, which would create a new expensive constant.
This optimization is only applied to integer constants in instructions and
simple (this means not nested) constant cast expressions. For example:
%0 = load i64* inttoptr (i64 big_constant to i64*)
Reviewed by Eric
llvm-svn: 200022
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
Use it to avoid repeating ourselves too often. Also store MVT::SimpleValueType
in the TTI tables so they can be statically initialized; MVT's constructors
create bloated initialization code otherwise.
llvm-svn: 188095
The costs are overfitted so that I can still use the legalization factor.
For example, the following kernel has about half the throughput when vectorized
compared to unvectorized when compiled with SSE2. Before this patch we would vectorize it.
unsigned short A[1024];
double B[1024];
void f() {
int i;
for (i = 0; i < 1024; ++i) {
B[i] = (double) A[i];
}
}
radar://13599001
llvm-svn: 179033
SSE2 has efficient support for shifts by a scalar. My previous change, which made
shifts expensive, did not take this into account and marked all shifts as expensive.
This would prevent vectorization from happening where it is actually beneficial.
With this change we differentiate between shifts of constants and other shifts.
radar://13576547
llvm-svn: 178808
On certain architectures we can support efficient vectorized version of
instructions if the operand value is uniform (splat) or a constant scalar.
An example of this is a vector shift on x86.
We can efficiently support
for (i = 0; i < n; i += 4)
w[0:3] = v[0:3] << <2, 2, 2, 2>
but not
for (i = 0; i < n; i += 4)
w[0:3] = v[0:3] << x[0:3]
This patch adds a parameter to getArithmeticInstrCost to further qualify operand
values as uniform or uniform constant.
Targets can then choose to return a different cost for instructions with such
operand values.
A follow-up commit will test this feature on x86.
radar://13576547
llvm-svn: 178807
The default logic does not correctly identify costs of casts because they are
marked as custom on x86.
In some cases, where the shift amount is a scalar, we would be able to generate
better code. Unfortunately, when this is the case the value (the splat) will get
hoisted out of the loop, thereby making it invisible to ISel.
radar://13130673
radar://13537826
llvm-svn: 178703
- After moving the logic recognizing vector shifts with a scalar amount from
DAG combining into DAG lowering, we declare all vector shifts as custom,
even though vector shifts on AVX are legal. As a result, the cost model
needs special tuning to identify these legal cases.
llvm-svn: 177586
This matters for example in the following matrix multiply:
int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
int i, j, k, val;
for (i=0; i<rows; i++) {
for (j=0; j<cols; j++) {
val = 0;
for (k=0; k<cols; k++) {
val += m1[i][k] * m2[k][j];
}
m3[i][j] = val;
}
}
return(m3);
}
Taken from the test-suite benchmark Shootout.
We estimate the cost of the multiply to be 2 while we generate 9 instructions
for it and end up being quite a bit slower than the scalar version (48% on my
machine).
Also, properly differentiate between AVX1 and AVX2. On AVX1 we still split the
vector into two 128-bit halves and handle the subvector muls like above with 9
instructions.
Only on AVX2 will we have a cost of 9 for v4i64.
I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use an
add instead of a mul because with a mul we now no longer vectorize. I did
verify that the mul would be indeed more expensive when vectorized with 3
kernels:
for (i ...)
r += a[i] * 3;
for (i ...)
m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
and a matrix multiply.
In each case the vectorized version was considerably slower.
radar://13304919
llvm-svn: 176403
sext <4 x i1> to <4 x i64>
sext <4 x i8> to <4 x i64>
sext <4 x i16> to <4 x i64>
I'm running a combine on SIGN_EXTEND_IN_REG to revert the SEXT patterns:
(sext_in_reg (v4i64 anyext (v4i32 x)), ExtraVT) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x, ExtraVT)))
The sext_in_reg (v4i32 x) may be lowered to shl+sar operations.
There is no 64-bit "sar" operation, so lowering sext_in_reg (v4i64 x) has no vector solution.
I also added a cost for these operations to the AVX cost table.
llvm-svn: 175619
Moving the X86CostTable to a common place, so that other back-ends
can share the code. Also simplifying it a bit and commoning up the
tables for operations with one and two types.
llvm-svn: 172658
a TargetMachine to construct (and thus isn't always available), to an
analysis group that supports layered implementations much like
AliasAnalysis does. This is a pretty massive change, with a few parts
that I was unable to easily separate (sorry), so I'll walk through it.
The first step of this conversion was to make TargetTransformInfo an
analysis group, and to sink the nonce implementations in
ScalarTargetTransformInfo and VectorTargetTranformInfo into
a NoTargetTransformInfo pass. This allows other passes to add a hard
requirement on TTI, and assume they will always get at least one
implementation.
The TargetTransformInfo analysis group leverages the delegation chaining
trick that AliasAnalysis uses, where the base class for the analysis
group delegates to the previous analysis *pass*, allowing all but the
NoFoo analysis passes to only implement the parts of the interfaces they
support. It also introduces a new trick where each pass in the group
retains a pointer to the top-most pass that has been initialized. This
allows passes to implement one API in terms of another API and benefit
when some other pass above them in the stack has more precise results
for the second API.
The second step of this conversion is to create a pass that implements
the TargetTransformInfo analysis using the target-independent
abstractions in the code generator. This replaces the
ScalarTargetTransformImpl and VectorTargetTransformImpl classes in
lib/Target with a single pass in lib/CodeGen called
BasicTargetTransformInfo. This class actually provides most of the TTI
functionality, basing it upon the TargetLowering abstraction and other
information in the target independent code generator.
The third step of the conversion adds support to all TargetMachines to
register custom analysis passes. This allows building those passes with
access to TargetLowering or other target-specific classes, and it also
allows each target to customize the set of analysis passes desired in
the pass manager. The baseline LLVMTargetMachine implements this
interface to add the BasicTTI pass to the pass manager, and all of the
tools that want to support target-aware TTI passes call this routine on
whatever target machine they end up with to add the appropriate passes.
The fourth step of the conversion created target-specific TTI analysis
passes for the X86 and ARM backends. These passes contain the custom
logic that was previously in their extensions of the
ScalarTargetTransformInfo and VectorTargetTransformInfo interfaces.
I separated them into their own file, as now all of the interface bits
are private and they just expose a function to create the pass itself.
Then I extended these target machines to set up a custom set of analysis
passes, first adding BasicTTI as a fallback, and then adding their
customized TTI implementations.
The fourth step required logic that was shared between the target
independent layer and the specific targets to move to a different
interface, as they no longer derive from each other. As a consequence,
helper functions were added to TargetLowering representing the common
logic needed both in the target implementation and the codegen
implementation of the TTI pass. While technically this is the only
change that could have been committed separately, it would have been
a nightmare to extract.
The final step of the conversion was just to delete all the old
boilerplate. This got rid of the ScalarTargetTransformInfo and
VectorTargetTransformInfo classes, all of the support in all of the
targets for producing instances of them, and all of the support in the
tools for manually constructing a pass based around them.
Now that TTI is a relatively normal analysis group, two things become
straightforward. First, we can sink it into lib/Analysis which is a more
natural layer for it to live in. Second, clients of this interface can
depend on it *always* being available which will simplify their code and
behavior. These (and other) simplifications will follow in subsequent
commits, this one is clearly big enough.
Finally, I'm very aware that much of the comments and documentation
needs to be updated. As soon as I had this working, and plausibly well
commented, I wanted to get it committed and in front of the build bots.
I'll be doing a few passes over documentation later if it sticks.
Commits to update DragonEgg and Clang will be made presently.
llvm-svn: 171681