Summary:
Consider a GEP of:
i8* getelementptr ({ [2 x i8], i32, i8, [3 x i8] }* @main.c, i32 0, i32 0, i64 0)
If we proceeded to GEP the aforementioned object by 8, would form a GEP of:
i8* getelementptr ({ [2 x i8], i32, i8, [3 x i8] }* @main.c, i32 0, i32 0, i64 8)
Note that we would go through the first array member, causing an
out-of-bounds accesses. This is problematic because we might get fooled
if we are trying to evaluate loads using this GEP, for example, based
off of an object with a constant initializer where the array is zero.
This fixes PR17732.
Reviewers: nicholas, chandlerc, void
Reviewed By: void
CC: llvm-commits, echristo, void, aemerson
Differential Revision: http://llvm-reviews.chandlerc.com/D2093
llvm-svn: 194220
Patch by Michele Scandale!
Rewrite of the functions used to compute the backedge taken count of a
loop on LT and GT comparisons.
I decided to split the handling of LT and GT cases becasue the trick
"a > b == -a < -b" in some cases prevents the trip count computation
due to the multiplication by -1 on the two operands of the
comparison. This issue comes from the conservative computation of
value range of SCEVs: taking the negative SCEV of an expression that
have a small positive range (e.g. [0,31]), we would have a SCEV with a
fullset as value range.
Indeed, in the new rewritten function I tried to better handle the
maximum backedge taken count computation when MAX/MIN expression are
used to handle the cases where no entry guard is found.
Some test have been modified in order to check the new value correctly
(I manually check them and reasoning on possible overflow the new
values seem correct).
I finally added a new test case related to the multiplication by -1
issue on GT comparisons.
llvm-svn: 194116
Due to the previously added overflow checks, we can have a retain/release
relation that is one directional. This occurs specifically when we run into an
additive overflow causing us to drop state in only one direction. If that
occurs, we should bail and not optimize that retain/release instead of
asserting.
Apologies for the size of the testcase. It is necessary to cause the additive
cfg overflow to trigger.
rdar://15377890
llvm-svn: 194083
When the elements are extracted from a select on vectors
or a vector select, do the select on the extracted scalars
from the input if there is only one use.
llvm-svn: 194013
This adds an SimplifyLibCalls case which converts the special __sinpi and
__cospi (float & double variants) into a __sincospi_stret where appropriate to
remove duplicated work.
Patch by Tim Northover
llvm-svn: 193943
When the loop vectorizer was part of the SCC inliner pass manager gvn would
run after the loop vectorizer followed by instcombine. This way redundancy
(multiple uses) were removed and instcombine could perform scalarization on the
induction variables. Having moved the loop vectorizer to later we no longer run
any form of redundancy elimination before we perform instcombine. This caused
vectorized induction variables to survive that did not before.
On a recent iMac this helps linpack back from 6000Mflops to 7000Mflops.
This should also help lpbench and paq8p.
I ran a Release (without Asserts) build over the test-suite and did not see any
negative impact on compile time.
radar://15339680
llvm-svn: 193891
When a dependence check fails we can still try to vectorize loops with runtime
array bounds checks.
This helps linpack to vectorize a loop in dgefa. And we are back to 2x of the
scalar performance on a corei7-avx.
radar://15339680
llvm-svn: 193853
Given that backend does not handle "invoke asm" correctly ("invoke asm" will be
handled by SelectionDAGBuilder::visitInlineAsm, which does not have the right
setup for LPadToCallSiteMap) and we already made the assumption that inline asm
does not throw in InstCombiner::visitCallSite, we are going to make the same
assumption in Inliner to make sure we don't convert "call asm" to "invoke asm".
If it becomes necessary to add support for "invoke asm" later on, we will need
to modify the backend as well as remove the assumptions that inline asm does
not throw.
Fix rdar://15317907
llvm-svn: 193808
There are two ways one could implement hiding of linkonce_odr symbols in LTO:
* LLVM tells the linker which symbols can be hidden if not used from native
files.
* The linker tells LLVM which symbols are not used from other object files,
but will be put in the dso symbol table if present.
GOLD's API is the second option. It was implemented almost 1:1 in llvm by
passing the list down to internalize.
LLVM already had partial support for the first option. It is also very similar
to how ld64 handles hiding these symbols when *not* doing LTO.
This patch then
* removes the APIs for the DSO list.
* marks LTO_SYMBOL_SCOPE_DEFAULT_CAN_BE_HIDDEN all linkonce_odr unnamed_addr
global values and other linkonce_odr whose address is not used.
* makes the gold plugin responsible for handling the API mismatch.
llvm-svn: 193800
By vectorizing a series of srl, or, ... instructions we have obfuscated the
intention so much that the backend does not know how to fold this code away.
radar://15336950
llvm-svn: 193573
Partial fix for PR17459: wrong code at -O3 on x86_64-linux-gnu
(affecting trunk and 3.3)
When SCEV expands a recurrence outside of a loop it attempts to scale
by the stride of the recurrence. Chained recurrences don't work that
way. We could compute binomial coefficients, but would hve to
guarantee that the chained AddRec's are in a perfectly reduced form.
llvm-svn: 193438
Partial fix for PR17459: wrong code at -O3 on x86_64-linux-gnu
(affecting trunk and 3.3)
ScalarEvolutionNormalization was attempting to normalize by adding and
subtracting strides. Chained recurrences don't work that way.
llvm-svn: 193437
This patch teaches GlobalStatus to analyze a call that uses the global value as
a callee, not as an argument.
With this change internalize call handle the common use of linkonce_odr
functions. This reduces the number of linkonce_odr functions in a LTO build of
clang (checked with the emit-llvm gold plugin option) from 1730 to 60.
llvm-svn: 193436
The loop vectorizer does not currently understand how to vectorize
extractelement instructions. The existing check, which excluded all
vector-valued instructions, did not catch extractelement instructions because
it checked only the return value. As a result, vectorization would proceed,
producing illegal instructions like this:
%58 = extractelement <2 x i32> %15, i32 0
%59 = extractelement i32 %58, i32 0
where the second extractelement is illegal because its first operand is not a vector.
llvm-svn: 193434
Make sure we mark all loops (scalar and vector) when vectorizing,
so that we don't try to vectorize them anymore. Also, set unroll
to 1, since this is what we check for on early exit.
llvm-svn: 193349
Major steps include:
1). introduces a not-addr-taken bit-field in GlobalVariable
2). GlobalOpt pass sets "not-address-taken" if it proves a global varirable
dosen't have its address taken.
3). AA use this info for disambiguation.
llvm-svn: 193251
When a linkonce_odr value that is on the dso list is not unnamed_addr
we can still look to see if anything is actually using its address. If
not, it is safe to hide it.
This patch implements that by moving GlobalStatus to Transforms/Utils
and using it in Internalize.
llvm-svn: 193090
A landing pad can be jumped to only by the unwind edge of an invoke
instruction. If we eliminate a partially redundant load in a landing pad, it
will create a basic block that violates this constraint. It then leads to other
problems down the line if it tries to merge that basic block with the landing
pad. Avoid this by not eliminating the load in a landing pad.
PR17621
llvm-svn: 193064
One optimization simplify-cfg performs is the converting of switches to
lookup tables if the switch has > 4 cases. This is done by:
1. Finding the max/min case value and calculating the switch case range.
2. Create a lookup table basic block.
3. Perform a check in the switch's BB to see if the input value is in
the switch's case range. If the input value satisfies said predicate
branch to the lookup table BB, otherwise branch to the switch's default
destination BB using the default value as the result.
The conditional check consists of subtracting the min case value of the
table from any input iN value and then ensuring that said value is
unsigned less than the size of the lookup table represented as an iN
value.
If the lookup table is a covered lookup table, the size of the table will be N
which is 0 as an iN value. Thus the comparison will be an `icmp ult` of an iN
value against 0 which is always false yielding the incorrect result.
This patch fixes this problem by recognizing if we have a covered lookup table
and if we do, unconditionally jumps to the lookup table BB since the covering
property of the lookup table implies no input values could not be handled by
said BB.
rdar://15268442
llvm-svn: 193045
If the predecessor's being spliced into a landing pad, then we need the PHIs to
come first and the rest of the predecessor's code to come *after* the landing
pad instruction.
llvm-svn: 193035
Before this patch we relied on the order of phi nodes when we looked for phi
nodes of the same type. This could prevent vectorization of cases where there
was a phi node of a second type in between phi nodes of some type.
This is important for vectorization of an internal graphics kernel. On the test
suite + external on x86_64 (and on a run on armv7s) it showed no impact on
either performance or compile time.
radar://15024459
llvm-svn: 192537
If a function seen at compile time is not necessarily the one linked to
the binary being built, it is illegal to change the actual arguments
passing to it.
e.g.
--------------------------
void foo(int lol) {
// foo() has linkage satisifying isWeakForLinker()
// "lol" is not used at all.
}
void bar(int lo2) {
// xform to foo(undef) is illegal, as compiler dose not know which
// instance of foo() will be linked to the the binary being built.
foo(lol2);
}
-----------------------------
Such functions can be captured by isWeakForLinker(). NOTE that
mayBeOverridden() is insufficient for this purpose as it dosen't include
linkage types like AvailableExternallyLinkage and LinkOnceODRLinkage.
Take link_odr* as an example, it indicates a set of *EQUIVALENT* globals
that can be merged at link-time. However, the semantic of
*EQUIVALENT*-functions includes parameters. Changing parameters breaks
the assumption.
Thank John McCall for help, especially for the explanation of subtle
difference between linkage types.
rdar://11546243
llvm-svn: 192302
is updated to use DITypeRef.
Move isUnsignedDIType and getOriginalTypeSize from DebugInfo.h to be static
helper functions in DwarfCompileUnit. We already have a static helper function
"isTypeSigned" in DwarfCompileUnit, and a pointer to DwarfDebug is added to
resolve the derived-from field. All three functions need to go across link
for derived-from fields, so we need to get hold of a type identifier map.
A pointer to DwarfDebug is also added to DbgVariable in order to resolve the
derived-from field.
Debug info verifier is updated to check a derived-from field is a TypeRef.
Verifier will not go across link for derived-from fields, in debug info finder,
we go across the link to add derived-from fields to types.
Function getDICompositeType is only used by dragonegg and since dragonegg does
not generate identifier for types, we use an empty map to resolve the
derived-from field.
When printing a derived-from field, we use DITypeRef::getName to either return
the type identifier or getName of the DIType.
A paired commit at clang is required due to changes to DIBuilder.
llvm-svn: 192018
UpdatePHINodes has an optimization to reuse an existing PHI node, where it
first deletes all of its entries and then replaces them. Unfortunately, in the
case where we had duplicate predecessors (which are allowed so long as the
associated PHI entries have the same value), the loop removing the existing PHI
entries from the to-be-reused PHI would assert (if that PHI was not the one
which had the duplicates).
llvm-svn: 192001
Sort the operands of the other entries in the current vectorization root
according to the first entry's operands opcodes.
%conv0 = uitofp ...
%load0 = load float ...
= fmul %conv0, %load0
= fmul %load0, %conv1
= fmul %load0, %conv2
Make sure that we recursively vectorize <%conv0, %conv1, %conv2> and <%load0,
%load0, %load0>.
This makes it more likely to obtain vectorizable trees. We have to be careful
when we sort that we don't destroy 'good' existing ordering implied by source
order.
radar://15080067
llvm-svn: 191977
Generalize the API so we can distinguish symbols that are needed just for a DSO
symbol table from those that are used from some native .o.
The symbols that are only wanted for the dso symbol table can be dropped if
llvm can prove every other dso has a copy (linkonce_odr) and the address is not
important (unnamed_addr).
llvm-svn: 191922
Don't vectorize with a runtime check if it requires a
comparison between pointers with different address spaces.
The values can't be assumed to be directly comparable.
Previously it would create an illegal bitcast.
llvm-svn: 191862
This recursively strips all GEPs like the existing code. It also handles bitcasts and
other operations that do not change the pointer value.
llvm-svn: 191847
Switch instructions were crashing the StructurizeCFG pass, and it's
probably easier anyway if we don't need to handle them in this pass.
Reviewed-by: Christian König <christian.koenig@amd.com>
llvm-svn: 191841
is updated to use DITypeRef.
Move isUnsignedDIType and getOriginalTypeSize from DebugInfo.h to be static
helper functions in DwarfCompileUnit. We already have a static helper function
"isTypeSigned" in DwarfCompileUnit, and a pointer to DwarfDebug is added to
resolve the derived-from field. All three functions need to go across link
for derived-from fields, so we need to get hold of a type identifier map.
A pointer to DwarfDebug is also added to DbgVariable in order to resolve the
derived-from field.
Debug info verifier is updated to check a derived-from field is a TypeRef.
Verifier will not go across link for derived-from fields, in debug info finder,
we go across the link to add derived-from fields to types.
Function getDICompositeType is only used by dragonegg and since dragonegg does
not generate identifier for types, we use an empty map to resolve the
derived-from field.
When printing a derived-from field, we use DITypeRef::getName to either return
the type identifier or getName of the DIType.
A paired commit at clang is required due to changes to DIBuilder.
llvm-svn: 191800
Inspired by the object from the SLPVectorizer. This found a minor bug in the
debug loc restoration in the vectorizer where the location of a following
instruction was attached instead of the location from the original instruction.
llvm-svn: 191673
when it was actually a Constant*.
There are quite a few other casts to Instruction that might have the same problem,
but this is the only one I have a test case for.
llvm-svn: 191668
Currently foldSelectICmpAndOr asserts if the "or" involves a vector
containing several of the same power of two. We can easily avoid this by
only performing the fold on integer types, like foldSelectICmpAnd does.
Fixes <rdar://problem/15012516>
llvm-svn: 191552
Remove the command line argument "struct-path-tbaa" since we should not depend
on command line argument to decide which format the IR file is using. Instead,
we check the first operand of the tbaa tag node, if it is a MDNode, we treat
it as struct-path aware TBAA format, otherwise, we treat it as scalar TBAA
format.
When clang starts to use struct-path aware TBAA format no matter whether
struct-path-tbaa is no, and we can auto-upgrade existing bc files, the support
for scalar TBAA format can be dropped.
Existing testing cases are updated to use the struct-path aware TBAA format.
llvm-svn: 191538
We were previously using getFirstInsertionPt to insert PHI
instructions when vectorizing, but getFirstInsertionPt also skips past
landingpads, causing this to generate invalid IR.
We can avoid this issue by using getFirstNonPHI instead.
llvm-svn: 191526
Put them under a separate flag for experimentation. They are more likely to
interfere with loop vectorization which happens later in the pass pipeline.
llvm-svn: 191371
Some supplemental information for r191314: We would like to make sure SLP Vectorizer will not try to vectorize tiny trees even with a negative threshold so we set the cost to INT_MAX.
llvm-svn: 191327
This is safe per C++11 18.6.1.1p3: [operator new returns] a non-null pointer to
suitably aligned storage (3.7.4), or else throw a bad_alloc exception. This
requirement is binding on a replacement version of this function.
Brings us a tiny bit closer to eliminating more vector push_backs.
llvm-svn: 191310
Revert 191122 - with extra checks we are allowed to vectorize math library
function calls.
Standard library indentifiers are reserved names so functions with external
linkage must not overrided them. However, functions with internal linkage can.
Therefore, we can vectorize calls to math library functions with a check for
external linkage and matching signature. This matches what we do during
SelectionDAG building.
llvm-svn: 191206
Overflow doesn't affect the correctness of equalities. Computing this is cheap,
we just reuse the computation for the inbounds case and try to peel of more
non-inbounds GEPs. This pattern is unlikely to ever appear in code generated by
Clang, but SCEV occasionally produces it.
llvm-svn: 191200
SROA wants to convert any types of equivalent widths but it's not possible to
convert vectors of pointers to an integer scalar with a single cast. As a
workaround we add a bitcast to the corresponding int ptr type first. This type
of cast used to be an edge case but has become common with SLP vectorization.
Fixes PR17271.
llvm-svn: 191143
Reapply r191108 with a fix for a memory corruption error I introduced. Of
course, we can't reference the scalars that we replace by vectorizing and then
call their eraseFromParent method. I only 'needed' the scalars to get the
DebugLoc. Just store the DebugLoc before actually vectorizing instead. As a nice
side effect, this also simplifies the interface between BoUpSLP and the
HorizontalReduction class to returning a value pointer (the vectorized tree
root).
radar://14607682
llvm-svn: 191123
The problem of r191017 is that when GVN fabricate a val-number for a dead instruction (in order
to make following expr-PRE happy), it forget to fabricate a leader-table entry for it as well.
llvm-svn: 191118
Match reductions starting at binary operation feeding into a phi. The code
handles trees like
r += v1 + v2 + v3 ...
and
r += v1
r += v2
...
and
r *= v1 + v2 + ...
We currently only handle associative operations (add, fadd fast).
The code can now also handle reductions feeding into stores.
a[i] = v1 + v2 + v3 + ...
The code is currently disabled behind the flag "-slp-vectorize-hor". The cost
model for most architectures is not there yet.
I found one opportunity of a horizontal reduction feeding a phi in TSVC
(LoopRerolling-flt) and there are several opportunities where reductions feed
into stores.
radar://14607682
llvm-svn: 191108
The GEP pattern is what SCEV expander emits for "ugly geps". The latter is what
you get for pointer subtraction in C code. The rest of instcombine already
knows how to deal with that so just canonicalize on that.
llvm-svn: 191090
If "C1/X" were having multiple uses, the only benefit of this
transformation is to potentially shorten critical path. But it is at the
cost of instroducing additional div.
The additional div may or may not incur cost depending on how div is
implemented. If it is implemented using Newton–Raphson iteration, it dosen't
seem to incur any cost (FIXME). However, if the div blocks the entire
pipeline, that sounds to be pretty expensive. Let CodeGen to take care
this transformation.
This patch sees 6% on a benchmark.
rdar://15032743
llvm-svn: 191037
This is how it ignores the dead code:
1) When a dead branch target, say block B, is identified, all the
blocks dominated by B is dead as well.
2) The PHIs of those blocks in dominance-frontier(B) is updated such
that the operands corresponding to dead predecessors are replaced
by "UndefVal".
Using lattice's jargon, the "UndefVal" is the "Top" in essence.
Phi node like this "phi(v1 bb1, undef xx)" will be optimized into
"v1" if v1 is constant, or v1 is an instruction which dominate this
PHI node.
3) When analyzing the availability of a load L, all dead mem-ops which
L depends on disguise as a load which evaluate exactly same value as L.
4) The dead mem-ops will be materialized as "UndefVal" during code motion.
llvm-svn: 191017
XCore target: Add XCoreTargetTransformInfo
This is where getNumberOfRegisters() resides, which in turn returns the
number of vector registers (=0).
llvm-svn: 190936
Some of this code is no longer necessary since int<->ptr casts are no
longer occur as of r187444.
This also fixes handling vectors of pointers, and adds a bunch of new
testcases for vectors and address spaces.
llvm-svn: 190885
We can't insert an insertelement after an invoke. We would have to split a
critical edge. So when we see a phi node that uses an invoke we just give up.
radar://14990770
llvm-svn: 190871
other in memory.
The motivation was to get rid of truncate and shift right instructions that get
in the way of paired load or floating point load.
E.g.,
Consider the following example:
struct Complex {
float real;
float imm;
};
When accessing a complex, llvm was generating a 64-bits load and the imm field
was obtained by a trunc(lshr) sequence, resulting in poor code generation, at
least for x86.
The idea is to declare that two load instructions is the canonical form for
loading two arithmetic type, which are next to each other in memory.
Two scalar loads at a constant offset from each other are pretty
easy to detect for the sorts of passes that like to mess with loads.
<rdar://problem/14477220>
llvm-svn: 190870
We would have to compute the pre increment value, either by computing it on
every loop iteration or by splitting the edge out of the loop and inserting a
computation for it there.
For now, just give up vectorizing such loops.
Fixes PR17179.
llvm-svn: 190790
This pass was based on the previous (essentially unused) profiling
infrastructure and the assumption that by ordering the basic blocks at
the IR level in a particular way, the correct layout would happen in the
end. This sometimes worked, and mostly didn't. It also was a really
naive implementation of the classical paper that dates from when branch
predictors were primarily directional and when loop structure wasn't
commonly available. It also didn't factor into the equation
non-fallthrough branches and other machine level details.
Anyways, for all of these reasons and more, I wrote
MachineBlockPlacement, which completely supercedes this pass. It both
uses modern profile information infrastructure, and actually works. =]
llvm-svn: 190748
The PowerPC A2 core greatly benefits from aggressive concatenation unrolling;
use the new getUnrollingPreferences to enable this by default when targeting
the PPC A2 core.
llvm-svn: 190549
LLVM IR doesn't currently allow atomic bool load/store operations, and the
transformation is dubious anyway because it isn't profitable on all platforms.
PR17163.
llvm-svn: 190357
Several architectures use the same instruction to perform both a comparison and
a subtract. The instruction selection framework does not allow to consider
different basic blocks to expose such fusion opportunities.
Therefore, these instructions are “merged” by CSE at MI IR level.
To increase the likelihood of CSE to apply in such situation, we reorder the
operands of the comparison, when they have the same complexity, so that they
matches the order of the most frequent subtract.
E.g.,
icmp A, B
...
sub B, A
<rdar://problem/14514580>
llvm-svn: 190352
The work on this project was left in an unfinished and inconsistent state.
Hopefully someone will eventually get a chance to implement this feature, but
in the meantime, it is better to put things back the way the were. I have
left support in the bitcode reader to handle the case-range bitcode format,
so that we do not lose bitcode compatibility with the llvm 3.3 release.
This reverts the following commits: 155464, 156374, 156377, 156613, 156704,
156757, 156804 156808, 156985, 157046, 157112, 157183, 157315, 157384, 157575,
157576, 157586, 157612, 157810, 157814, 157815, 157880, 157881, 157882, 157884,
157887, 157901, 158979, 157987, 157989, 158986, 158997, 159076, 159101, 159100,
159200, 159201, 159207, 159527, 159532, 159540, 159583, 159618, 159658, 159659,
159660, 159661, 159703, 159704, 160076, 167356, 172025, 186736
llvm-svn: 190328
Field 2 of DIType (Context), field 9 of DIDerivedType (TypeDerivedFrom),
field 12 of DICompositeType (ContainingType), fields 2, 7, 12 of DISubprogram
(Context, Type, ContainingType).
llvm-svn: 190205
This reverts commit r189886.
I found a corner case where this optimization is not valid:
Say we have a "linkonce_odr unnamed_addr" in two translation units:
* In TU 1 this optimization kicks in and makes it hidden.
* In TU 2 it gets const merged with a constant that is *not* unnamed_addr,
resulting in a non unnamed_addr constant with default visibility.
* The static linker rules for combining visibility them produce a hidden
symbol, which is incorrect from the point of view of the non unnamed_addr
constant.
The one place we can do this is when we know that the symbol is not used from
another TU in the same shared object, i.e., during LTO. I will move it there.
llvm-svn: 189954
"(icmp op i8 A, B)" is equivalent to "(icmp op i8 (A & 0xff), B)" as a
degenerate case. Allowing this as a "masked" comparison when analysing "(icmp)
&/| (icmp)" allows us to combine them in more cases.
rdar://problem/7625728
llvm-svn: 189931
Even in cases which aren't universally optimisable like "(A & B) != 0 && (A &
C) != 0", the masks can make one of the comparisons completely redundant. In
this case, since we've gone to the effort of spotting masked comparisons we
should combine them.
rdar://problem/7625728
llvm-svn: 189930
Original message:
If a constant or a function has linkonce_odr linkage and unnamed_addr, mark
hidden. Being linkonce_odr guarantees that it is available in every dso that
needs it. Being a constant/function with unnamed_addr guarantees that the
copies don't have to be merged.
llvm-svn: 189886
The reason that I am turning off this optimization is that there is an
additional case where a block can escape that has come up. Specifically, this
occurs when a block is used in a scope outside of its current scope.
This can cause a captured retainable object pointer whose life is preserved by
the objc_retainBlock to be deallocated before the block is invoked.
An example of the code needed to trigger the bug is:
----
\#import <Foundation/Foundation.h>
int main(int argc, const char * argv[]) {
void (^somethingToDoLater)();
{
NSObject *obj = [NSObject new];
somethingToDoLater = ^{
[obj self]; // Crashes here
};
}
NSLog(@"test.");
somethingToDoLater();
return 0;
}
----
In the next commit, I remove all the dead code that results from this.
Once I put in the fixing commit I will bring back the tests that I deleted in
this commit.
rdar://14802782.
rdar://14868830.
llvm-svn: 189869
1) If the width of vectorization list candidate is bigger than vector reg width, we will break it down to fit the vector reg.
2) We do not vectorize the width which is not power of two.
The performance result shows it will help some spec benchmarks. mesa improved 6.97% and ammp improved 1.54%.
llvm-svn: 189830
The existing code missed some edge cases when e.g. we're going to emit sqrtf but
only the availability of sqrt was checked. This happens on odd platforms like
windows.
llvm-svn: 189724
PR17026. Also avoid undefined shifts and shift amounts larger than 64 bits
(those are always undef because we can't represent integer types that large).
llvm-svn: 189672
When unrolling is disabled in the pass manager, the loop vectorizer should also
not unroll loops. This will allow the -fno-unroll-loops option in Clang to
behave as expected (even for vectorizable loops). The loop vectorizer's
-force-vector-unroll option will (continue to) override the pass-manager
setting (including -force-vector-unroll=0 to force use of the internal
auto-selection logic).
In order to test this, I added a flag to opt (-disable-loop-unrolling) to force
disable unrolling through opt (the analog of -fno-unroll-loops in Clang). Also,
this fixes a small bug in opt where the loop vectorizer was enabled only after
the pass manager populated the queue of passes (the global_alias.ll test needed
a slight update to the RUN line as a result of this fix).
llvm-svn: 189499
The builder inserts from before the insert point,
not after, so this would insert before the last
instruction in the bundle instead of after it.
I'm not sure if this can actually be a problem
with any of the current insertions.
llvm-svn: 189285
DICompositeType will have an identifier field at position 14. For now, the
field is set to null in DIBuilder.
For DICompositeTypes where the template argument field (the 13th field)
was optional, modify DIBuilder to make sure the template argument field is set.
Now DICompositeType has 15 fields.
Update DIBuilder to use NULL instead of "i32 0" for null value of a MDNode.
Update verifier to check that DICompositeType has 15 fields and the last
field is null or a MDString.
Update testing cases to include an extra field for DICompositeType.
The identifier field will be used by type uniquing so a front end can
genearte a DICompositeType with a unique identifer.
llvm-svn: 189282
This patch enables unrolling of loops when vectorization is legal but not profitable.
We add a new class InnerLoopUnroller, that extends InnerLoopVectorizer and replaces some of the vector-specific logic with scalars.
This patch does not introduce any runtime regressions and improves the following workloads:
SingleSource/Benchmarks/Shootout/matrix -22.64%
SingleSource/Benchmarks/Shootout-C++/matrix -13.06%
External/SPEC/CINT2006/464_h264ref/464_h264ref -3.99%
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding -1.95%
llvm-svn: 189281
The current version of StripDeadDebugInfo became stale and no longer actually
worked since it was expecting an older version of debug info.
This patch updates it to use DebugInfoFinder and the modern DebugInfo classes as
much as possible to make it more redundent to such changes. Additionally, the
only place where that was avoided (the code where we replace the old sets with
the new), I call verify on the DIContextUnit implying that if the format changes
and my live set changes no longer make sense an assert will be hit. In order to
ensure that that occurs I have included a test case.
The actual stripping of the dead debug info follows the same strategy as was
used before in this class: find the live set and replace the old set in the
given compile unit (which may contain dead global variables/functions) with the
new live one.
llvm-svn: 189078
A single metadata will not span multiple lines. This also helps me with
my script to automatic update the testing cases.
A debug info testing case should have a llvm.dbg.cu.
Do not use hard-coded id for debug nodes.
llvm-svn: 189033
using GEPs. Previously, it used a number of different heuristics for
analyzing the GEPs. Several of these were conservatively correct, but
failed to fall back to SCEV even when SCEV might have given a reasonable
answer. One was simply incorrect in how it was formulated.
There was good code already to recursively evaluate the constant offsets
in GEPs, look through pointer casts, etc. I gathered this into a form
code like the SLP code can use in a previous commit, which allows all of
this code to become quite simple.
There is some performance (compile time) concern here at first glance as
we're directly attempting to walk both pointers constant GEP chains.
However, a couple of thoughts:
1) The very common cases where there is a dynamic pointer, and a second
pointer at a constant offset (usually a stride) from it, this code
will actually not do any unnecessary work.
2) InstCombine and other passes work very hard to collapse constant
GEPs, so it will be rare that we iterate here for a long time.
That said, if there remain performance problems here, there are some
obvious things that can improve the situation immensely. Doing
a vectorizer-pass-wide memoizer for each individual layer of pointer
values, their base values, and the constant offset is likely to be able
to completely remove redundant work and strictly limit the scaling of
the work to scrape these GEPs. Since this optimization was not done on
the prior version (which would still benefit from it), I've not done it
here. But if folks have benchmarks that slow down it should be straight
forward for them to add.
I've added a test case, but I'm not really confident of the amount of
testing done for different access patterns, strides, and pointer
manipulation.
llvm-svn: 189007
Update iterator when the SLP vectorizer changes the instructions in the basic
block by restarting the traversal of the basic block.
Patch by Yi Jiang!
Fixes PR 16899.
llvm-svn: 188832