The test case added in r265023 is failing on ninja-x64-msvc-RA-centos6.
Update the test to make less specific assumptions on code generation.
llvm-svn: 265026
PPCSimplifyAddress contains this code:
IntegerType *OffsetTy = ((VT == MVT::i32) ? Type::getInt32Ty(*Context)
: Type::getInt64Ty(*Context));
to determine the type to be used for an index register, if one needs
to be created. However, the "VT" here is the type of the data being
loaded or stored, *not* the type of an address. This means that if
a data element of type i32 is accessed using an index that does not
not fit into 32 bits, a wrong address is computed here.
Note that PPCFastISel is only ever used on 64-bit currently, so the type
of an address is actually *always* MVT::i64. Other parts of the code,
even in this same PPCSimplifyAddress routine, already rely on that fact.
Thus, this patch changes the code to simply unconditionally use
Type::getInt64Ty(*Context) as OffsetTy.
llvm-svn: 265023
This patch corresponds to review:
http://reviews.llvm.org/D18032
This patch provides asm implementation for the following instructions:
lwat, ldat, stwat, stdat, ldmx, mcrxrx
llvm-svn: 265022
Summary:
This change will handle missing store pair opportunity where the first store
instruction stores zero followed by the non-zero store. For example, this change
will convert :
str wzr, [x8]
str w1, [x8, #4]
into:
stp wzr, w1, [x8]
Reviewers: jmolloy, t.p.northover, mcrosier
Subscribers: flyingforyou, aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18570
llvm-svn: 265021
The fast isel pass currently emits a COPY_TO_REGCLASS node to convert
from a F4RC to a F8RC register class during conversion of a
floating-point number to integer. There is actually no support in the
common code instruction printers to emit COPY_TO_REGCLASS nodes, so the
PowerPC back-end has special code there to simply ignore
COPY_TO_REGCLASS.
This is correct *if and only if* the source and destination registers of
COPY_TO_REGCLASS are the same (except for the different register class).
But nothing guarantees this to be the case, and if the register
allocator does end up allocating source and destination to different
registers after all, the back-end simply generates incorrect code. I've
included a test case that shows such incorrect code generation.
However, it seems that COPY_TO_REGCLASS is actually not intended to be
used at the MI layer at all. It is used during SelectionDAG, but always
lowered to a plain COPY before emitting MI. Other back-end's fast isel
passes never emit COPY_TO_REGCLASS at all. I suspect it is simply wrong
for the PowerPC back-end to emit it here.
This patch changes the PowerPC back-end to directly emit COPY instead of
COPY_TO_REGCLASS and removes the special handling in the instruction
printers.
Differential Revision: http://reviews.llvm.org/D18605
llvm-svn: 265020
Summary:
There are too many instructions to exhaustively test so addiu and lwc2 are
used as representative examples.
It should be noted that many memory instructions that should have simm16
range checking do not because it is also necessary to support the macro
of the same name which accepts simm32. The range checks for these occur in
the macro expansion.
Reviewers: vkalintiris
Subscribers: dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D18437
llvm-svn: 265019
Summary:
ldc2/sdc2 now emit slightly worse diagnostics for MIPS-I. The problem
is that they don't trigger the custom parser because all the candidates
are disabled by feature bits. On all other subtargets, the diagnostics are
accurate but are subject to the usual issues of needing to report multiple
ways to correct the code (e.g. smaller offset, enable a CPU feature) but
only being able to report one error.
Reviewers: vkalintiris
Subscribers: dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D18436
llvm-svn: 265018
Summary:
Also, made test_mi10.s formatting consistent with the majority of the
MC tests.
Reviewers: vkalintiris
Subscribers: dsanders, llvm-commits
Differential Revision: http://reviews.llvm.org/D18435
llvm-svn: 265014
Summary:
The bug was that microMIPS's [ls]w[lr]e instructions claimed to support a
12-bit offset when it is only 9-bit.
Reviewers: vkalintiris
Subscribers: llvm-commits, dsanders
Differential Revision: http://reviews.llvm.org/D18434
llvm-svn: 265010
PPC has a vector popcount, this lets the vectorizer use the correct cost
for it. Tweak X86 test to use an intrinsic that's actually scalarized (we
have a somewhat efficient lowering for vector popcount using SSE, the
cost model finds that now).
llvm-svn: 265005
* LLVMDisposeMessage lives in llvm-c/Core.h, include this file where necessary
* LLVMAddTargetData has been removed, follow suit in the bindings
Differential Revision: http://reviews.llvm.org/D18633
llvm-svn: 265001
When dealing with complex<float>, and similar structures with two
single-precision floating-point numbers, especially when such things are being
passed around by value, we'll sometimes end up loading both float values by
extracting them from one 64-bit integer load. It looks like this:
t13: i64,ch = load<LD8[%ref.tmp]> t0, t6, undef:i64
t16: i64 = srl t13, Constant:i32<32>
t17: i32 = truncate t16
t18: f32 = bitcast t17
t19: i32 = truncate t13
t20: f32 = bitcast t19
The problem, especially before the P8 where those bitcasts aren't legal (and
get expanded via the stack), is that it would have been better to use two
floating-point loads directly. Here we add a target-specific DAGCombine to do
just that. In short, we turn:
ld 3, 0(5)
stw 3, -8(1)
rldicl 3, 3, 32, 32
stw 3, -4(1)
lfs 3, -4(1)
lfs 0, -8(1)
into:
lfs 3, 4(5)
lfs 0, 0(5)
llvm-svn: 264988
The test case was defining and using a function 'notExported()', but
the FileCheck checks were checking for the name 'not_exported'. This
changes the test to use 'notExported' across the board. Also, the test
defined a function 'not_defined()', but doesn't have any checks related
to it. For consistency, this name is changed to 'notDefined'. A later
commit will add checks for 'notDefined'.
Patch by Warren Ristow!
llvm-svn: 264984
Summary:
As discussed on llvm-dev[1].
This change adds the basic boilerplate code around having this intrinsic
in LLVM:
- Changes in Intrinsics.td, and the IR Verifier
- A lowering pass to lower @llvm.experimental.guard to normal
control flow
- Inliner support
[1]: http://lists.llvm.org/pipermail/llvm-dev/2016-February/095523.html
Reviewers: reames, atrick, chandlerc, rnk, JosephTremoulet, echristo
Subscribers: mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18527
llvm-svn: 264976
The size savings are significant, and from what I can tell, both ICC and GCC do this.
Differential Revision: http://reviews.llvm.org/D18573
llvm-svn: 264966
r264884 introduced a helper to escape the backslashes in the source file
path, but I since discovered an existing mechanism to escape strings.
llvm-svn: 264936
Widening a PHI requires us to insert a trunc.
The logical place for this trunc is in the same BB as the PHI.
This is not possible if the BB is terminated by a catchswitch.
This fixes PR27133.
llvm-svn: 264926
Fix for issue introduced D17297, where we were breaking early from the loop detecting consecutive loads which could leave us thinking a consecutive load with zeros was possible.
llvm-svn: 264922
The TailDup transform was removed in r138841 in 2011, along with most
of the tests for it. This test, however, was missed. Probably because
it had already been XFAIL'd for 3 years at that point (since r52243!)
and continued to fail when the opt flag for -tailduplicate stopped
being valid.
llvm-svn: 264916
This change prevents the loop vectorizer from vectorizing when all of the vector
types it generates will be scalarized. I've run into this problem on the PPC's QPX
vector ISA, which only holds floating-point vector types. The loop vectorizer
will, however, happily vectorize loops with purely integer computation. Here's
an example:
LV: The Smallest and Widest types: 32 / 32 bits.
LV: The Widest register is: 256 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 1 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Scalar loop costs: 3.
LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 2 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 2 for VF 2 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 0 for VF 4 For instruction: %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction: %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 4 For instruction: %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 4 for VF 4 For instruction: store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction: %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 4 costs: 1.
...
LV: Selecting VF: 8.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #4 Interval # 1
LV(REG): At #5 Interval # 1
LV(REG): VF = 8
The problem is that the cost model here is not wrong, exactly. Since all of
these operations are scalarized, their cost (aside from the uniform ones) are
indeed VF*(scalar cost), just as the model suggests. In fact, the larger the VF
picked, the lower the relative overhead from the loop itself (and the
induction-variable update and check), and so in a sense, picking the largest VF
here is the right thing to do.
The problem is that vectorizing like this, where all of the vectors will be
scalarized in the backend, isn't really vectorizing, but rather interleaving.
By itself, this would be okay, but then the vectorizer itself also interleaves,
and that's where the problem manifests itself. There's aren't actually enough
scalar registers to support the normal interleave factor multiplied by a factor
of VF (8 in this example). In other words, the problem with this is that our
register-pressure heuristic does not account for scalarization.
While we might want to improve our register-pressure heuristic, I don't think
this is the right motivating case for that work. Here we have a more-basic
problem: The job of the vectorizer is to vectorize things (interleaving aside),
and if the IR it generates won't generate any actual vector code, then
something is wrong. Thus, if every type looks like it will be scalarized (i.e.
will be split into VF or more parts), then don't consider that VF.
This is not a problem specific to PPC/QPX, however. The problem comes up under
SSE on x86 too, and as such, this change fixes PR26837 too. I've added Sanjay's
reduced test case from PR26837 to this commit.
Differential Revision: http://reviews.llvm.org/D18537
llvm-svn: 264904
Summary:
This results in higher register usage, but should make it easier for
the compiler to hide latency.
This pass is a prerequisite for some more scheduler improvements, and I
think the increase register usage with this patch is acceptable, because
when combined with the scheduler improvements, the total register usage
will decrease.
shader-db stats:
2382 shaders in 478 tests
Totals:
SGPRS: 48672 -> 49088 (0.85 %)
VGPRS: 34148 -> 34847 (2.05 %)
Code Size: 1285816 -> 1289128 (0.26 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 492544 -> 573440 (16.42 %) bytes per wave
Max Waves: 6856 -> 6846 (-0.15 %)
Wait states: 0 -> 0 (0.00 %)
Depends on D18451
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18452
llvm-svn: 264876
For compatability with GAS, nop and nopr are recognized as alises for
bc and bcr, respectively. A mask of 0 turns these instructions
effectively into no-operations.
Reviewed by Ulrich Weigand.
llvm-svn: 264875
This reverts commit r264869. I am seeing Windows bot failures due to the
"\" in the path being mishandled at some point (seems to be interpreted
wrongly at some point and llvm-as | llvm-dis is yielding some junk
characters). Need to investigate.
llvm-svn: 264871
XOP's VPPERM has some great 'permute operations' that it can do as well as part of shuffling the bytes of a 128-bit vector - in this case we use it to perform BITREVERSE in a single instruction.
llvm-svn: 264870
Summary:
This change serializes out and in the SourceFileName to LLVM assembly
so that it is preserved through "llvm-dis | llvm-as". This is
necessary to ensure that the global identifiers created for local values
in the module summary index are the same even if the bitcode is
streamed out and read back from LLVM assembly.
Serializing the summary itself to LLVM assembly is in progress.
Reviewers: joker.eph
Subscribers: llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D18588
llvm-svn: 264869
We are currently doing a REALLY bad job of packing results of vector comparisons into the legalized <X x i1> result equivalents - a mixture of PACKSS/PMOVMSKB would be much better here.
llvm-svn: 264867
We already try not to truncate PHIs in computeMinimalBitwidths. LoopVectorize can't handle it and we really don't need to, because both induction and reduction PHIs are truncated by other means.
However, we weren't bailing out in all the places we should have, and we ended up by returning a PHI to be truncated, which has caused PR27018.
This fixes PR17018.
llvm-svn: 264852
operations.
Specifically, we had code that tried to badly approximate reconstructing
all of the possible variations on addressing modes in two x86
instructions based on those in one pseudo instruction. This is not the
first bug uncovered with doing this, so stop doing it altogether.
Instead generically and pedantically copy every operand from the address
over to both new instructions, and strip kill flags from any register
operands.
This fixes a subtle bug seen in the wild where we would mysteriously
drop parts of the addressing mode, causing for example the index
argument in the added test case to just be completely ignored.
Hypothetically, this was an extremely bad miscompile because it actually
caused a predictable and leveragable write of a 64bit quantity to an
unintended offset (the first element of the array intead of whatever
other element was intended). As a consequence, in theory this could even
have introduced security vulnerabilities.
However, this was only something that could happen with an atomic
floating point add. No other operation could trigger this bug, so it
seems extremely unlikely to have occured widely in the wild.
But it did in fact occur, and frequently in scientific applications
which were using relaxed atomic updates of a floating point value after
adding a delta. Those would end up being quite badly miscompiled by
LLVM, which is how we found this. Of course, this often looks like
a race condition in the code, but it was actually a miscompile.
I suspect that this whole RELEASE_FADD thing was a complete mistake.
There is no such operation, and I worry that anything other than add
will get remarkably worse codegeneration. But that's not for this
change....
llvm-svn: 264845
Prior to this patch, the MemorySSA caching visitor would cache all
calls that it visited. When paired with phi optimization, this can be
problematic. Consider:
define void @foo() {
; 1 = MemoryDef(liveOnEntry)
call void @clobberFunction()
br i1 undef, label %if.end, label %if.then
if.then:
; MemoryUse(??)
call void @readOnlyFunction()
; 2 = MemoryDef(1)
call void @clobberFunction()
br label %if.end
if.end:
; 3 = MemoryPhi(...)
; MemoryUse(?)
call void @readOnlyFunction()
ret void
}
When optimizing MemoryUse(?), we visit defs 1 and 2, so we note to
cache them later. We ultimately end up not being able to optimize
passed the Phi, so we set MemoryUse(?) to point to the Phi. We then
cache the clobbering call for def 1 to be the Phi.
This commit changes this behavior so that we wipe out any calls
added to VisistedCalls while visiting the defs of a phi we couldn't
optimize.
Aside: With this patch, we now can bootstrap clang/LLVM without a
single MemorySSA verifier failure. Woohoo. :)
llvm-svn: 264820
This patch teaches the caching MemorySSA walker a few things:
1. Not to walk Phis we've walked before. It seems that we tried to do
this before, but it didn't work so well in cases like:
define void @foo() {
%1 = alloca i8
%2 = alloca i8
br label %begin
begin:
; 3 = MemoryPhi({%0,liveOnEntry},{%end,2})
; 1 = MemoryDef(3)
store i8 0, i8* %2
br label %end
end:
; MemoryUse(?)
load i8, i8* %1
; 2 = MemoryDef(1)
store i8 0, i8* %2
br label %begin
}
Because we wouldn't put Phis in Q.Visited until we tried to visit them.
So, when trying to optimize MemoryUse(?):
- We would visit 3 above
- ...Which would make us put {%0,liveOnEntry} in Q.Visited
- ...Which would make us visit {%0,liveOnEntry}
- ...Which would make us put {%end,2} in Q.Visited
- ...Which would make us visit {%end,2}
- ...Which would make us visit 3
- ...Which would realize we've already visited everything in 3
- ...Which would make us conservatively return 3.
In the added test-case, (@looped_visitedonlyonce) this behavior would
cause us to give incorrect results. Specifically, we'd visit 4 twice
in the same query, but on the second visit, we'd skip while.cond because
it had been visited, visit if.then/if.then2, and cache "1" as the
clobbering def on the way back.
2. If we try to walk the defs of a {Phi,MemLoc} and see it has been
visited before, just hand back the Phi we're trying to optimize.
I promise this isn't as terrible as it seems. :)
We now insert {Phi,MemLoc} pairs just before walking the Phi's upward
defs. So, we check the cache for the {Phi,MemLoc} pair before checking
if we've already walked the Phi.
The {Phi,MemLoc} pair is (almost?) always guaranteed to have a cache
entry if we've already fully walked it, because we cache as we go.
So, if the {Phi,MemLoc} pair isn't in cache, either:
(a) we must be in the process of visiting it (in which case, we can't
give a better answer in a cache-as-we-go DFS walker)
(b) we visited it, but didn't cache it on the way back (...which seems
to require `ModifyingAccess` to not dominate `StartingAccess`,
so I'm 99% sure that would be an error. If it's not an error, I
haven't been able to get it to happen locally, so I suspect it's
rare.)
- - - - -
As a consequence of this change, we no longer skip upward defs of phis,
so we can kill the `VisitedOnlyOne` check. This gives us better accuracy
than we had before, at the cost of potentially doing a bit more work
when we have a loop.
llvm-svn: 264814
This is effectively NFC, minus the renaming of the options
(-cyclone-prefetch-distance -> -prefetch-distance).
The change was requested by Tim in D17943.
llvm-svn: 264806
We have known races on profile counters, which can be reproduced by enabling
-fsanitize=thread and -fprofile-instr-generate simultaneously on a
multi-threaded program. This patch avoids reporting those races by not
instrumenting the reads and writes coming from the instruction profiler.
llvm-svn: 264805
During ADCE, track which debug info scopes still have live references
from the code, and delete debug info intrinsics for the dead ones.
These intrinsics describe the locations of variables (in registers or
stack slots). If there's no code left corresponding to a variable's
scope, then there's no way to reference the variable in the debugger and
it doesn't matter what its value is.
I add a DEBUG printout when the described location in an SSA register,
in case it helps some trying to track down why locations get lost.
However, we still delete these; the scope itself isn't attached to any
real code, so the ship has already sailed.
llvm-svn: 264800
1. Removed the run line for mingw32 and made the Darwin triples unknown.
This is a test of 32-bit vs. 64-bit platform and the underlying hardware.
We have other tests for checking behavioral differences of the OS platform.
2. Changed the CPU specifiers to the attributes they were meant to represent.
Any CPU that doesn't have SSE4.2 is assumed to have slow unaligned 16-byte accesses,
so it won't use those here.
3. Although the stores really could all be CHECK-DAG, I left them as CHECK-NEXT to
show the strange behavior of the instruction scheduler in the SLOW_32 case.
4. The odd-looking instructions are due to the use of a null pointer in the IR, so
we have integer immediate store addresses. Cute.
llvm-svn: 264796
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/nm.html .
1) For Mach-O files the code was not printing the values in hex as is the default.
2) The values printed had leading zeros which they should not have.
3) The address for undefined symbols was printed as spaces instead of 0.
4) With the -A option with posix output for an archive did not use square
brackets around the archive member name.
rdar://25311883 and rdar://25299678
llvm-svn: 264778
Add function soft attribute to the generation of Jump Tables in CodeGen
as initial step towards clang support of gcc's no-jump-table support
Reviewers: hans, echristo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18321
llvm-svn: 264756
Fixed fp_to_uint instruction selection on KNL.
One pattern was missing for <4 x double> to <4 x i32>
Differential Revision: http://reviews.llvm.org/D18512
llvm-svn: 264701
When eliminating or merging almost empty basic blocks, the existence of non-trivial PHI nodes
is currently used to recognize potential loops of which the block is the header and keep the block.
However, the current algorithm fails if the loops' exit condition is evaluated only with volatile
values hence no PHI nodes in the header. Especially when such a loop is an outer loop of a nested
loop, the loop is collapsed into a single loop which prevent later optimizations from being
applied (e.g., transforming nested loops into simplified forms and loop vectorization).
The patch augments the existing PHI node-based check by adding a pre-test if the BB actually
belongs to a set of loop headers and not eliminating it if yes.
llvm-svn: 264697
Instead of using two feature bits, one to indicate the availability of the
popcnt[dw] instructions, and another to indicate whether or not they're fast,
use a single enum. This allows more consistent control via target attribute
strings, and via Clang's command line.
llvm-svn: 264690
Minimum density for both optsize and non optsize are now options
-sparse-jump-table-density (default 10) for non optsize functions
-dense-jump-table-density (default 40) for optsize functions, which
matches the current default. This improves several benchmarks at google
at the cost of a small codesize increase. For code compiled with -Os,
the old behavior continues
llvm-svn: 264689
Personality is copied as part of copyFunctionAttributes, but it is
invalid on a declaration. Remove the personality attribute it the
function body is not cloned.
Also add a verifier run over output modules in the llvm-split tool.
llvm-svn: 264667
If all a BUILD_VECTOR's source elements are the same bit (AND/XOR/OR) operation type and each has one constant operand, lower to a pair of BUILD_VECTOR and just apply the bit operation to the vectors.
The constant operands will form a constant vector meaning that we still only have a single BUILD_VECTOR to lower and we will have replaced all the scalarized operations with a single SSE equivalent.
Its not in our interest to start make a general purpose vectorizer from this, but I'm seeing enough of these scalar bit operations from the later legalization/scalarization stages to support them at least.
Differential Revision: http://reviews.llvm.org/D18492
llvm-svn: 264666
Function names in ObjC can have spaces in them. This interacts poorly
with name compression, which uses spaces to separate PGO names. Fix the
issue by using a different separator and update a test.
I chose "\01" as the separator because 1) it's non-printable, 2) we
strip it from PGO names, and 3) it's the next natural choice once "\00"
is discarded (that one's overloaded).
What's changed since the original commit?
- I fixed up the covmap-V2 binary format tests using a linux VM.
- I weakened the CHECK lines in instrprof-comdat.h to account for the
fact that there have been bugfixes to clang coverage. These will be
fixed up in a follow-up.
- I added an assert to make sure we don't get bitten by this again.
- I constructed the c-general.profraw file without name compression
enabled to appease some bots.
Differential Revision: http://reviews.llvm.org/D18516
llvm-svn: 264658
A DICompileUnit that is not listed in llvm.dbg.cu will cause assertion
failures and/or crashes in the backend. The Verifier should reject this.
rdar://problem/25369499
llvm-svn: 264657
This is a fix for PR26941.
When there is both a section and a global definition with the same
name, the global wins.
Section symbols are not added to the symbol table; section references
are left undefined and fixed up in the object writer unless they've
been satisfied by some other definition.
llvm-svn: 264649
On OS X El Capitan and iOS 9, the linker supports a new section
attribute, live_support, which allows dead stripping to remove dead
globals along with the ASAN metadata about them.
With this change __asan_global structures are emitted in a new
__DATA,__asan_globals section on Darwin.
Additionally, there is a __DATA,__asan_liveness section with the
live_support attribute. Each entry in this section is simply a tuple
that binds together the liveness of a global variable and its ASAN
metadata structure. Thus the metadata structure will be alive if and
only if the global it references is also alive.
Review: http://reviews.llvm.org/D16737
llvm-svn: 264645
Function names in ObjC can have spaces in them. This interacts poorly
with name compression, which uses spaces to separate PGO names. Fix the
issue by using a different separator and update a test.
I chose "\01" as the separator because 1) it's non-printable, 2) we
strip it from PGO names, and 3) it's the next natural choice once "\00"
is discarded (that one's overloaded).
This reverts the revert commit beaf3d18. What's changed?
- I fixed up the covmap-V2 binary format tests using a linux VM.
- I updated the expected counts in instrprof-comdat.h to account for
the fact that there have been bugfixes to clang coverage.
- I added an assert to make sure we don't get bitten by this again.
Differential Revision: http://reviews.llvm.org/D18516
llvm-svn: 264641
They do have a def machine operand.
Fixing the definition is necessary for an upcoming patch.
Differential Revision: http://reviews.llvm.org/D18384
llvm-svn: 264607
The A2 cores support the popcntw/popcntd instructions, but they're microcoded,
and slower than our default software emulation. Specifically, popcnt[dw] take
approximately 74 cycles, whereas our software emulation takes only 24-28
cycles.
I've added a new target feature to indicate a slow popcnt[dw], instead of just
removing the existing target feature from the a2/a2q processor models, because:
1. This allows us to return more accurate information via the TTI interface
(I recognize that this currently makes no practical difference)
2. Is hopefully easier to understand (it allows the core's features to match
its manual while still having the desired effect).
llvm-svn: 264600
When eliminating or merging almost empty basic blocks, the existence of non-trivial PHI nodes
is currently used to recognize potential loops of which the block is the header and keep the block.
However, the current algorithm fails if the loops' exit condition is evaluated only with volatile
values hence no PHI nodes in the header. Especially when such a loop is an outer loop of a nested
loop, the loop is collapsed into a single loop which prevent later optimizations from being
applied (e.g., transforming nested loops into simplified forms and loop vectorization).
The patch augments the existing PHI node-based check by adding a pre-test if the BB actually
belongs to a set of loop headers and not eliminating it if yes.
llvm-svn: 264596
MachineFunctionProperties represents a set of properties that a MachineFunction
can have at particular points in time. Existing examples of this idea are
MachineRegisterInfo::isSSA() and MachineRegisterInfo::tracksLiveness() which
will eventually be switched to use this mechanism.
This change introduces the AllVRegsAllocated property; i.e. the property that
all virtual registers have been allocated and there are no VReg operands
left.
With this mechanism, passes can declare that they require a particular property
to be set, or that they set or clear properties by implementing e.g.
MachineFunctionPass::getRequiredProperties(). The MachineFunctionPass base class
verifies that the requirements are met, and handles the setting and clearing
based on the delcarations. Passes can also directly query and update the current
properties of the MF if they want to have conditional behavior.
This change annotates the target-independent post-regalloc passes; future
changes will also annotate target-specific ones.
Reviewers: qcolombet, hfinkel
Differential Revision: http://reviews.llvm.org/D18421
llvm-svn: 264593
Summary:
This helps prevent load clustering from drastically increasing register
pressure by trying to cluster 4 SMRDx8 loads together. The limit of 16
bytes was chosen, because it seems like that was the original intent
of setting the limit to 4 instructions, but more analysis could show
that a different limit is better.
This fixes yields small decreases in register usage with shader-db, but
also helps avoid a large increase in register usage when lane mask
tracking is enabled in the machine scheduler, because lane mask tracking
enables more opportunities for load clustering.
shader-db stats:
2379 shaders in 477 tests
Totals:
SGPRS: 49744 -> 48600 (-2.30 %)
VGPRS: 34120 -> 34076 (-0.13 %)
Code Size: 1282888 -> 1283184 (0.02 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 495616 -> 492544 (-0.62 %) bytes per wave
Max Waves: 6843 -> 6853 (0.15 %)
Wait states: 0 -> 0 (0.00 %)
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18451
llvm-svn: 264589
ICMP instruction selection fails on SKX and KNL for i1 operand.
I use XOR to resolve:
(A == B) is equivalent to (A xor B) == 0
Differential Revision: http://reviews.llvm.org/D18511
llvm-svn: 264566
Spiritually reapply commit r264409 (reverted in r264410), albeit with a
bit of a redesign.
Firstly, avoid splitting the big blob into multiple chunks of strings.
r264409 imposed an arbitrary limit to avoid a massive allocation on the
shared 'Record' SmallVector. The bug with that commit only reproduced
when there were more than "chunk-size" strings. A test for this would
have been useless long-term, since we're liable to adjust the chunk-size
in the future.
Thus, eliminate the motivation for chunk-ing by storing the string sizes
in the blob. Here's the layout:
vbr6: # of strings
vbr6: offset-to-blob
blob:
[vbr6]: string lengths
[char]: concatenated strings
Secondly, make the output of llvm-bcanalyzer readable.
I noticed when debugging r264409 that llvm-bcanalyzer was outputting a
massive blob all in one line. Past a small number, the strings were
impossible to split in my head, and the lines were way too long. This
version adds support in llvm-bcanalyzer for pretty-printing.
<STRINGS abbrevid=4 op0=3 op1=9/> num-strings = 3 {
'abc'
'def'
'ghi'
}
From the original commit:
Inspired by Mehdi's similar patch, http://reviews.llvm.org/D18342, this
should (a) slightly reduce bitcode size, since there is less record
overhead, and (b) greatly improve reading speed, since blobs are super
cheap to deserialize.
llvm-svn: 264551
The implementation is fairly obvious. This is preparation for using
some blobs in bitcode.
For clarity (and perhaps future-proofing?), I moved the call to
JumpToBit in BitstreamCursor::readRecord ahead of calling
MemoryObject::getPointer, since JumpToBit can theoretically (a) read
bytes, which (b) invalidates the blob pointer.
This isn't strictly necessary the two memory objects we have:
- The return of RawMemoryObject::getPointer is valid until the memory
object is destroyed.
- StreamingMemoryObject::getPointer is valid until the next chunk is
read from the stream. Since the JumpToBit call is only going ahead
to a word boundary, we'll never load another chunk.
However, reordering makes it clear by inspection that the blob returned
by BitstreamCursor::readRecord will be valid.
I added some tests for StreamingMemoryObject::getPointer and
BitstreamCursor::readRecord.
llvm-svn: 264549
Summary:
Add a statistic to count the number of imported functions. Also, add a
new -print-imports option to emit a trace of imported functions, that
works even for an NDEBUG build.
Note that emitOptimizationRemark does not work for the above printing as
it expects a Function object and DebugLoc, neither of which we have
with summary-based importing.
This is part 2 of D18487, the first part was committed separately as
r264536.
Reviewers: joker.eph
Subscribers: llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D18487
llvm-svn: 264537
Intrinsic::maxnum and Intrinsic::minnum, along with the associated libc
function calls (fmax[f], etc.) generally map to function calls after lowering.
For some vector types with QPX at least, however, we can legally lower these,
and we don't need to prohibit CTR-based loops on their account.
It turned out, however, that the logic that checked the opcodes associated with
intrinsics was broken (it would set the Opcode variable, but that variable was
later checked only if set for some otherwise-external function call.
This fixes the latter problem and adds the FMAX/MINNUM mappings.
llvm-svn: 264532
Reject the following IR as malformed (assuming that %entry, %next are
not in a loop):
next:
%y = phi i32 [ 0, %entry ]
%x = phi i32 [ %y, %entry ]
Such PHI nodes came up in PR26718. While there was no consensus on
whether or not this is valid IR, most opinions on that bug and in a
discussion on the llvm-dev mailing list tended towards a
"strict interpretation" (term by Joseph Tremoulet) of PHI node uses.
Also, the language reference explicitly states that "the use of each
incoming value is deemed to occur on the edge from the corresponding
predecessor block to the current block" and
`DominatorTree::dominates(Instruction*, Use&)` uses this definition as
well.
For the code mentioned in PR15384, clang does not compile to such PHIs
(anymore?). The test case still hangs when replacing `%tmp6` with `%tmp`
in revisions before r176366 (where PR15384 has been fixed). The
occurrence of %tmp6 therefore was probably unintentional. Its value is
not used except in other PHIs.
Reviewers: majnemer, reames, JosephTremoulet, bkramer, grosser, jdoerfert, kparzysz, sanjoy
Differential Revision: http://reviews.llvm.org/D18443
llvm-svn: 264528
If you're building dwps from other dwps, it can be hard to track down a
duplicate CU ID if it comes from two compilations of the same file in
different modes, etc. By including the .dwo path (which is hopefully
more unique than the file path) it can help track down where the
duplicates came from.
llvm-svn: 264520
Currently this is to mainly to prevent scalarization of integer division by constants.
Differential Revision: http://reviews.llvm.org/D18307
llvm-svn: 264511
Summary:
Now that the summary contains the full reference/call graph, we can
replace the existing function importer that loads and inspect the IR
to iteratively walk the call graph by a traversal based purely on the
summary information. Decouple the actual importing decision from any
IR manipulation.
Reviewers: tejohnson
Subscribers: llvm-commits, joker.eph
Differential Revision: http://reviews.llvm.org/D18343
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 264503
A release fence acts as a publication barrier for stores within the current thread to become visible to other threads which might observe the release fence. It does not require the current thread to observe stores performed on other threads. As a result, we can allow store-load and load-load forwarding across a release fence.
We choose to be much more conservative about stores. In theory, nothing prevents us from shifting a store from after a release fence to before it, and then eliminating the preceeding (previously fenced) store. Doing this without actually moving the second store is likely also legal, but we chose to be conservative at this time.
The LangRef indicates only atomic loads and stores are effected by fences. This patch chooses to be far more conservative then that.
This is the GVN companion to http://reviews.llvm.org/D11434 which applied the same logic in EarlyCSE and has been baking in tree for a while now.
Differential Revision: http://reviews.llvm.org/D11436
llvm-svn: 264472
When encountering instructions with regmasks, instead of cleaning up all the
elements in MaybeDeadCopies map, remove only the instructions erased. By keeping
more instruction in MaybeDeadCopies, this change will expose more dead copies
across instructions with regmasks.
llvm-svn: 264462
When merging stores in DAGCombiner, add check to ensure that no
dependenices exist that would cause the construction of a cycle in our
DAG. This may happen if one store has a data dependence on another
instruction (e.g. a load) which itself has a (chain) dependence on
another store being merged. These stores cannot be merged safely and
doing so results in a cycle that is discovered in LegalizeDAG.
This test is only done in cases where Antialias analysis is used (UseAA)
as non-AA store merge candidates will be merged logically after all
loads which have been checked to not alias.
Reviewers: ahatanak, spatel, niravd, arsenm, hfinkel, tstellarAMD, jyknight
Subscribers: llvm-commits, tberghammer, danalbert, srhines
Differential Revision: http://reviews.llvm.org/D18336
llvm-svn: 264461
I didn't notice any significant changes in the actual checks here;
all of these tests already used FileCheck, so a script can batch
update them in one shot.
This commit is just to show the value of automating this process:
We have uniform formatting as opposed to a mish-mash of check
structure that changes based on individual prefs and the current
fashion. This makes it simpler to update when we find a bug or
make an enhancement.
llvm-svn: 264457
This changes RS4GC to lower calls to ``@llvm.experimental.deoptimize``
to gc.statepoints wrapping ``__llvm_deoptimize``, and changes
``callsGCLeafFunction`` to recognize ``@llvm.experimental.deoptimize``
as a non GC leaf function.
I've had to hard code the ``"__llvm_deoptimize"`` name in
RewriteStatepointsForGC; since ``TargetLibraryInfo`` is available only
during codegen. This isn't without precedent in the codebase, so I'm
not overtly concerned.
llvm-svn: 264456
It is possible to have a fallthrough MBB prior to MBB placement. The original
addition of the BB would result in reordering the BB as not preceding the
successor. Because of the fallthrough nature of the BB, we could end up
executing incorrect code or even a constant pool island! Insert the spliced BB
into the same location to avoid that.
Thanks to Tim Northover for invaluable hints and Fiora for the discussion on
what may have been occurring!
llvm-svn: 264454
64-bit, 32-bit and 16-bit move-immediate instructions are 7, 6, and 5 bytes,
respectively, whereas and/or with 8-bit immediate is only three bytes.
Since these instructions imply an additional memory read (which the CPU could
elide, but we don't think it does), restrict these patterns to minsize functions.
Differential Revision: http://reviews.llvm.org/D18374
llvm-svn: 264440
Now register parameters that aren't saved to the stack or CSRs are
considered dead after the first call. Previously the debugger would show
whatever was in the register.
Fixes PR26589
Reviewers: aprantl
Differential Revision: http://reviews.llvm.org/D17211
llvm-svn: 264429
Optimize output of MDStrings in bitcode. This emits them in big blocks
(currently 1024) in a pair of records:
- BULK_STRING_SIZES: the sizes of the strings in the block, and
- BULK_STRING_DATA: a single blob, which is the concatenation of all
the strings.
Inspired by Mehdi's similar patch, http://reviews.llvm.org/D18342, this
should (a) slightly reduce bitcode size, since there is less record
overhead, and (b) greatly improve reading speed, since blobs are super
cheap to deserialize.
I needed to add support for blobs to streaming input to get the test
suite passing.
- StreamingMemoryObject::getPointer reads ahead and returns the
address of the blob.
- To avoid a possible reallocation of StreamingMemoryObject::Bytes,
BitstreamCursor::readRecord needs to move the call to JumpToEnd
forward so that getPointer is the last bitstream operation.
llvm-svn: 264409
This is the same as r255936, with added logic for avoiding clobbering of the
red zone (PR26023).
Differential Revision: http://reviews.llvm.org/D18246
llvm-svn: 264375
Remove logic to upgrade !llvm.loop by changing the MDString tag
directly. This old logic would check (and change) arbitrary strings
that had nothing to do with loop metadata. Instead, check !llvm.loop
attachments directly, and change which strings get attached.
Rather than updating the assembly-based upgrade, drop it entirely. It
has been quite a while since we supported upgrading textual IR.
llvm-svn: 264373
We did not have an explicit branch to the continuation BB. When the check was
hoisted, this could permit control follow to fall through into the division
trap. Add the explicit branch to the continuation basic block to ensure that
code execution is correct.
llvm-svn: 264370