This method can be called with a '0' argument, in which case it checks the
return value. However, the method it calls doesn't expect '0' as a valid
value. Call the correct method when the argument is 0.
llvm-svn: 164735
If the offset is more than 24 bits, it won't fit in a scattered
relocation offset field, so we fall back to using a non-scattered
relocation.
rdar://12358909
llvm-svn: 164724
This opaque class will contain all of the attributes. All attribute queries will
go through this object. This object will also be uniqued in the LLVMContext.
Currently not used, so no implementation change.
llvm-svn: 164722
Teach the callgraph logic not to create callgraph edges to intrinsics for
invoke instructions; it already skips this for call instructions. Fixes PR13903.
llvm-svn: 164707
- Put statistics in alphabetical order
- Don't use getZExtValue when building TableInt, just use APInts
- Introduce Create{Z,S}ExtOrTrunc in IRBuilder (a usage sketch follows).
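For illustration, a minimal sketch of how the new IRBuilder helper might
be used (the wrapper function here is hypothetical):

  #include "llvm/IR/IRBuilder.h"

  // Coerce V to DestTy, zero-extending or truncating as needed;
  // CreateZExtOrTrunc is a no-op when the types already match.
  llvm::Value *coerceToWidth(llvm::IRBuilder<> &B, llvm::Value *V,
                             llvm::IntegerType *DestTy) {
    return B.CreateZExtOrTrunc(V, DestTy);
  }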
llvm-svn: 164696
rewriter in SROA to carry a proper alignment. This involves
interrogating various sources of alignment, etc. This is a more complete
and principled fix to PR13920 as well as related bugs pointed out by Eli
in review and by inspection in the area.
Also by inspection fix the integer and vector promotion paths to create
aligned loads and stores. I still need to work up test cases for
these... Sorry for the delay, they were found purely by inspection.
llvm-svn: 164689
tables in bitmaps when they fit in a target-legal register.
This saves some space, and it also allows for building tables that would
otherwise be deemed too sparse.
One interesting case that this hits is example 7 from
http://blog.regehr.org/archives/320. We currently generate good code
for this when lowering the switch to the selection DAG: we build a
bitmask to decide whether to jump to one block or the other. My patch
will result in the same bitmask, but it removes the need for the jump,
as the return value can just be retrieved from the mask.
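A hypothetical C++ analogue of the idea: the switch's boolean results are
packed into a register-sized bitmap, and the answer is read back with a
shift and a mask, with no jump (the case set below is made up):

  #include <cstdint>

  // "Does case x yield true?" answered from a 64-bit bitmap instead of
  // a jump. Bit i is set iff case i yields true.
  bool inSet(unsigned x) {
    const uint64_t Table = 0x5551; // cases 0, 4, 6, 8, 10, 12, 14 -> true
    return x < 64 && ((Table >> x) & 1);
  }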
llvm-svn: 164684
For example, under a Linux chroot, /proc/ might not be mounted.
Therefore, we test whether this file exists. If it does, we use it (the
current behavior). Otherwise, we fall back to the detection used by *BSD.
The issue has been reported initially on the Debian bug tracker:
http://bugs.debian.org/674588
llvm-svn: 164676
This should really, really fix PR13916. For real this time. The
underlying bug is... a bit more subtle than I had imagined.
The setup is a code pattern that leads to an @llvm.memcpy call with two
equal pointers to an alloca in the source and dest. Now, not any pattern
will do. The alloca needs to be formed just so, and both pointers should
be wrapped in different bitcasts etc. When this precise pattern hits,
a funny sequence of events transpires. First, we correctly detect the
potential for overlap, and correctly optimize the memcpy. The first
time. However, we do simplify the set of users of the alloca, and that
causes us to run the alloca back through the SROA pass in case there are
knock-on simplifications. At this point, a curious thing has happened.
If we happen to have an i8 alloca, we have direct i8 pointer values. So
we don't bother creating a cast; we rewrite the arguments to the memcpy
to refer directly to the alloca.
Now, in an unrelated area of the pass, we have clever logic which
ensures that when visiting each User of a particular pointer derived
from an alloca, we only visit that User once, and directly inspect all
of its operands which refer to that particular pointer value. However,
the mechanism used to detect memcpy's with the potential to overlap
relied upon getting visited once per *Use*, not once per *User*. This is
always true *unless* the same exact value is both source and dest. It
turns out that almost nothing actually produces that pattern though.
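To make the distinction concrete, a sketch in terms of LLVM's iterator
spellings (the modern users()/uses() names; code of this era spelled the
same traversals with use_begin()/use_end()):

  #include "llvm/IR/Instructions.h"

  void visitSketch(llvm::AllocaInst *AI) {
    for (llvm::User *U : AI->users()) { // one visit per User
      (void)U;
    }
    for (llvm::Use &U : AI->uses()) {   // one visit per Use (operand);
      (void)U;                          // a memcpy whose source and dest
    }                                   // are both AI shows up twice here
  }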
We can hand craft test cases that more directly test this behavior of
course, and those are included. Also, note that there is a significant
missed optimization here -- we prove in many cases that there is
a non-volatile memcpy call with identical source and dest addresses. We
shouldn't prevent splitting the alloca in that case, and in fact we
should just remove such memcpy calls eagerly. I'll address that in
a subsequent commit.
llvm-svn: 164669
scalar-to-vector conversion that we cannot handle. For instance, when an invalid
constraint is used in an inline asm statement.
<rdar://problem/12284092>
llvm-svn: 164662
- Instead of embedding 'lock' into each mnemonic of the atomic
instructions except 'xchg', teach the X86 assembly printer to output the
'lock' prefix, consistent with the code emitter.
llvm-svn: 164659
scalar-to-vector conversion that we cannot handle. For instance, when an invalid
constraint is used in an inline asm statement.
<rdar://problem/12284092>
llvm-svn: 164657
The Apple buildbots are set up to pass --target to configure for both
cross- and non-cross-compile builds, and the standard autoconf response
to this is to set the program prefix to '<target>-'. Until we can figure
out the proper way to handle this (don't pass --target? pass an explicit
--program-prefix=""? don't auto-populate program_prefix with target_alias?)
it's more important to keep the buildbots running.
This reverts r164633 / ba48ceb1a3802e20e781ef04ea2573ffae2ac414.
llvm-svn: 164651
only a missed optimization opportunity if the store is over-aligned, but a
miscompile if the store's new type has a higher natural alignment than the
memcpy did. Fixes PR13920!
llvm-svn: 164641
reason we were getting two of the same alloca is because of a memmove/memcpy
which had the same alloca in both the src and dest. Now we detect that case
directly. This has the same testcase as before, but fixes a clang test
CodeGenObjC/exceptions.m which runs clang -O2.
llvm-svn: 164636
Chandler, it's not obvious that it's okay that this alloca gets into the list
twice to begin with. Please review and see whether this is the fix you really
want, but I wanted to get a fix checked in quickly.
llvm-svn: 164634
Provide an interface in TargetLowering to set or get the minimum number
of basic blocks at which jump tables are generated for switch statements
rather than an if sequence.
getMinimumJumpTableEntries() defaults to 4.
setMinimumJumpTableEntries() allows target configuration.
This patch changes the default for the Hexagon architecture to 5
as it improves performance on some benchmarks.
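A sketch of the target-side usage (the subclass is hypothetical and its
constructor signature simplified; the value 5 is what this patch sets
for Hexagon):

  // Illustrative: a target opts into a higher threshold in its
  // TargetLowering subclass.
  struct MyTargetLowering : llvm::TargetLowering {
    explicit MyTargetLowering(llvm::TargetMachine &TM)
        : llvm::TargetLowering(TM) {
      setMinimumJumpTableEntries(5); // default is 4
    }
  };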
llvm-svn: 164628
When a BL/BLX references a symbol in the same translation unit that is
out of range, use an external relocation. The linker will use this to
generate a branch island rather than a direct reference, allowing the
relocation to resolve correctly.
rdar://12359919
llvm-svn: 164615
to chains or cycles between PHIs and/or selects. Also add a couple of
really nice test cases reduced from Kostya's reports in PR13905 and
PR13906. Both are fixed by this patch.
llvm-svn: 164596
Previously it was only able to detect problems if the pointer was a numerical
value (e.g. inttoptr i32 1 to i32*), but not if it was an alloca or global. The
reason was the use of ComputeMaskedBits: imagine you have "alloca i8, align 2",
and ask ComputeMaskedBits what it knows about the bits of the alloca pointer.
It can tell you that the bottom bit is known zero (due to align 2) but it can't
tell you that bit 1 is known one. That's because the address could be an even
multiple of 2 rather than an odd multiple; e.g. it might be a multiple of 4.
Thus trying to use KnownOne is ineffective in the case of an alloca, as it will
never have any bits set. Instead, look explicitly for constant offsets from
allocas and globals.
llvm-svn: 164595
David (I think), but I would appreciate folks verifying that this fixes
the big crasher.
I'm still working on a reduced test case, but because this was causing
problems I wanted to get the fix checked in quickly.
llvm-svn: 164585
Even out-of-line jump tables can be in the code section, so mark them
as data-regions for those targets which support the directives.
rdar://12362871&12362974
llvm-svn: 164571
I also moved the SDKROOT setting into the make flags, since clearing it from
the environment isn't good enough to override a setting on the make command
line. That hasn't been a problem but it could be, and it's good to be
consistent with the way UNIVERSAL_SDK_PATH is handled.
llvm-svn: 164565
integer promotion analogous to vector promotion. When there is an
integer alloca being accessed both as its integer type and as a narrower
integer type, promote the narrower access to "insert" and "extract" the
smaller integer from the larger one, and make the integer alloca
a candidate for promotion.
In the new formulation, we don't care about target-legal integers, nor do
we use thresholds to control things. Instead, we only perform this promotion to
an integer type which the frontend has already emitted a load or store
for. This bounds the scope and prevents optimization passes from
coalescing larger and larger entities into a single integer.
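As a sketch of the "extract" half of the rewrite, assuming a 32-bit
integer alloca read through an i8 load at byte offset 1 on a
little-endian target (the helper function is hypothetical and the
offset-to-shift math simplified):

  #include "llvm/IR/IRBuilder.h"

  // Extract byte 1 of a 32-bit value: shift the unwanted low byte away,
  // then truncate to the narrow access type.
  llvm::Value *extractByte1(llvm::IRBuilder<> &B, llvm::Value *Wide32) {
    llvm::Value *Shifted = B.CreateLShr(Wide32, 8); // drop byte 0
    return B.CreateTrunc(Shifted, B.getInt8Ty());   // keep byte 1
  }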
llvm-svn: 164479
across the uses of the alloca. It's entirely possible for negative
numbers to come up here, and in some rare cases simply doing the 2's
complement arithmetic isn't the correct decision. Notably, we can't zext
the index of the GEP. The definition of GEP is that these offsets are
sign extended or truncated to the size of the pointer, and then wrapping
2's complement arithmetic is used.
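A small sketch of the rule in APInt terms (illustrative values):

  #include "llvm/ADT/APInt.h"

  void signExtendSketch() {
    llvm::APInt Idx(32, -1, /*isSigned=*/true); // a negative GEP index
    llvm::APInt Good = Idx.sextOrTrunc(64);     // all ones: still -1
    llvm::APInt Bad = Idx.zextOrTrunc(64);      // 0x00000000FFFFFFFF,
    (void)Good;                                 // i.e. 4294967295 -- the
    (void)Bad;                                  // kind of bug fixed here
  }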
This patch fixes an issue that comes up with *no* input from the
buildbots or bootstrap afaict. The only place where it manifested,
disturbingly, is Clang's own regression test suite. A reduced and
targeted collection of tests are added to cope with this. Note that I've
tried to pin down the potential cases of overflow, but may have missed
some cases. I've tried to add a few cases to test this, but it's hard
because LLVM has quite limited support for >64-bit constructs.
llvm-svn: 164475
This patch fixes load/store instructions to handle less common cases
such as "asr #32" and "rrx" properly throughout the MC layer.
Patch by Chris Lidbury.
llvm-svn: 164455
This silences several analyzer warnings within LLVM, and provides a slightly
nicer crash experience when someone calls isa<>, cast<>, or dyn_cast<> with
a null pointer.
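A sketch of the behavioral contract after this change:

  #include "llvm/IR/Instructions.h"
  #include "llvm/Support/Casting.h"

  void castSketch(llvm::Value *V /* possibly null */) {
    // isa<>, cast<>, and dyn_cast<> now assert on a null pointer; the
    // null-tolerant spelling remains dyn_cast_or_null<>.
    if (llvm::Instruction *I = llvm::dyn_cast_or_null<llvm::Instruction>(V))
      (void)I;
  }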
llvm-svn: 164439
selects with a constant condition. This resulted in the operands
remaining live through the SROA rewriter. Most of the time, this just
caused some dead allocas to persist and get zapped by later passes, but
in one case found by Joerg, it caused a crash when we tried to *promote*
the alloca despite it having this dead use. We already have the
mechanisms in place to handle this, just wire select up to them.
llvm-svn: 164427
whether or not we want to print out backtrace information. Useful
for libraries that don't need backtrace information on a crash.
rdar://11844710
llvm-svn: 164426
care about it being an argument variable so that we can decide
that captured block and lambda vars that don't happen to
be arguments could be an argument pointer.
Add the object pointer for one case onto the subprogram die.
rdar://12001329
llvm-svn: 164419
We inserted a placeholder that was never replaced because the function was
already visited. Assert that all placeholders have been resolved when tearing
down the bitcode reader.
Fixes PR13895.
llvm-svn: 164369
The expression-based expansion too often results in IR-level optimizations
splitting the intermediate values into separate basic blocks, preventing
the formation of the VBSL instruction as the code author intended. In
particular, LICM would often hoist part of the computation out of a loop.
rdar://11011471
llvm-svn: 164340
A PHI can't create interference on its own. If two live ranges interfere
at a PHI, they must also interfere when leaving one of the PHI
predecessors.
llvm-svn: 164330
The old-fashioned many-to-one value mapping doesn't always work when
merging vector lanes. A value can map to multiple different values, and
it can even be necessary to insert new PHIs.
When a value number is defined by a copy from a value number that
required SSA update, include the live range of the copied value number
in the SSA update as well. It is not necessarily a copy of the original
value number any longer.
llvm-svn: 164329
We already have HoistThenElseCodeToIf, this patch implements
SinkThenElseCodeToEnd. When the END block has only two predecessors and
each predecessor terminates with an unconditional branch, we compare the
instructions in the IF and ELSE blocks backwards and check whether we can
sink the common instructions down, as sketched below.
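A hypothetical source-level analogue of the effect:

  // Before: the multiply is duplicated in both arms.
  int beforeSink(bool c, int a, int b) {
    int x;
    if (c)
      x = a * 7;
    else
      x = b * 7;
    return x;
  }

  // After sinking (conceptually): only the differing operand stays in
  // the arms; the common multiply moves past the join.
  int afterSink(bool c, int a, int b) {
    int t = c ? a : b;
    return t * 7;
  }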
rdar://12191395
llvm-svn: 164325
- Rewrite/merge pseudo-atomic instruction emitters to address the
  following issues:
  * Reduce one unnecessary load in the spin-loop. Previously the
    spin-loop looked like:

      thisMBB:
      newMBB:
        ld  t1 = [bitinstr.addr]
        op  t2 = t1, [bitinstr.val]
        not t3 = t2 (if Invert)
        mov EAX = t1
        lcs dest = [bitinstr.addr], t3  [EAX is implicit]
        bz  newMBB
        fallthrough --> nextMBB

    The 'ld' at the beginning of newMBB should be lifted out of the
    loop, as lcs (CMPXCHG on x86) will load the current memory value
    into EAX. The loop is refined as:

      thisMBB:
        EAX = LOAD [MI.addr]
      mainMBB:
        t1 = OP [MI.val], EAX
        LCMPXCHG [MI.addr], t1  [EAX is implicitly used & defined]
        JNE mainMBB
      sinkMBB:

  * Remove immopc since, so far, all pseudo-atomic instructions have
    all-register forms only; there is no immediate operand.
  * Remove unnecessary attributes/modifiers in the pseudo-atomic
    instruction td definitions.
  * Fix issues in PR13458.
- Add comprehensive tests on atomic ops on various data types.
  NOTE: Some of them are turned off due to missing functionality.
- Revise tests due to the new spin-loop generated.
This fixes some obscure failure cases involving registers defined inside multiclasses or foreach constructs that would not receive a unique ID, and would end up being omitted from the AsmMatcher tables.
llvm-svn: 164251
A common coalescing conflict in vector code is lane insertion:

  %dst = FOO
  %src = BAR
  %dst:ssub0 = COPY %src

The live range of %src interferes with the ssub0 lane of %dst, but that
lane is never read after %src would have clobbered it. That makes it
safe to merge the live ranges and eliminate the COPY:

  %dst = FOO
  %dst:ssub0 = BAR
This patch teaches the new coalescer to resolve conflicts where dead
vector lanes would be clobbered, at least as long as the clobbered
vector lanes don't escape the basic block.
llvm-svn: 164250
to improve compatibility with GNU as.
Based on a patch by PaX Team.
Fixed assertion failures on non-Darwin and added additional test cases.
llvm-svn: 164248
- Merge the processing of LOAD_ADD with other atomic load-arith
operations
- Separate the logic for getting the target constant for an atomic-load-op,
and add an optimization for atomic-load-add on i16 with a negative value
- Optimize a minor case for atomic-fetch-add i16 with negative operand. Test
case is revised.
llvm-svn: 164243
lib/Target/PowerPC/PPCISelLowering.{h,cpp}:
  Rename LowerFormalArguments_Darwin to LowerFormalArguments_Darwin_Or_64SVR4.
  Rename LowerFormalArguments_SVR4 to LowerFormalArguments_32SVR4.
  Receive small structs right-justified in LowerFormalArguments_Darwin_Or_64SVR4.
  Rename LowerCall_Darwin to LowerCall_Darwin_Or_64SVR4.
  Rename LowerCall_SVR4 to LowerCall_32SVR4.
  Pass small structs right-justified in LowerCall_Darwin_Or_64SVR4.

test/CodeGen/PowerPC/structsinregs.ll:
  New test.
llvm-svn: 164228
two variables where the first variable is returned and the second
ignored.
I don't think this occurs in practice (other passes should have cleaned
up the unused phi node), but it should still be handled correctly.
Also make the logic for determining if we should return early less
sketchy.
llvm-svn: 164225
provide insertion order iteration, instead of the old option of
DenseMap order iteration over keys and insertion order iteration over
values.
This is implemented by keeping two copies of each key.
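Assuming the container in question is llvm::MapVector, a usage sketch:

  #include "llvm/ADT/MapVector.h"
  #include "llvm/Support/raw_ostream.h"
  #include <utility>

  void mapVectorSketch() {
    llvm::MapVector<int, const char *> MV;
    MV.insert(std::make_pair(3, "three"));
    MV.insert(std::make_pair(1, "one"));
    for (auto &KV : MV) // visits 3 then 1: insertion order, not map order
      llvm::errs() << KV.first << " -> " << KV.second << "\n";
  }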
llvm-svn: 164221
This is a follow-up from r163302, which added a transformation to
SimplifyCFG that turns some switches into loads from lookup tables.
It was pointed out that some targets, such as GPUs and deeply embedded
targets, might not find this appropriate, but SimplifyCFG doesn't have
enough information about the target to decide this.
This patch adds the reverse transformation to CodeGenPrep: it turns
loads from lookup tables back into switches for targets where we do not
build jump tables (assuming these are also the targets where lookup
tables are inappropriate).
Hopefully we will eventually get to have target information in
SimplifyCFG, and then this CodeGenPrep transformation can be removed.
llvm-svn: 164206
This was making it hard to scan my builds for new warnings. The
warning still fires with ToT clang. But if my workaround is unnecessary
for whatever reason, feel free to revert.
llvm-svn: 164201
Two deeply nested if's obscured that the sense of the conditions was
mixed up. Amazingly, TableGen's output is exactly the same even with the
sense of the tests fixed; it seems that all of TableGen's conversions
are symmetric so that the inverted sense was nonetheless correct "by
accident". As such, I couldn't come up with a test case.
If there does in fact exist a non-symmetric conversion in TableGen's
type system, then a test case should be prepared.
Despite the symmetry, both if's are left in place for robustness in the
face of future changes.
Review by Jakob.
llvm-svn: 164195
This is a generally useful utility; there's no reason to have it hidden
in CodeGenDAGPatterns.cpp.
Also, rename it to fit the other comparators in Record.h
Review by Jakob.
llvm-svn: 164189
from the dragonegg build bots when we turned on the full version of the
pass. Included a much reduced test case for this pesky bug, despite
bugpoint's uncooperative behavior.
Also, I audited all the similar code I could find and didn't spot any
other cases where this mistake cropped up.
llvm-svn: 164178
working on FCA splitting. Instead of refusing to form a common type when
there are uses of a subsection of the alloca as well as a use of the
entire alloca, just skip the subsection uses and continue looking for
a whole-alloca use with a type that we can use.
This produces slightly prettier IR I think, and also fixes the other
failure in the test.
llvm-svn: 164146
FCAs. This is essential in order to promote allocas that are used in
struct returns by frontends like Clang. The FCA load would block the
rest of the pass from firing, resulting in significant regressions with
the bullet benchmark in the nightly test suite.
Thanks to Duncan for repeated discussions about how best to do this, and
to both him and Benjamin for review.
This appears to have blocked many places where the pass tries to fire,
and so I expect somewhat different results with this fix added.
As with the last big patch, I'm including a change to enable the SROA by
default *temporarily*. Ben is going to remove this as soon as the LNT
bots pick up the patch. I'm just trying to get a round of LNT numbers
from the stable machines in the lab.
NOTE: Four clang tests are expected to fail in the brief window where
this is enabled. Sorry for the noise!
llvm-svn: 164119
Where we used to call ReInitMCSubtargetInfo, we now recompute the same
information as InitMCSubtargetInfo instead of only setting the feature
bits.
llvm-svn: 164105
aligned address. Based on a patch by David Peixotto.
Also use vld1.64 / vst1.64 with 128-bit alignment to take advantage of alignment
hints. rdar://12090772, rdar://12238782
llvm-svn: 164089
Add LIS::pruneValue() and extendToIndices(). These two functions are
used by the register coalescer when merging two live ranges requires
more than a trivial value mapping as supported by LiveInterval::join().
The pruneValue() function can remove the part of a value number that is
going to conflict in join(). Afterwards, extendToIndices can restore the
live range, using any new dominating value numbers and updating the SSA
form.
Use this complex value mapping to support merging a register into a
vector lane that has a conflicting value, but the clobbered lane is
undef.
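A rough sketch of the intended calling sequence (spellings follow
today's API; signatures paraphrased from this description rather than
copied from the header):

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/LiveIntervals.h"

  void joinSketch(llvm::LiveIntervals *LIS, llvm::LiveInterval &LI,
                  llvm::SlotIndex Def) {
    llvm::SmallVector<llvm::SlotIndex, 8> EndPoints;
    LIS->pruneValue(LI, Def, &EndPoints); // cut out the conflicting part
    // ... perform the join / value mapping here ...
    LIS->extendToIndices(LI, EndPoints);  // regrow, updating SSA form
  }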
llvm-svn: 164074
These extra operands are not needed by register allocators using
VirtRegRewriter, and RAFast doesn't need them any longer.
By omitting the <imp-def> operands, it becomes possible for the new
register coalescer to track which lanes are valid and which are undef.
llvm-svn: 164073
Map the CodeGenSchedule object model onto data tables. The structure
of the data tables is defined in MC, so for convenience we include
MCSchedule.h. The alternative is maintaining a redundant copy of the
table structure definitions. Mapping the object model onto data tables
is sufficiently complicated that it should not be interleaved with
emitting source code. This avoids a major problem with the backend for
itinerary generation.
llvm-svn: 164059
Keep GCC's warnings happy. It can't reason out that the state machine won't
ever hit the potentially uninitialized use in OPC_FilterValue.
llvm-svn: 164041
partition use lists a bit. No functionality changed.
These visitors are actually visiting a tuple of a Use and an offset into
the alloca. However, we use the InstVisitor to handle the dispatch over
the users, and so the Use and Offset are stored in class member
variables and set just before each call to visit(). This is fairly
awkward and makes the functions a bit harder to read, but it's the only
real option we have until InstVisitor can be rewritten to use variadic
templates.
However, this pattern shouldn't be followed on the helper member
functions where there is no interface constraint from the visitor. We
were already passing the instruction as a normal parameter rather than
using the Use to get at it; start passing the offset as well. This will
become more important in subsequent patches as the offset will in some
cases change while visiting a single instruction.
llvm-svn: 164003
It had patterns for zext-loading and extending. This commit adds patterns for loading a wide type, performing a bitcast,
and extending. This is an odd pattern, but it is commonly used when writing code with intrinsics.
rdar://11897677
llvm-svn: 163995
The live range of an SSA value forms a sub-tree of the dominator tree.
That means the live ranges of two values overlap if and only if the def
of one value lies within the live range of the other.
This can be used to simplify the interference checking a bit: Visit each
def in the two registers about to be joined. Check for interference
against the value that is live in the other register at the def point
only. It is not necessary to scan the set of overlapping live ranges,
this interference check can be done while computing the value mapping
required for the final live range join.
The new algorithm is prepared to handle more complicated conflict
resolution: we can allow overlapping live ranges with different values
as long as the differing lanes are undef or unused in the other
register.
The implementation in this patch doesn't do that yet, it creates code
that is nearly identical to the old algorithm's, except:
- The new stripCopies() function sees through multiple copies while
the old RegistersDefinedFromSameValue() can only handle one.
- There are a few rare cases where the new algorithm can erase an
IMPLICIT_DEF instruction that RegistersDefinedFromSameValue() couldn't
handle.
llvm-svn: 163991
A value that is live in to a basic block should be returned by valueIn()
in LiveRangeQuery(getMBBStartIdx(MBB)), unless it is a PHI-def which
should be returned by valueDefined() instead.
Current code isn't using this functionality. Future code will.
llvm-svn: 163990
Kill flags are removed more and more aggressively during the register
allocation passes; it is better to get this information from LiveIntervals.
llvm-svn: 163972
If a PHI value happens to be live out from the layout predecessor of its
def block, the def slot index will be in the middle of the segment:

  %vreg11 = [192r,240B:0)[352r,416B:2)[416B,496r:1) 0@192r 1@480B-phi %2@352r
A LiveRangeQuery for 480 should return NULL from valueIn() since the
PHI value is defined at the block entry, not live in to the block.
No test case, future code depends on this functionality.
llvm-svn: 163971
new one, and add support for running the new pass in that mode and in
that slot of the pass manager. With this the new pass can completely
replace the old one within the pipeline.
The strategy for enabling or disabling the SSAUpdater logic is to do it
by making the requirement of the domtree analysis optional. By default,
it is required and we get the standard mem2reg approach. This is usually
the desired strategy when run in stand-alone situations. Within the
CGSCC pass manager, we disable requiring of the domtree analysis and
consequently trigger fallback to the SSAUpdater promotion.
In theory this would allow the pass to re-use a domtree if one happened
to be available even when run in a mode that doesn't require it. In
practice, it lets us have a single pass rather than two which was
simpler for me to wrap my head around.
There is a hidden flag to force the use of the SSAUpdater code path for
the purpose of testing. The primary testing strategy is just to run the
existing tests through that path. One notable difference is that it has
custom code to handle lifetime markers, and one of the tests has been
enhanced to exercise that code.
This has survived a bootstrap and the test suite without serious
correctness issues, however my run of the test suite produced *very*
alarming performance numbers. I don't entirely understand or trust them
though, so more investigation is on-going.
To aid my understanding of the performance impact of the new SROA now
that it runs throughout the optimization pipeline, I'm enabling it by
default in this commit, and will disable it again once the LNT bots have
picked up one iteration with it. I want to get those bots (which are
much more stable) to evaluate the impact of the change before I jump to
any conclusions.
NOTE: Several Clang tests will fail because they run -O3 and check the
result's order of output. They'll go back to passing once I disable it
again.
llvm-svn: 163965
use load/store fragments defined in TargetSelectionDAG.td in place of them.
Unaligned loads/stores are either expanded or lowered to target-specific nodes,
so instruction selection should see only aligned load/store nodes.
No changes in functionality.
llvm-svn: 163960
destination.
Updated the previous implementation to fix a case not covered:

  // PBI: br i1 %x, TrueDest, BB
  // BI:  br i1 %y, TrueDest, FalseDest

The other case was handled correctly:

  // PBI: br i1 %x, BB, FalseDest
  // BI:  br i1 %y, TrueDest, FalseDest

Also tried to use 64-bit arithmetic instead of APInt with scale to
simplify the computation. Let me know if you have other opinions about
this.
llvm-svn: 163954
- The current_pos function is supposed to return all the written bytes, not the
current position of the underlying stream.
- This caused tell() to be broken whenever the underlying stream had buffered
content.
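For context, the invariant being restored is essentially how
llvm::raw_ostream defines tell() (sketch):

  #include "llvm/Support/raw_ostream.h"
  #include <cstdint>

  // tell() is (essentially) current_pos() + GetNumBytesInBuffer(), so
  // current_pos() must count the bytes already written to the underlying
  // stream; returning the underlying stream's own seek position instead
  // breaks tell() whenever data is still buffered.
  uint64_t totalBytesWritten(llvm::raw_ostream &OS) {
    return OS.tell();
  }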
llvm-svn: 163948
This models the A9 processor at the level of instruction operands, as
opposed to the itinerary, which models each operation at the level of
pipeline stages.
The two primary motivations are:
1) Allow MachineScheduler to model A9 as an out-of-order processor. It
can now distinguish between hazards that force interlocking vs.
buffered resources.
2) Reduce long-term maintenance by allowing the itinerary and target
hooks to eventually be removed. Note that almost all of the complexity
in the new model exists to model instruction variants, which the
itinerary cannot handle. Instead the scheduler previously relied on
processor-specific target hooks which are incomplete and buggy.
llvm-svn: 163921
This patch introduces the possibility for the Hexagon MI scheduler to
perform some target-specific post-processing on the scheduling DAG prior
to scheduling.
llvm-svn: 163903
* wrap code blocks in \code ... \endcode;
* refer to parameter names in paragraphs correctly (\arg is not what most
people want -- it starts a new paragraph);
* use \param instead of \arg to document parameters in order to be consistent
with the rest of the codebase.
llvm-svn: 163902
pointless checks in here, bad asserts, and just confusing code. I've
also added a bit more to the comment to clarify what this function is
really trying to do as it was not obvious to Duncan when studying it.
Thanks to Duncan for helping me dig through the issue.
No real functionality changed here in practical cases, and certainly no
test case. This is just cleanup spotted by inspection.
llvm-svn: 163897
inspection by Duncan during review. My suspicion is that we would still
have returned 0 anyways in this case, but doing it sooner is better.
llvm-svn: 163895
deeply suspicious and likely to go away eventually. Also fix a bogus
comment about one of the checks in the vector GEP analysis. Based on
review from Duncan.
llvm-svn: 163894
Originally I had anticipated needing to thread this through more bits of
the SROA pass itself, but that ended up not happening. In the end, this
is a much simpler way to manage the variable.
llvm-svn: 163893
This is essentially a ground up re-think of the SROA pass in LLVM. It
was initially inspired by a few problems with the existing pass:
- It is subject to the bane of my existence in optimizations: arbitrary
thresholds.
- It is overly conservative about which constructs can be split and
promoted.
- The vector value replacement aspect is separated from the splitting
logic, missing many opportunities where splitting and vector value
formation can work together.
- The splitting is entirely based around the underlying type of the
alloca, despite this type often having little to do with the reality
of how that memory is used. This is especially prevalent with unions
and base classes where we tail-pack derived members.
- When splitting fails (often due to the thresholds), the vector value
replacement (again because it is separate) can kick in for
preposterous cases where we simply should have split the value. This
results in forming i1024 and i2048 integer "bit vectors" that
tremendously slow down subsequent IR optimizations (due to large
APInts) and impede the backend's lowering.
The new design takes an approach that fundamentally is not susceptible
to many of these problems. It is the result of a discussion between
myself and Duncan Sands over IRC about how to preemptively avoid these
types of problems and how to do SROA in a more principled way. Since
then, it has evolved and grown, but this remains an important aspect: it
fixes real world problems with the SROA process today.
First, the transform of SROA actually has little to do with replacement.
It has more to do with splitting. The goal is to take an aggregate
alloca and form a composition of scalar allocas which can replace it and
will be most suitable to the eventual replacement by scalar SSA values.
The actual replacement is performed by mem2reg (and in the future
SSAUpdater).
The splitting is divided into four phases. The first phase is an
analysis of the uses of the alloca. This phase recursively walks uses,
building up a dense data structure representing the ranges of the
alloca's memory actually used and checking for uses which inhibit any
aspects of the transform such as the escape of a pointer.
Second, once we have a mapping of the ranges of the alloca used by individual
operations, we compute a partitioning of the used ranges. Some uses are
inherently splittable (such as memcpy and memset), while scalar uses are
not splittable. The goal is to build a partitioning that has the minimum
number of splits while placing each unsplittable use in its own
partition. Overlapping unsplittable uses belong to the same partition.
This is the target split of the aggregate alloca, and it maximizes the
number of scalar accesses which become accesses to their own alloca and
candidates for promotion.
Third, we re-walk the uses of the alloca and assign each specific memory
access to all the partitions touched so that we have dense use-lists for
each partition.
Finally, we build a new, smaller alloca for each partition and rewrite
each use of that partition to use the new alloca. During this phase the
pass will also work very hard to transform uses of an alloca into a form
suitable for promotion, including forming vector operations, speculating
loads through PHI nodes and selects, etc.
After splitting is complete, each newly refined alloca that is
a candidate for promotion to a scalar SSA value is run through mem2reg.
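A hypothetical source-level analogue of what the splitting enables:

  // Before: one aggregate alloca whose pieces never escape.
  struct Pair {
    int a, b;
  };
  int beforeSplit(int x, int y) {
    Pair p; // a single aggregate alloca
    p.a = x;
    p.b = y;
    return p.a + p.b;
  }

  // After splitting (conceptually): one scalar alloca per partition,
  // each of which mem2reg then promotes to an SSA value.
  int afterSplit(int x, int y) {
    int p_a = x; // partition covering Pair::a
    int p_b = y; // partition covering Pair::b
    return p_a + p_b;
  }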
There are lots of reasonably detailed comments in the source code about
the design and algorithms, and I'm going to be trying to improve them in
subsequent commits to ensure this is well documented, as the new pass is
in many ways more complex than the old one.
Some of this is still a WIP, but the current state is reasonably stable.
It has passed bootstrap, the nightly test suite, and Duncan has run it
successfully through the ACATS and DragonEgg test suites. That said, it
remains behind a default-off flag until the last few pieces are in
place, and full testing can be done.
Specific areas I'm looking at next:
- Improved comments and some code cleanup from reviews.
- SSAUpdater and enabling this pass inside the CGSCC pass manager.
- Some data structure tuning and compile-time measurements.
- More aggressive FCA splitting and vector formation.
Many thanks to Duncan Sands for the thorough final review, as well as
Benjamin Kramer for lots of review during the process of writing this
pass, and Daniel Berlin for reviewing the data structures and algorithms
and general theory of the pass. Also, several other people on IRC, over
lunch tables, etc for lots of feedback and advice.
llvm-svn: 163883
This is mostly documentation for the new machine model. It is designed
to be flexible, easy to incrementally refine for a subtarget, and
provide all the information that MachineScheduler will need.
If all goes well, I will follow up with an example of the new model in
use for ARM.
llvm-svn: 163877