Among other stuff, this allows to use predefined .option.machine_version_major
/minor/stepping symbols in the directive.
Relevant test expanded at once (also file renamed for clarity).
Differential Revision: https://reviews.llvm.org/D28140
llvm-svn: 290710
This change adds a new intrinsic which is intended to provide memcpy functionality
with additional atomicity guarantees. Please refer to the review thread
or language reference for further details.
Differential Revision: https://reviews.llvm.org/D27133
llvm-svn: 290708
We bypassed the intrinsic and returned the passthru operand, but we should also add the intrinsic to the worklist since its now dead. This can allow DCE to find it sooner and remove it. Similar was done for InsertElement when the inserted element isn't demanded.
llvm-svn: 290704
The accidentally had trivially dead code. Also needed to adjust the rounding mode to not CUR_DIRECTION so the intrinsics don't get converted to native operations before going through SimplifyDemandedVectorElts.
llvm-svn: 290702
Apparently GCC targeting Windows breaks bitfields on static data members:
struct Foo {
unsigned X : 16;
static const int M = 42;
unsigned Y : 16;
};
static_assert(sizeof(Foo) == 4, "asdf"); // fails
Who knew.
llvm-svn: 290700
Summary:
The optimal iteration order for this problem is RPO order. We want to
process as many preds of a backedge as we can before we process the
backedge.
At the same time, as we add predicate handling, we want to be able to
touch instructions that are dominated by a given block by
ranges (because a change in value numbering a predicate possibly
affects all users we dominate that are using that predicate).
If we don't do it this way, we can't do value inference over
backedges (the paper covers this in depth).
The newgvn branch currently overshoots the last part, and guarantees
that it will touch *at least* the right set of instructions, but it
does touch more. This is because the bitvector instruction ranges are
currently generated in RPO order (so we take the max and the min of
the ranges of dominated blocks, which means there are some in the
middle we didn't have to touch that we did).
We can do better by sorting the dominator tree, and then just using
dominator tree order.
As a preliminary, the dominator tree has some RPO guarantees, but not
enough. It guarantees that for a given node, your idom must come
before you in the RPO ordering. It guarantees no relative RPO ordering
for siblings. We add siblings in whatever order they appear in the module.
So that is what we fix.
We sort the children array of the domtree into RPO order, and then use
the dominator tree for ordering, instead of RPO, since the dominator
tree is now a valid RPO ordering.
Note: This would help any other pass that iterates a forward problem
in dominator tree order. Most of them are single pass. It will still
maximize whatever result they compute. We could also build the
dominator tree in this order, but our incremental updates would still
put it out of sort order, and recomputing the sort order is almost as
hard as general incremental updates of the domtree.
Also note that the sorting does not affect any tests, etc. Nothing
depends on domtree order, including the verifier, the equals
functions for domtree nodes, etc.
How much could this matter, you ask?
Here are the current numbers.
This is generated by running NewGVN over all files in LLVM.
Note that once we propagate equalities, the differences go up by an
order of magnitude or two (IE instead of 29, the max ends up in the
thousands, since the worst case we add a factor of N, where N is the
number of branch predicates). So while it doesn't look that stark for
the default ordering, it gets *much much* worse. There are also
programs in the wild where the difference is already pretty stark
(2 iterations vs hundreds).
RPO ordering:
759040 Number of iterations is 1
112908 Number of iterations is 2
Default dominator tree ordering:
755081 Number of iterations is 1
116234 Number of iterations is 2
603 Number of iterations is 3
27 Number of iterations is 4
2 Number of iterations is 5
1 Number of iterations is 7
Dominator tree sorted:
759040 Number of iterations is 1
112908 Number of iterations is 2
<yay!>
Really bad ordering (sort domtree siblings in postorder. not quite the
worst possible, but yeah):
754008 Number of iterations is 1
21 Number of iterations is 10
8 Number of iterations is 11
6 Number of iterations is 12
5 Number of iterations is 13
2 Number of iterations is 14
2 Number of iterations is 15
3 Number of iterations is 16
1 Number of iterations is 17
2 Number of iterations is 18
96642 Number of iterations is 2
1 Number of iterations is 20
2 Number of iterations is 21
1 Number of iterations is 22
1 Number of iterations is 29
17266 Number of iterations is 3
2598 Number of iterations is 4
798 Number of iterations is 5
273 Number of iterations is 6
186 Number of iterations is 7
80 Number of iterations is 8
42 Number of iterations is 9
Reviewers: chandlerc, davide
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28129
llvm-svn: 290699
I added one for Value back in r262045, and I'm starting to think we
should have these for any class with bitfields whose memory efficiency
really matters.
llvm-svn: 290698
Summary:
Follow-up to r290691, where I introduced HasLLVMReservedName. rnk
pointed out that that patch added an extra word to GlobalValue on MSVC,
because it doesn't pack bitfields with different types.
This patch moves HasLLVMReservedName into the existing bitfield, where
we appear to have plenty of bits to spare.
Reviewers: rnk
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28149
llvm-svn: 290696
Summary:
We were already using 32-bit jump table entries, but this was a
consequence of the default PIC model on Win64, and not an intentional
design decision. This patch ensures that we always use 32-bit label
difference jump table entries on Win64 regardless of the PIC model. This
is a good idea because it saves executable size and object file size.
Moving the jump tables to .rdata cleans up the disassembled object code
and reduces the available ROP targets, but it requires adding one more
RIP-relative lea to the code. COFF doesn't have relocations to express
the difference between two arbitrary symbols, so we can't use the jump
table label in the label difference like we do elsewhere.
Fixes PR31488
Reviewers: majnemer, compnerd
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28141
llvm-svn: 290694
The Bitstream reader and writer are limited to handle a "size_t" at
most, which means that we can't backpatch and read back a 64bits
value on 32 bits platform.
llvm-svn: 290693
Summary:
Previously isIntrinsic() called getName(). This involves a hashtable
lookup, so is nontrivially expensive. And isIntrinsic() is called
frequently, particularly by dyn_cast<IntrinsicInstr>.
This patch steals a bit of IntID and uses that to store whether or not
getName() starts with "llvm."
Reviewers: bogner, arsenm, joker-eph
Subscribers: sanjoy, llvm-commits
Differential Revision: https://reviews.llvm.org/D22949
llvm-svn: 290691
This index record the position for each metadata record in
the bitcode, so that the reader will be able to lazy-load
on demand each individual record.
We also make sure that every abbrev is emitted upfront so
that the block can be skipped while reading.
I don't plan to commit this before having the reader
counterpart, but I figured this can be reviewed mostly
independently.
Recommit r290684 (was reverted in r290686 because a test
was broken) after adding a threshold to avoid emitting
the index when unnecessary (little amount of metadata).
This optimization "hides" a limitation of the ability
to backpatch in the bitstream: we can only backpatch
safely when the position has been flushed. So if we emit
an index for one metadata, it is possible that (part of)
the offset placeholder hasn't been flushed and the backpatch
will fail.
Differential Revision: https://reviews.llvm.org/D28083
llvm-svn: 290690
Summary:
Make kLargeMalloc big enough to be handled by secondary allocator
and small enough to fit into quarantine for all configurations.
It become too big to fit into quarantine on Android after D27873.
Reviewers: eugenis
Patch by Alex Shlyapnikov.
Subscribers: danalbert, llvm-commits, kubabrecka
Differential Revision: https://reviews.llvm.org/D28142
llvm-svn: 290689
emplace_back is not faster if it is equivalent to push_back. In this cases emplaced value had the
same type that the one stored in container. It is ugly and it might be even slower (see
Scott Meyers presentation about emplacement).
llvm-svn: 290685
Summary:
This index record the position for each metadata record in
the bitcode, so that the reader will be able to lazy-load
on demand each individual record.
We also make sure that every abbrev is emitted upfront so
that the block can be skipped while reading.
I don't plan to commit this before having the reader
counterpart, but I figured this can be reviewed mostly
independently.
Reviewers: pcc, tejohnson
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28083
llvm-svn: 290684
Jump table emission can switch to .rdata before
WinException::endFunction gets called. Just remember the appropriate
text section we started in and reset back to it when we end the
function. We were already switching sections back from .xdata anyway.
Fixes the first problem in PR31488, so that now COFF switch tables can
live in .rdata if we want them to.
llvm-svn: 290678
Summary:
We can simply import all external values with summaries included in
the individual index file created for the distributed backend job,
as only those are added to the individual index file created by the
WriteIndexesThinBackend (in addition to summaries for the original
module, which are skipped here).
While computing the cross module imports on this index would come to
the same conclusion as the original thin link import logic, it is
unnecessary work. And when tuning, it avoids the need to pass the
same function importing parameters (e.g. -import-instr-limit) to
both the thin link and the backends (otherwise they won't make the
same decisions).
Reviewers: mehdi_amini, pcc
Subscribers: cfe-commits
Differential Revision: https://reviews.llvm.org/D28139
llvm-svn: 290674
This is an orthogonal and separated layer instead of being embedded
inside the pass manager. While it adds a small amount of complexity, it
is fairly minimal and the composability and control seems worth the
cost.
The logic for this ends up being nicely isolated and targeted. It should
be easy to experiment with different iteration strategies wrapped around
the CGSCC bottom-up walk using this kind of facility.
The mechanism used to track devirtualization is the simplest one I came
up with. I think it handles most of the cases the existing iteration
machinery handles, but I haven't done a *very* in depth analysis. It
does however match the basic intended semantics, and we can tweak or
tune its exact behavior incrementally as necessary. One thing that we
may want to revisit is freshly building the value handle set on each
iteration. While I don't think this will be a significant cost (it is
strictly fewer value handles but more churn of value handes than the old
call graph), it is conceivable that we'll want a somewhat more clever
tracking mechanism. My hope is to layer that on as a follow up patch
with data supporting any implementation complexity it adds.
This code also provides for a basic count heuristic: if the number of
indirect calls decreases and the number of direct calls increases for
a given function in the SCC, we assume devirtualization is responsible.
This matches the heuristics currently used in the legacy pass manager.
Differential Revision: https://reviews.llvm.org/D23114
llvm-svn: 290665
analyses when we're about to break apart an SCC.
We can't wait until after breaking apart the SCC to invalidate things:
1) Which SCC do we then invalidate? All of them?
2) Even if we invalidate all of them, a newly created SCC may not have
a proxy that will convey the invalidation to functions!
Previously we only invalidated one of the SCCs and too late. This led to
stale analyses remaining in the cache. And because the caching strategy
actually works, they would get used and chaos would ensue.
Doing invalidation early is somewhat pessimizing though if we *know*
that the SCC structure won't change. So it turns out that the design to
make the mutation API force the caller to know the *kind* of mutation in
advance was indeed 100% correct and we didn't do enough of it. So this
change also splits two cases of switching a call edge to a ref edge into
two separate APIs so that callers can clearly test for this and take the
easy path without invalidating when appropriate. This is particularly
important in this case as we expect most inlines to be between functions
in separate SCCs and so the common case is that we don't have to so
aggressively invalidate analyses.
The LCG API change in turn needed some basic cleanups and better testing
in its unittest. No interesting functionality changed there other than
more coverage of the returned sequence of SCCs.
While this seems like an obvious improvement over the current state, I'd
like to revisit the core concept of invalidating within the CG-update
layer at all. I'm wondering if we would be better served forcing the
callers to handle the invalidation beforehand in the cases that they
can handle it. An interesting example is when we want to teach the
inliner to *update and preserve* analyses. But we can cross that bridge
when we get there.
With this patch, the new pass manager an build all of the LLVM test
suite at -O3 and everything passes. =D I haven't bootstrapped yet and
I'm sure there are still plenty of bugs, but this gives a nice baseline
so I'm going to increasingly focus on fleshing out the missing
functionality, especially the bits that are just turned off right now in
order to let us establish this baseline.
llvm-svn: 290664
There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled.
Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky
Differential Revision: https://reviews.llvm.org/D27901
llvm-svn: 290663
In C++03 libc++ emulates nullptr_t using a class, and #define's nullptr.
However this makes nullptr_t mangle differently between C++03 and C++11.
This breaks any function ABI which takes nullptr_t.
Thanfully Clang provides __nullptr in all dialects. This patch adds
an ABI option to switch to using __nullptr in C++03. In a perfect world
I would like to turn this on by default, since it's just ABI breaking fix
to an ABI breaking bug.
llvm-svn: 290662