patterns.
This is causing Clang to miscompile itself for 32-bit x86 somehow, and likely
also on ARM and PPC. I really don't know how, but reverting now that I've
confirmed this is actually the culprit. I have a reproduction as well and so
should be able to restore this shortly.
This reverts commit r223764.
Original commit log follows:
Teach instcombine to canonicalize "element extraction" from a load of an
integer and "element insertion" into a store of an integer into actual
element extraction, element insertion, and vector loads and stores.
Previously various parts of LLVM (including instcombine itself) would
introduce integer loads and stores into the code as a way of opaquely
loading and storing "bits". In some cases (such as a memcpy of
std::complex<float> object) we will eventually end up using those bits
in non-integer types. In order for SROA to effectively promote the
allocas involved, it splits these "store a bag of bits" integer loads
and stores up into the constituent parts. However, for non-alloca loads
and tsores which remain, it uses integer math to recombine the values
into a large integer to load or store.
All of this would be "fine", except that it forces LLVM to go through
integer math to combine and split up values. While this makes perfect
sense for integers (and in fact is critical for bitfields to end up
lowering efficiently) it is *terrible* for non-integer types, especially
floating point types. We have a much more canonical way of representing
the act of concatenating the bits of two SSA values in LLVM: a vector
and insertelement. This patch teaching InstCombine to use this
representation.
With this patch applied, LLVM will no longer introduce integer math into
the critical path of every loop over std::complex<float> operations such
as those that make up the hot path of ... oh, most HPC code, Eigen, and
any other heavy linear algebra library.
For the record, I looked *extensively* at fixing this in other parts of
the compiler, but it just doesn't work:
- We really do want to canonicalize memcpy and other bit-motion to
integer loads and stores. SSA values are tremendously more powerful
than "copy" intrinsics. Not doing this regresses massive amounts of
LLVM's scalar optimizer.
- We really do need to split up integer loads and stores of this form in
SROA or every memcpy of a trivially copyable struct will prevent SSA
formation of the members of that struct. It essentially turns off
SROA.
- The closest alternative is to actually split the loads and stores when
partitioning with SROA, but this has all of the downsides historically
discussed of splitting up loads and stores -- the wide-store
information is fundamentally lost. We would also see performance
regressions for bitfield-heavy code and other places where the
integers aren't really intended to be split without seemingly
arbitrary logic to treat integers totally differently.
- We *can* effectively fix this in instcombine, so it isn't that hard of
a choice to make IMO.
llvm-svn: 223813
Split `Metadata` away from the `Value` class hierarchy, as part of
PR21532. Assembly and bitcode changes are in the wings, but this is the
bulk of the change for the IR C++ API.
I have a follow-up patch prepared for `clang`. If this breaks other
sub-projects, I apologize in advance :(. Help me compile it on Darwin
I'll try to fix it. FWIW, the errors should be easy to fix, so it may
be simpler to just fix it yourself.
This breaks the build for all metadata-related code that's out-of-tree.
Rest assured the transition is mechanical and the compiler should catch
almost all of the problems.
Here's a quick guide for updating your code:
- `Metadata` is the root of a class hierarchy with three main classes:
`MDNode`, `MDString`, and `ValueAsMetadata`. It is distinct from
the `Value` class hierarchy. It is typeless -- i.e., instances do
*not* have a `Type`.
- `MDNode`'s operands are all `Metadata *` (instead of `Value *`).
- `TrackingVH<MDNode>` and `WeakVH` referring to metadata can be
replaced with `TrackingMDNodeRef` and `TrackingMDRef`, respectively.
If you're referring solely to resolved `MDNode`s -- post graph
construction -- just use `MDNode*`.
- `MDNode` (and the rest of `Metadata`) have only limited support for
`replaceAllUsesWith()`.
As long as an `MDNode` is pointing at a forward declaration -- the
result of `MDNode::getTemporary()` -- it maintains a side map of its
uses and can RAUW itself. Once the forward declarations are fully
resolved RAUW support is dropped on the ground. This means that
uniquing collisions on changing operands cause nodes to become
"distinct". (This already happened fairly commonly, whenever an
operand went to null.)
If you're constructing complex (non self-reference) `MDNode` cycles,
you need to call `MDNode::resolveCycles()` on each node (or on a
top-level node that somehow references all of the nodes). Also,
don't do that. Metadata cycles (and the RAUW machinery needed to
construct them) are expensive.
- An `MDNode` can only refer to a `Constant` through a bridge called
`ConstantAsMetadata` (one of the subclasses of `ValueAsMetadata`).
As a side effect, accessing an operand of an `MDNode` that is known
to be, e.g., `ConstantInt`, takes three steps: first, cast from
`Metadata` to `ConstantAsMetadata`; second, extract the `Constant`;
third, cast down to `ConstantInt`.
The eventual goal is to introduce `MDInt`/`MDFloat`/etc. and have
metadata schema owners transition away from using `Constant`s when
the type isn't important (and they don't care about referring to
`GlobalValue`s).
In the meantime, I've added transitional API to the `mdconst`
namespace that matches semantics with the old code, in order to
avoid adding the error-prone three-step equivalent to every call
site. If your old code was:
MDNode *N = foo();
bar(isa <ConstantInt>(N->getOperand(0)));
baz(cast <ConstantInt>(N->getOperand(1)));
bak(cast_or_null <ConstantInt>(N->getOperand(2)));
bat(dyn_cast <ConstantInt>(N->getOperand(3)));
bay(dyn_cast_or_null<ConstantInt>(N->getOperand(4)));
you can trivially match its semantics with:
MDNode *N = foo();
bar(mdconst::hasa <ConstantInt>(N->getOperand(0)));
baz(mdconst::extract <ConstantInt>(N->getOperand(1)));
bak(mdconst::extract_or_null <ConstantInt>(N->getOperand(2)));
bat(mdconst::dyn_extract <ConstantInt>(N->getOperand(3)));
bay(mdconst::dyn_extract_or_null<ConstantInt>(N->getOperand(4)));
and when you transition your metadata schema to `MDInt`:
MDNode *N = foo();
bar(isa <MDInt>(N->getOperand(0)));
baz(cast <MDInt>(N->getOperand(1)));
bak(cast_or_null <MDInt>(N->getOperand(2)));
bat(dyn_cast <MDInt>(N->getOperand(3)));
bay(dyn_cast_or_null<MDInt>(N->getOperand(4)));
- A `CallInst` -- specifically, intrinsic instructions -- can refer to
metadata through a bridge called `MetadataAsValue`. This is a
subclass of `Value` where `getType()->isMetadataTy()`.
`MetadataAsValue` is the *only* class that can legally refer to a
`LocalAsMetadata`, which is a bridged form of non-`Constant` values
like `Argument` and `Instruction`. It can also refer to any other
`Metadata` subclass.
(I'll break all your testcases in a follow-up commit, when I propagate
this change to assembly.)
llvm-svn: 223802
replaceDbgDeclareForAlloca() replaces an alloca by a value storing the
address of what was the alloca. If there is a dbg.declare corresponding
to that alloca, we need to lower it to a dbg.value describing the additional
dereference operation to be performed to get to the underlying variable.
This is done by adding a DW_OP_deref to the complex location part of the
location description. This deref was added to the end of the operation list,
which is wrong. The expression applies to what is described by the
dbg.{declare,value}, and as we are changing this, we need to apply the
DW_OP_deref as the first operation in the list.
Part of the fix for rdar://19162268.
llvm-svn: 223799
integer and "element insertion" into a store of an integer into actual
element extraction, element insertion, and vector loads and stores.
Previously various parts of LLVM (including instcombine itself) would
introduce integer loads and stores into the code as a way of opaquely
loading and storing "bits". In some cases (such as a memcpy of
std::complex<float> object) we will eventually end up using those bits
in non-integer types. In order for SROA to effectively promote the
allocas involved, it splits these "store a bag of bits" integer loads
and stores up into the constituent parts. However, for non-alloca loads
and tsores which remain, it uses integer math to recombine the values
into a large integer to load or store.
All of this would be "fine", except that it forces LLVM to go through
integer math to combine and split up values. While this makes perfect
sense for integers (and in fact is critical for bitfields to end up
lowering efficiently) it is *terrible* for non-integer types, especially
floating point types. We have a much more canonical way of representing
the act of concatenating the bits of two SSA values in LLVM: a vector
and insertelement. This patch teaching InstCombine to use this
representation.
With this patch applied, LLVM will no longer introduce integer math into
the critical path of every loop over std::complex<float> operations such
as those that make up the hot path of ... oh, most HPC code, Eigen, and
any other heavy linear algebra library.
For the record, I looked *extensively* at fixing this in other parts of
the compiler, but it just doesn't work:
- We really do want to canonicalize memcpy and other bit-motion to
integer loads and stores. SSA values are tremendously more powerful
than "copy" intrinsics. Not doing this regresses massive amounts of
LLVM's scalar optimizer.
- We really do need to split up integer loads and stores of this form in
SROA or every memcpy of a trivially copyable struct will prevent SSA
formation of the members of that struct. It essentially turns off
SROA.
- The closest alternative is to actually split the loads and stores when
partitioning with SROA, but this has all of the downsides historically
discussed of splitting up loads and stores -- the wide-store
information is fundamentally lost. We would also see performance
regressions for bitfield-heavy code and other places where the
integers aren't really intended to be split without seemingly
arbitrary logic to treat integers totally differently.
- We *can* effectively fix this in instcombine, so it isn't that hard of
a choice to make IMO.
Differential Revision: http://reviews.llvm.org/D6548
llvm-svn: 223764
Introduce the ``llvm.instrprof_increment`` intrinsic and the
``-instrprof`` pass. These provide the infrastructure for writing
counters for profiling, as in clang's ``-fprofile-instr-generate``.
The implementation of the instrprof pass is ported directly out of the
CodeGenPGO classes in clang, and with the followup in clang that rips
that code out to use these new intrinsics this ends up being NFC.
Doing the instrumentation this way opens some doors in terms of
improving the counter performance. For example, this will make it
simple to experiment with alternate lowering strategies, and allows us
to try handling profiling specially in some optimizations if we want
to.
Finally, this drastically simplifies the frontend and puts all of the
lowering logic in one place.
llvm-svn: 223672
Do not realign origin address if the corresponding application
address is at least 4-byte-aligned.
Saves 2.5% code size in track-origins mode.
llvm-svn: 223464
Added instcombine optimizations for BSWAP with AND/OR/XOR ops:
OP( BSWAP(x), BSWAP(y) ) -> BSWAP( OP(x, y) )
OP( BSWAP(x), CONSTANT ) -> BSWAP( OP(x, BSWAP(CONSTANT) ) )
Since its just a one liner, I've also added BSWAP to the DAGCombiner equivalent as well:
fold (OP (bswap x), (bswap y)) -> (bswap (OP x, y))
Refactored bswap-fold tests to use FileCheck instead of just checking that the bswaps had gone.
Differential Revision: http://reviews.llvm.org/D6407
llvm-svn: 223349
This allows cases like float x; fmin(1.0, x); to be optimized to fminf(1.0f, x);
rdar://19049359
Differential Revision: http://reviews.llvm.org/D6496
llvm-svn: 223270
This change makes MemorySanitizer instrumentation a bit more strict
about instructions that have no origin id assigned to them.
This would have caught the bug that was fixed in r222918.
This is re-commit of r222997, reverted in r223211, with 3 more
missing origins added.
llvm-svn: 223236
Try to convert two compares of a signed range check into a single unsigned compare.
Examples:
(icmp sge x, 0) & (icmp slt x, n) --> icmp ult x, n
(icmp slt x, 0) | (icmp sgt x, n) --> icmp ugt x, n
llvm-svn: 223224
Remove an unnecessary `MDNode::replaceAllUsesWith()`. In the preceding
line, `TheLoop->setLoopID()` visits all backedges and sets the new loop
ID. This sufficiently updates the loop metadata.
Metadata RAUW is going away as part of PR21532.
llvm-svn: 223210
We were assuming that each back-edge in a region represented a unique
loop, which is not always the case. We need to use LoopInfo to
correctly determine which back-edges are loops.
llvm-svn: 223199
Patch by Ben Gamari!
This redefines the `prefix` attribute introduced previously and
introduces a `prologue` attribute. There are a two primary usecases
that these attributes aim to serve,
1. Function prologue sigils
2. Function hot-patching: Enable the user to insert `nop` operations
at the beginning of the function which can later be safely replaced
with a call to some instrumentation facility
3. Runtime metadata: Allow a compiler to insert data for use by the
runtime during execution. GHC is one example of a compiler that
needs this functionality for its tables-next-to-code functionality.
Previously `prefix` served cases (1) and (2) quite well by allowing the user
to introduce arbitrary data at the entrypoint but before the function
body. Case (3), however, was poorly handled by this approach as it
required that prefix data was valid executable code.
Here we redefine the notion of prefix data to instead be data which
occurs immediately before the function entrypoint (i.e. the symbol
address). Since prefix data now occurs before the function entrypoint,
there is no need for the data to be valid code.
The previous notion of prefix data now goes under the name "prologue
data" to emphasize its duality with the function epilogue.
The intention here is to handle cases (1) and (2) with prologue data and
case (3) with prefix data.
References
----------
This idea arose out of discussions[1] with Reid Kleckner in response to a
proposal to introduce the notion of symbol offsets to enable handling of
case (3).
[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-May/073235.html
Test Plan: testsuite
Differential Revision: http://reviews.llvm.org/D6454
llvm-svn: 223189
This is the third patch in a small series. It contains the CodeGen support for lowering the gc.statepoint intrinsic sequences (223078) to the STATEPOINT pseudo machine instruction (223085). The change also includes the set of helper routines and classes for working with gc.statepoints, gc.relocates, and gc.results since the lowering code uses them.
With this change, gc.statepoints should be functionally complete. The documentation will follow in the fourth change, and there will likely be some cleanup changes, but interested parties can start experimenting now.
I'm not particularly happy with the amount of code or complexity involved with the lowering step, but at least it's fairly well isolated. The statepoint lowering code is split into it's own files and anyone not working on the statepoint support itself should be able to ignore it.
During the lowering process, we currently spill aggressively to stack. This is not entirely ideal (and we have plans to do better), but it's functional, relatively straight forward, and matches closely the implementations of the patchpoint intrinsics. Most of the complexity comes from trying to keep relocated copies of values in the same stack slots across statepoints. Doing so avoids the insertion of pointless load and store instructions to reshuffle the stack. The current implementation isn't as effective as I'd like, but it is functional and 'good enough' for many common use cases.
In the long term, I'd like to figure out how to integrate the statepoint lowering with the register allocator. In principal, we shouldn't need to eagerly spill at all. The register allocator should do any spilling required and the statepoint should simply record that fact. Depending on how challenging that turns out to be, we may invest in a smarter global stack slot assignment mechanism as a stop gap measure.
Reviewed by: atrick, ributzka
llvm-svn: 223137
Follow up from r222926. Also handle multiple destinations from merged
cases on multiple and subsequent phi instructions.
rdar://problem/19106978
llvm-svn: 223135
Load instructions are inserted into loop preheaders when sinking stores
and later removed if not used by the SSA updater. Avoid sinking if the
loop has no preheader and avoid crashes. This fixes one more side effect
of not handling indirectbr instructions properly on LoopSimplify.
llvm-svn: 223119
An unreachable default destination can be exploited by other optimizations, and
SDag lowering is now prepared to handle them efficiently.
For example, branches to the unreachable destination will be optimized away,
such as in the case of range checks for switch lookup tables.
On 64-bit Linux, this reduces the size of a clang bootstrap by 80 kB (and
Chromium by 30 kB).
llvm-svn: 223050
This change makes MemorySanitizer instrumentation a bit more strict
about instructions that have no origin id assigned to them.
This would have caught the bug that was fixed in r222918.
No functional change.
llvm-svn: 222997
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot. I'll respond to the commit on the
list with a reproduction of one of the failures.
Conflicts:
lib/Target/X86/X86TargetTransformInfo.cpp
llvm-svn: 222936
We may be in a situation where the icmps might not be near each other in
a tree of or instructions. Try to dig out related compare instructions
and see if they combine.
N.B. This won't fire on deep trees of compares because rewritting the
tree might end up creating a net increase of IR. We may have to resort
to something more sophisticated if this is a real problem.
llvm-svn: 222928
Loop simplify skips exit-block insertion when exits contain indirectbr
instructions. This leads to an assertion in LICM when trying to sink
stores out of non-dedicated loop exits containing indirectbr
instructions. This patch fix this issue by re-checking for dedicated
exits in LICM prior to store sink attempts.
Differential Revision: http://reviews.llvm.org/D6414
rdar://problem/18943047
llvm-svn: 222927
Switch cases statements with sequential values that branch to the same
destination BB may often be handled together in a single new source BB.
In this scenario we need to remove remaining incoming values from PHI
instructions in the destination BB, as to match the number of source
branches.
Differential Revision: http://reviews.llvm.org/D6415
rdar://problem/19040894
llvm-svn: 222926
MSan does not assign origin for instrumentation temps (i.e. the ones that do
not come from the application code), but "select" instrumentation erroneously
tried to use one of those.
https://code.google.com/p/memory-sanitizer/issues/detail?id=78
llvm-svn: 222918
Fixed missing dominance check.
Original commit message:
This optimization tries to reuse the generated compare instruction, if there is a comparison against the default value after the switch.
Example:
if (idx < tablesize)
r = table[idx]; // table does not contain default_value
else
r = default_value;
if (r != default_value)
...
Is optimized to:
cond = idx < tablesize;
if (cond)
r = table[idx];
else
r = default_value;
if (cond)
...
Jump threading will then eliminate the second if(cond).
llvm-svn: 222891