This helps in cases involving bitfields where an AND is exposed by
legalization.
Differential Revision: https://reviews.llvm.org/D30472
llvm-svn: 297249
We cannot leave the identity copies 'select true, arg, undef' that this pass
inserts for arguments (to simplify handling of values) on swifterror arguments,
because swifterror arguments have restrictions on their uses.
rdar://30839288
llvm-svn: 297197
The original patch r296865 was reverted as it broke the Chromium builds for
Android (https://bugs.llvm.org/show_bug.cgi?id=32134). This patch reapplies
r296865 with a fix to make sure it doesn't cause the build regression.
The problem was that intrinsic selection on int_arm_get_fpscr was failing in
ISel. This was because the code to manually select this intrinsic still treated
it as the version with no side-effects (INTRINSIC_WO_CHAIN), which is wrong, as
it doesn't semantically match the definition in the TableGen code, which says
it does have side-effects. I've fixed this by updating the intrinsic type to
INTRINSIC_W_CHAIN (has side-effects). I've also added a test for this based on
Hans' original reproducer.
Differential Revision: https://reviews.llvm.org/D30645
llvm-svn: 297137
Summary: Previously, it had always been materialized as a push/pop sequence.
Reviewers: labrinea, jroelofs
Reviewed By: jroelofs
Subscribers: llvm-commits, rengolin
Differential Revision: https://reviews.llvm.org/D30648
llvm-svn: 297134
A bit more painful than G_INSERT because it was more widely used, but this
should simplify the handling of extract operations in most locations.
llvm-svn: 297100
The intrinsics __builtin_arm_get_fpscr and __builtin_arm_set_fpscr read and
write the FPSCR (Floating-Point Status and Control Register).
A bug exists in the __builtin_arm_get_fpscr intrinsic definition in LLVM, which
marks this intrinsic as IntrNoMem, meaning it is not a memory access and
doesn't have any other side-effects. Having this property on the intrinsic
means that various optimizations can be performed on it, such as common
sub-expression elimination with other reads. This can cause issues if there has
been a write to this register, e.g.
void foo(int *p) {
p[0] = __builtin_arm_get_fpscr();
__builtin_arm_set_fpscr(1);
p[1] = __builtin_arm_get_fpscr();
}
In the above example, the second read is currently CSE'd into the first read.
This is because LLVM isn't aware that the write done by __builtin_arm_set_fpscr
affects the same register that __builtin_arm_get_fpscr reads from. To fix this
problem I've removed the IntrNoMem property so that __builtin_arm_get_fpscr is
treated as a memory access.
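At the IR level the same problem looks roughly like this (a hedged sketch using
the corresponding LLVM intrinsics; not taken from the patch's tests):
declare i32 @llvm.arm.get.fpscr()
declare void @llvm.arm.set.fpscr(i32)
define void @foo(i32* %p) {
  %v1 = call i32 @llvm.arm.get.fpscr()
  store i32 %v1, i32* %p
  call void @llvm.arm.set.fpscr(i32 1)
  ; With IntrNoMem this second read could be CSE'd with %v1; without it, the
  ; intervening set.fpscr call is treated as a clobber.
  %v2 = call i32 @llvm.arm.get.fpscr()
  %p1 = getelementptr i32, i32* %p, i64 1
  store i32 %v2, i32* %p1
  ret void
}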
Differential Revision: https://reviews.llvm.org/D30542
llvm-svn: 296865
This patch causes compile times for some patterns to explode. I have
a (large, unreduced) test case that slows down by more than 20x and
several test cases slow down by 2x. I'm sending some of the test cases
directly to Nirav and following up with more details in the review log,
but this should unblock anyone else hitting this.
llvm-svn: 296862
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation
as LastOp.
This patch fixes some cases where we would move stores to the wrong
insert point.
Re-commit with a fix to increment NumMove in the right place.
Differential Revision: https://reviews.llvm.org/D30124
llvm-svn: 296815
This bug was introduced with:
https://reviews.llvm.org/rL296699
There may be a way to loosen the restriction, but for now just bail out
on any opaque constant.
The tests show that opacity is target-specific. This goes back to cost
calculations in ConstantHoisting based on TTI->getIntImmCost().
llvm-svn: 296768
Original commit message:
[ARM] Fix insert point for store rescheduling.
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation as
LastOp.
This patch fixes some cases where we would sink stores for no reason.
llvm-svn: 296718
In ARMPreAllocLoadStoreOpt::RescheduleOps, LastOp should be the last
operation which we want to merge. If we break out of the loop because
an operation has the wrong offset, we shouldn't use that operation as
LastOp.
This patch fixes some cases where we would sink stores for no reason.
Differential Revision: https://reviews.llvm.org/D30124
llvm-svn: 296708
This code starts from the high end of the sorted vector of offsets, and
works backwards: it tries to find contiguous offsets, process them, then
pops them from the end of the vector. Most of the code agrees with this
order of processing, but one loop doesn't: it instead processes elements
from the low end of the vector (which are nodes with unrelated offsets).
Fix that loop to process the correct elements.
This has a few implications. One, we don't incorrectly return early when
processing multiple groups of offsets in the same block (which allows
rescheduling prera-ldst-insertpt.mir). Two, we pick the correct insert
point for loads, so they're correctly sorted (which affects the
scheduling of vldm-liveness.ll). I think it might also impact some of
the heuristics slightly.
Differential Revision: https://reviews.llvm.org/D30368
llvm-svn: 296701
This is part of the ongoing attempt to improve select codegen for all targets and select
canonicalization in IR (see D24480 for more background). The transform is a subset of what
is done in InstCombine's FoldOpIntoSelect().
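As a rough illustration (a minimal hedged IR sketch, not from the patch itself;
the transform added here actually runs on the DAG), a binop whose operand is a
select of constants can be folded into the select:
define i32 @shl_sel(i1 %b) {
  %sel = select i1 %b, i32 2, i32 3
  %r = shl i32 1, %sel    ; can become: select i1 %b, i32 4, i32 8
  ret i32 %r
}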
I first noticed a regression in the x86 avx512-insert-extract.ll tests with a patch that
hopes to convert more selects to basic math ops. This appears to be a general missing DAG
transform though, so I added tests for all standard binops in rL296621
(PowerPC was chosen semi-randomly; it has scripted FileCheck support, but so do ARM and x86).
The poor output for "sel_constants_shl_constant" is tracked with:
https://bugs.llvm.org/show_bug.cgi?id=32105
Differential Revision: https://reviews.llvm.org/D30502
llvm-svn: 296699
Summary:
Avoids tons of prologue boilerplate when arguments are passed in memory
and left in memory. This can happen in a debug build or in a release
build when an argument alloca is escaped. This will dramatically affect
the code size of x86 debug builds, because X86 fast isel doesn't handle
arguments passed in memory at all. It only handles the x86_64 case of up
to 6 basic register parameters.
This is implemented by analyzing the entry block before ISel to identify
copy elision candidates. A copy elision candidate is an argument that is
used to fully initialize an alloca before any other possibly escaping
uses of that alloca. If an argument is a copy elision candidate, we set
a flag on the InputArg. If the target generates loads from a fixed
stack object that matches the size and alignment requirements of the
alloca, the SelectionDAG builder will delete the stack object created
for the alloca and replace it with the fixed stack object. The load is
left behind to satisfy any remaining uses of the argument value. The
store is now dead and is therefore elided. The fixed stack object is
also marked as mutable, as it may now be modified by the user, and it
would be invalid to rematerialize the initial load from it.
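A minimal IR sketch of a copy elision candidate (hypothetical names; assuming
%x is passed in memory on the target): the argument's only use before any
escape is a store that fully initializes the alloca, so the alloca can be
replaced by the fixed argument stack object and the store elided.
declare void @escape(i32*)
define void @f(i32 %x) {
entry:
  %x.addr = alloca i32, align 4
  store i32 %x, i32* %x.addr, align 4   ; fully initializes the alloca before any escape
  call void @escape(i32* %x.addr)       ; later, possibly escaping use
  ret void
}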
Supersedes D28388
Fixes PR26328
Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans
Subscribers: igorb, llvm-commits
Differential Revision: https://reviews.llvm.org/D29668
llvm-svn: 296683
Resubmit r295336 after the bug with non-zero offset patterns on BE targets is fixed (r296336).
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patterns.
Reviewed By: filcab
Differential Revision: https://reviews.llvm.org/D29591
llvm-svn: 296651
Lower i1, i8 and i16 call parameters by extending them before storing them on
the stack. Also make sure we encode the correct, extended size in the
corresponding memory operand, and that we compute the correct stack size in the
end.
The latter is a bit more complicated because we used to compute the stack size
in the getStackAddress method, based on the Size and Offset of the parameters.
However, if the last parameter is sign extended, we'd be using the wrong,
non-extended size, and we'd end up with a smaller stack than we need to hold the
extended value. Instead of hacking this up based on the value of Size in
getStackAddress, we move our stack size handling logic to assignArg, where we
have access to the CCState which knows everything we could possibly want to know
about the stack. This way we don't need to duplicate any knowledge or resort to
any ugly hacks.
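As a rough illustration of the kind of call affected (a hedged sketch, not one
of the patch's tests): with AAPCS the first four i32 arguments below occupy
r0-r3, so the i8 and i16 parameters are passed on the stack and must be
extended before being stored there.
declare void @callee(i32, i32, i32, i32, i8 signext, i16 zeroext)
define void @caller() {
  call void @callee(i32 0, i32 1, i32 2, i32 3, i8 -1, i16 7)
  ret void
}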
On this same occasion, update the IRTranslator test to check the sizes of the
stores everywhere, not just for sign extended parameters.
llvm-svn: 296631
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner,
as it separates non-interfering loads/stores from the store-merging
logic.
When merging stores, the search goes up the chain through a single load, and
finds all possible stores by looking down from a load and a TokenFactor to
all stores visited.
This improves the quality of the output SelectionDAG and the output
codegen (save perhaps for some ARM cases where we correctly construct
wider loads but then promote them to float operations, which requires
more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyFromReg
   nodes, as these are captured by data dependence
6. Forward load-store values through TokenFactors containing
   {CopyToReg,CopyFromReg} values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as in
   some contexts invalid 64-bit operations are being generated. This
   can be removed once appropriate checks are added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
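For illustration (a minimal hedged IR sketch, not one of the affected tests),
the kind of adjacent stores the less conservative candidate search can now
recognize and merge:
define void @merge_pair(i32* %p) {
  %p1 = getelementptr i32, i32* %p, i64 1
  store i32 0, i32* %p
  store i32 0, i32* %p1    ; consecutive with the previous store; the pair can merge into one wider store
  ret void
}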
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296476
Lower i32, float and double parameters that need to live on the stack. This
boils down to creating some G_GEPs starting from the stack pointer and storing
the values there. During the process we also keep track of the stack size and
use the final value in the ADJCALLSTACKDOWN/UP instructions.
We currently assert for smaller types, since they usually require extensions.
They will be handled in a separate patch.
llvm-svn: 296473
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
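A rough IR sketch of the idea (hypothetical labels; hedged, not the pass's
exact output): the block reached by the indirectbr keeps the indirectbr as its
only predecessor, while the direct branches are retargeted to the split-off
body.
define i32 @f(i8* %addr, i1 %c) {
entry:
  br i1 %c, label %direct, label %indirect
direct:
  br label %dest.body            ; direct branch retargeted to the split-off body
indirect:
  indirectbr i8* %addr, [label %dest]
dest:                            ; the indirectbr is now its only predecessor, so
  br label %dest.body            ; code can be sunk here without splitting the edge
dest.body:
  ret i32 0
}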
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed
goto, we needlessly spill about 100 constants to the stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296416
The transform in question claims to be doing:
// fold (add (select cc, 0, c), x) -> (select cc, x, (add, x, c))
...starting in PerformADDCombineWithOperands(), but it wasn't actually checking for a setcc node
for the sext/zext patterns.
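For reference, a rough IR analogue of that DAG pattern (a hedged sketch, not
one of the tests); the combine should only fire when the select's condition
comes from a setcc:
define i32 @add_sel(i32 %a, i32 %b, i32 %x) {
  %cc  = icmp eq i32 %a, %b          ; the setcc the combine is supposed to require
  %sel = select i1 %cc, i32 0, i32 7
  %r   = add i32 %sel, %x            ; -> select i1 %cc, i32 %x, i32 (%x + 7)
  ret i32 %r
}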
This is exactly the opposite of a transform I'd like to add to DAGCombiner's foldSelectOfConstants(),
so I was seeing infinite loops with my draft of a patch applied.
The changes in select_const.ll look positive (fewer instructions). The change in arm-and-tst-peephole.ll
is unrelated. We're changing the input IR in that test to preserve the intent of the test, but that's
not affected by this code change.
Differential Revision: https://reviews.llvm.org/D30355
llvm-svn: 296389
This pattern is essentially an i16 load from the p+1 address:
%p1.i16 = bitcast i8* %p to i16*
%p2.i8 = getelementptr i8, i8* %p, i64 2
%v1 = load i16, i16* %p1.i16
%v2.i8 = load i8, i8* %p2.i8
%v2 = zext i8 %v2.i8 to i16
%v1.shl = shl i16 %v1, 8
%res = or i16 %v1.shl, %v2
The current implementation would identify the %v1 load as the first byte load
and would mistakenly emit an i16 load from the %p1.i16 address. This patch adds
a check that the first byte is loaded from a non-zero offset of the first load
address. This way this address can be used as the base address for the combined
value. Otherwise just give up on combining.
llvm-svn: 296336
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner,
as it separates non-interfering loads/stores from the store-merging
logic.
When merging stores, the search goes up the chain through a single load, and
finds all possible stores by looking down from a load and a TokenFactor to
all stores visited.
This improves the quality of the output SelectionDAG and the output
codegen (save perhaps for some ARM cases where we correctly construct
wider loads but then promote them to float operations, which requires
more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyFromReg
   nodes, as these are captured by data dependence
6. Forward load-store values through TokenFactors containing
   {CopyToReg,CopyFromReg} values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as in
   some contexts invalid 64-bit operations are being generated. This
   can be removed once appropriate checks are added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296252
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed
goto, we needlessly spill about 100 constants to the stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296149
The motivation for filling out these select-of-constants cases goes back to D24480,
where we discussed removing an IR fold from add(zext) --> select. And that goes back to:
https://reviews.llvm.org/rL75531 and https://reviews.llvm.org/rL159230
The idea is that we should always canonicalize patterns like this to a select-of-constants
in IR because that's the smallest IR and the best for value tracking. Note that we currently
do the opposite in some cases (like the cases in *this* patch). I.e., the proposed folds in
this patch already exist in InstCombine today:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSelect.cpp#L1151
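For illustration (a minimal hedged IR sketch, not from the patch), the two
equivalent forms:
define i32 @select_form(i1 %b) {
  %r = select i1 %b, i32 5, i32 4     ; canonical select-of-constants form
  ret i32 %r
}
define i32 @add_form(i1 %b) {
  %z = zext i1 %b to i32
  %r = add i32 %z, 4                  ; equivalent ext/add form, often better for codegen
  ret i32 %r
}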
As this patch shows, most targets generate better machine code for simple ext/add/not ops
rather than a select of constants. So the follow-up steps to make this less of a patchwork
of special-case folds and missing IR canonicalization:
1. Have DAGCombiner convert any select of constants into ext/add/not ops.
2. Have InstCombine canonicalize in the other direction (create more selects).
Differential Revision: https://reviews.llvm.org/D30180
llvm-svn: 296137
clang will generate IR like this for input using packed bitfields;
very simple semantically, but it's a bit tricky to actually
generate good code.
llvm-svn: 296080
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed
goto, we needlessly spill about 100 constants to the stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296060
FastISel wasn't checking the isFPOnlySP subtarget feature before emitting
double-precision operations, so it got completely invalid CodeGen for doubles
on Cortex-M4F.
The normal ISel testing wasn't spectacular either so I added a second RUN line
to improve that while I was in the area.
llvm-svn: 296031
Introduce a common ValueHandler for call returns and formal arguments, and
inherit two different versions for handling the differences (at the moment the
only difference is the way physical registers are marked as used).
llvm-svn: 295973
Add support for lowering calls with parameters that can fit into regs. Use the
same ValueHandler that we used for function returns, but rename it to match its
new, extended purpose.
llvm-svn: 295971
The ARMConstantIslandPass didn't have support for handling accesses to
constant island objects through ARM::t2LDRBpci instructions. This adds
support for that.
This fixes PR31997.
llvm-svn: 295964