As noted in D34071, there are some IR optimization opportunities that could be
handled by normal IR passes if this expansion wasn't happening so late in CGP.
Regardless of that, it seems wasteful to knowingly produce suboptimal IR here,
so I'm proposing this change:
%s = sub i32 %x, %y
%r = icmp ne i32 %s, 0
=>
%r = icmp ne i32 %x, %y
Changing the predicate to 'eq' mimics what InstCombine would do, so that's just
an efficiency improvement if we decide this expansion should happen sooner.
The fact that the PowerPC backend doesn't eliminate the 'subf.' might be
something for PPC folks to investigate separately.
Differential Revision: https://reviews.llvm.org/D34416
llvm-svn: 306471
When SelectionDAG merges consecutive stores and loads in MergeConsecutiveStores, it does not set the dereferenceable flag for the newly created load instruction. This can cause an assertion failure if SelectionDAG commonizes this load with other load instructions, and it may also miss optimization opportunities.
This patch sets the dereferenceable flag for the newly created load instruction if all the load instructions being merged are dereferenceable.
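As a hedged illustration (not a test case from this patch), consider two adjacent loads feeding two adjacent stores; if the pair is merged into one wider load and store, the new load may keep the dereferenceable flag only because both original loads are known dereferenceable:
define void @copy16(i64* %dst, i64* dereferenceable(16) %src) {
  %src1 = getelementptr inbounds i64, i64* %src, i64 1
  %dst1 = getelementptr inbounds i64, i64* %dst, i64 1
  %a = load i64, i64* %src, align 8       ; dereferenceable via the argument attribute
  %b = load i64, i64* %src1, align 8      ; dereferenceable via the argument attribute
  store i64 %a, i64* %dst, align 8
  store i64 %b, i64* %dst1, align 8
  ret void
}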
Differential Revision: https://reviews.llvm.org/D34679
llvm-svn: 306404
The PowerPC backend does not pass the current optimization level to SelectionDAGISel, so SelectionDAGISel works with the default optimization level regardless of the current one.
This patch makes the PowerPC backend set the optimization level correctly.
Differential Revision: https://reviews.llvm.org/D34615
llvm-svn: 306367
When SelectionDAG expands a memcpy (or memmove) call into a sequence of load and store instructions, it disregards the dereferenceable flag even when the source pointer is known to be dereferenceable.
This results in an assertion failure if SelectionDAG commonizes a load instruction generated for the memcpy with another load instruction for the source pointer.
This patch makes SelectionDAG set the dereferenceable flag for the load instructions properly to avoid the assertion failure.
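A minimal sketch of the kind of IR affected (illustrative only, using the 2017-era memcpy intrinsic signature): %src is known dereferenceable, and the explicit load of %src may later be commoned with a load created by the memcpy expansion.
declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

define i8 @copy_and_read(i8* %dst, i8* dereferenceable(8) %src) {
  ; The expansion of this memcpy emits loads of %src that should be
  ; marked dereferenceable, matching the explicit load below.
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 8, i32 1, i1 false)
  %v = load i8, i8* %src, align 1
  ret i8 %v
}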
Differential Revision: https://reviews.llvm.org/D34467
llvm-svn: 306209
Define target hook isReallyTriviallyReMaterializable() to explicitly specify
PowerPC instructions that are trivially rematerializable. This will allow
the MachineLICM pass to accurately identify PPC instructions that should always
be hoisted.
Differential Revision: https://reviews.llvm.org/D34255
llvm-svn: 305932
This does some improvements/cleanup to the recently introduced
scavengeRegisterBackwards() functionality:
- Rewrite the findSurvivorBackwards algorithm to use the existing
LiveRegUnit::accumulateBackward() code. This also avoids the Available
and Candidates bitsets and needs just one LiveRegUnit instance
(= one bitset).
- Pick registers in allocation order instead of register number order.
llvm-svn: 305817
This is the last step needed to avoid regressions for x86 before we flip the switch to allow
expansion of the smallest set of memcpy() via CGP. The DAG version checks for constant strings,
so we need to do that here too.
FWIW, the two-constant test is not handled by LibCallSimplifier::optimizeMemCmp() because that
code is limited to 8-bit constant arrays. LibCallSimplifier will also fail to optimize some
one-constant tests because its alignment requirements are too strict (it shouldn't require
alignment for a constant operand).
Differential Revision: https://reviews.llvm.org/D34071
llvm-svn: 305734
Re-apply r276044/r279124/r305516. Fixed a problem where we would refuse
to place spills as the very first instruction of a basic block and thus
artificially increase pressure (test in
test/CodeGen/PowerPC/scavenging.mir:spill_at_begin)
This is a variant of scavengeRegister() that works for
enterBasicBlockEnd()/backward(). The benefit of the backward mode is
that it is not affected by incomplete kill flags.
This patch also changes
PrologEpilogInserter::doScavengeFrameVirtualRegs() to use the register
scavenger in backwards mode.
Differential Revision: http://reviews.llvm.org/D21885
llvm-svn: 305625
Revert because of reports that some PPC input started to spill when it
was predicted that it wouldn't, and no spill slot was reserved.
This reverts commit r305516.
llvm-svn: 305566
Re-apply r276044/r279124. Trying to reproduce or disprove the ppc64
problems reported in the stage2 build last time, which I cannot
reproduce right now.
This is a variant of scavengeRegister() that works for
enterBasicBlockEnd()/backward(). The benefit of the backward mode is
that it is not affected by incomplete kill flags.
This patch also changes
PrologEpilogInserter::doScavengeFrameVirtualRegs() to use the register
scavenger in backwards mode.
Differential Revision: http://reviews.llvm.org/D21885
llvm-svn: 305516
Add a condition for MachineLICM to safely hoist instructions that use
non-constant registers that are reserved.
On PPC, global variable access is done through the table of contents (TOC)
which is always in register X2. The ABI reserves this register in any
functions that have calls or access global variables.
A call through a function pointer involves saving, changing and restoring
this register around the call and thus MachineLICM does not consider it to
be invariant. We can however guarantee the register is preserved across the
call and thus is invariant.
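A hedged sketch of the situation (not the patch's test case): the address of @g is materialized from the TOC (in X2) inside the loop; the indirect call saves and restores X2, but its value is preserved across the call, so the TOC-based address materialization is loop-invariant and can be hoisted.
@g = global i32 0

define i32 @sum(void ()* %fp, i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
  %v = load i32, i32* @g          ; address of @g comes from the TOC (X2)
  call void %fp()                 ; indirect call: X2 is saved/restored around it
  %acc.next = add i32 %acc, %v
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret i32 %acc.next
}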
Differential Revision: https://reviews.llvm.org/D33562
llvm-svn: 305490
This patch fixes a potential verification error (64-bit register operands for cmpw) with -verify-machineinstrs.
Differential Revision: https://reviews.llvm.org/D34208
llvm-svn: 305479
Power9 has instructions that will reverse the bytes within an element for all
sizes (half-word, word, double-word and quad-word). These can be used for the
vec_revb builtins in altivec.h. However, we implement these to match vector
shuffle nodes as that will cover both the builtins and vector shuffles that
occur in the SDAG through other means.
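For illustration (a sketch, not the patch's test case), a halfword byte-reverse expressed as the kind of vector shuffle this patch matches:
define <8 x i16> @revb_v8i16(<8 x i16> %v) {
  %b = bitcast <8 x i16> %v to <16 x i8>
  ; swap the two bytes within each i16 element
  %s = shufflevector <16 x i8> %b, <16 x i8> undef,
       <16 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6,
                   i32 9, i32 8, i32 11, i32 10, i32 13, i32 12, i32 15, i32 14>
  %r = bitcast <16 x i8> %s to <8 x i16>
  ret <8 x i16> %r
}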
Differential Revision: https://reviews.llvm.org/D33690
llvm-svn: 305214
Note that if we need the result of both the divide and the modulo then we
compute the modulo based on the result of the divide and not using the new
hardware instruction.
Committing on behalf of Stefan Pintilie.
Differential Revision: https://reviews.llvm.org/D33940
llvm-svn: 305210
In PPCBoolRetToInt, bool values are changed to the i32 type. On ppc64 this may introduce an extra zero extension for the return value. This patch changes the integer type to i64 to avoid the zero extension on ppc64.
This fixes PR32442.
Differential Revision: https://reviews.llvm.org/D31407
llvm-svn: 305001
In SDAG, we don't expand libcalls with a nobuiltin attribute.
It's not clear if that's correct from the existing code comment:
"Don't do the check if marked as nobuiltin for some reason."
...adding a test here either way to show that there is currently
a different behavior implemented in the CGP-based expansion.
llvm-svn: 304991
The test diff for PowerPC shows we can better optimize if this case is one block.
For x86, there would be a substantial difference if CGP expansion were enabled because branches are assumed
cheap and SDAG can't optimize across blocks.
Instead of this:
_cmp_eq8:
movq (%rdi), %rax
cmpq (%rsi), %rax
je LBB23_1
## BB#2: ## %res_block
movl $1, %ecx
jmp LBB23_3
LBB23_1:
xorl %ecx, %ecx
LBB23_3: ## %endblock
xorl %eax, %eax
testl %ecx, %ecx
sete %al
retq
We get this:
cmp_eq8:
movq (%rdi), %rcx
xorl %eax, %eax
cmpq (%rsi), %rcx
sete %al
retq
And that matches the optimal codegen that we get from the current expansion in SelectionDAGBuilder::visitMemCmpCall().
If this looks right, then I just need to confirm that vector-sized expansion will work from here, and we can enable
CGP memcmp() expansion for x86. I.e., we'll bypass the power-of-2 special cases currently optimized in SDAG because we
can lower the IR produced here optimally.
Differential Revision: https://reviews.llvm.org/D34005
llvm-svn: 304987
This could be viewed as another shortcoming of the DAGCombiner:
when both operands of a compare are zexted from the same source
type, we should be able to compare the original types.
The effect on PowerPC perf is likely unnoticeable, but there's a
visible regression for x86 if we feed the suboptimal IR for memcmp
expansion to the DAG:
_cmp_eq4_zexted_to_i64:
movl (%rdi), %ecx
movl (%rsi), %edx
xorl %eax, %eax
cmpq %rdx, %rcx
sete %al
_cmp_eq4_better:
movl (%rdi), %ecx
xorl %eax, %eax
cmpl (%rsi), %ecx
sete %al
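For reference, the IR-level fold being described is (a sketch, not code from this commit):
%zx = zext i32 %x to i64
%zy = zext i32 %y to i64
%r = icmp eq i64 %zx, %zy
=>
%r = icmp eq i32 %x, %y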
llvm-svn: 304923
I'd like to enable CGP memcmp expansion for x86, but the output from CGP would regress the
special cases (memcmp(x,y,N) != 0 for N=1,2,4,8,16,32 bytes) that we already handle.
I'm not sure if we'll actually be able to produce the optimal code given the block-at-a-time
limitation in the DAG. We might have to just avoid those special-cases here in CGP. But
regardless of that, I think this is a win for the more general cases.
http://rise4fun.com/Alive/cbQ
Differential Revision: https://reviews.llvm.org/D33963
llvm-svn: 304849
3 of the tests were testing exactly the same thing: memcmp(x, y, 16) != 0.
I changed that to test 4, 7, and 16 bytes, so we can see how those differ.
llvm-svn: 304838
This pass allows running the register scavenger independently of
PrologEpilogInserter, to allow targeted testing.
Also adds some basic register scavenging tests.
llvm-svn: 304606
Fixes PPCTTIImpl::getCacheLineSize() returning the wrong cache line size for
newer ppc processors.
Committing on behalf of Stefan Pintilie.
Differential Revision: https://reviews.llvm.org/D33656
llvm-svn: 304317
This patch does an inline expansion of memcmp.
It changes the memcmp library call into an inline expansion when the size is
known at compile time and is under a target-specified threshold.
This expansion is implemented in CodeGenPrepare and expands into straight line
code. The target specifies a maximum load size and the expansion works by using
this size to load the two sources, compare, and exit early if a difference is
found. It also has a special case when the memcmp result is used in a compare
to zero equality.
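As a hedged sketch (not a test from the patch), this is the kind of call that gets expanded when the size is a small compile-time constant and the result is only compared to zero:
declare i32 @memcmp(i8*, i8*, i64)

define i1 @cmp_eq16(i8* %x, i8* %y) {
  %c = call i32 @memcmp(i8* %x, i8* %y, i64 16)
  ; Only equality of the result matters, so the expansion can use the
  ; compare-to-zero special case.
  %eq = icmp eq i32 %c, 0
  ret i1 %eq
}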
Differential Revision: https://reviews.llvm.org/D28637
llvm-svn: 304313
There are some VectorShuffle nodes in the SDAG which can be selected to the XXPERMDI
instruction. This patch recognizes them and performs the selection to improve
PPC performance.
Differential Revision: https://reviews.llvm.org/D33404
llvm-svn: 304298
This patch builds upon https://reviews.llvm.org/rL302810 to add
handling for bitwise logical operations in general purpose registers.
The idea is to keep the values in GPRs as long as possible - only
extracting them to a condition register bit when no further operations
are to be done.
Differential Revision: https://reviews.llvm.org/D31851
llvm-svn: 304282
Summary:
AntiDepBreaker intends to add all live-outs, including the implicit
CSRs, in StartBlock. r299124 was done without understanding that
intention.
Now with the live-ins propagated correctly (D32464), we can revert this change.
Reviewers: MatzeB, qcolombet
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D33697
llvm-svn: 304251
Re-commit r303938 and r303954 with a fix for addLiveIns(): the internal
addPristines() function must be called on an empty set or it may
accidentally reset saved registers.
- addLiveOutsNoPristines() needs to add callee saved registers that are
actually saved and restored somewhere to the set (they are not
pristine).
- Cleanup/rewrite the code for addLiveOuts()/addLiveOutsNoPristines().
This fixes the problem from D32156.
Differential Revision: https://reviews.llvm.org/D32464
llvm-svn: 304001
I forgot to forward the chain, causing some missing instruction
dependencies. The test crashes the compiler without this patch.
Inspired by the test case, D33519 also tries to remove the extra sync.
Differential Revision: https://reviews.llvm.org/D33573
llvm-svn: 303931
There are some VectorShuffle nodes in the SDAG which can be selected to the XXSLDWI
instruction. This patch recognizes them and performs the selection to improve
PPC performance.
llvm-svn: 303822
The PPC backend eliminates compare instructions by using record-form instructions in PPCInstrInfo::optimizeCompareInstr, which is called from the peephole optimization pass.
This patch improves this optimization to eliminate more compare instructions in two common types of cases.
- comparison against a constant 1 or -1
The record-form instructions set the CR bit based on a signed comparison against 0, so the current implementation does not exploit them for comparisons against a non-zero constant.
This patch enables record-form optimization for constants of 1 or -1 where possible; it changes the condition "greater than -1" into "greater than or equal to 0" and "less than 1" into "less than or equal to 0".
With this patch, compare can be eliminated in the following code sequence, as an example.
uint64_t a, b;
if ((a | b) & 0x8000000000000000ull) { ... }
else { ... }
- andi. for 32-bit comparison on PPC64
Since record-form instructions perform a 64-bit signed comparison, we are limited in eliminating 32-bit comparisons (i.e. with cmplwi) using the record form. The original implementation already has such checks, but andi. was not recognized as an instruction that performs an implicit zero extension and is hence safe to convert into record form when used for an equality check.
%1 = and i32 %a, 10
%2 = icmp ne i32 %1, 0
br i1 %2, label %foo, label %bar
In this simple example, LLVM generates andi. + cmplwi + beq on PPC64.
This patch makes it possible to eliminate the cmplwi in this case.
I added andi. to the optimization targets where it is safe to do so.
Differential Revision: https://reviews.llvm.org/D30081
llvm-svn: 303500
When legalizing vector operations on vNi128, they will be split to v1i128
because that is a legal type on ppc64, but then the compiler will crash in the
selection DAG because it fails to select these operations. This patch fixes
shift operations. Logical shift right and left shift can be performed in the
vector unit, but algebraic shift right requires being split.
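A minimal example (a sketch, not the patch's test case) of an operation that previously crashed instruction selection after being legalized to v1i128:
define <1 x i128> @shl_v1i128(<1 x i128> %a, <1 x i128> %b) {
  %r = shl <1 x i128> %a, %b
  ret <1 x i128> %r
}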
Differential Revision: https://reviews.llvm.org/D32774
llvm-svn: 303307
The variables MinGPR/MinG8R were not updated properly when resetting the
offsets, which in the included testcase led to saving the CR register
in the same location as R30.
This fixes another issue reported in PR26519.
Differential Revision: https://reviews.llvm.org/D33017
llvm-svn: 303257
Summary:
This fixes pr32392.
The lowering pipeline is:
llvm.ppc.cfence in IR -> PPC::CFENCE8 in isel -> Actual instructions in
expandPostRAPseudo.
The reason why expandPostRAPseudo is chosen is that previous passes
are likely to eliminate instructions like cmpw 3, 3 (early CSE) and bne-
7, .+4 (some branch pass(es)).
Differential Revision: https://reviews.llvm.org/D32763
llvm-svn: 303205
Summary:
In SelectionDAG, when a store is immediately chained to another store
to the same address, elide the first store as it has no observable
effects. This causes small improvements when dealing with intrinsics
lowered to stores.
Test notes:
* Many testcases overwrite store addresses multiple times and needed
minor changes, mainly making stores volatile to prevent the
optimization from optimizing the test away.
* Many X86 test cases optimized out instructions associated with
va_start.
* Note that test_splat in CodeGen/AArch64/misched-stp.ll no longer has
dependencies to check and can probably be removed and potentially
replaced with another test.
Reviewers: rnk, john.brawn
Subscribers: aemerson, rengolin, qcolombet, jyknight, nemanjai, nhaehnle, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D33206
llvm-svn: 303198
At O3 we are more willing to increase size if we believe it will improve
performance. The current threshold for tail-duplication of 2 instructions is
conservative, and can be relaxed at O3.
Benchmark results:
llvm test-suite:
6% improvement in aha, due to duplication of loop latch
3% improvement in hexxagon
2% slowdown in lpbench. Seems related, but couldn't completely diagnose.
Internal google benchmark:
Produces 4% improvement on internal google protocol buffer serialization
benchmarks.
Differential-Revision: https://reviews.llvm.org/D32324
llvm-svn: 303084
According to the Power ISA V3.0 document, the first source operand of mtvsrdd is the constant 0 if r0 is specified. So the corresponding register constraint should be g8rc_nox0.
This bug caused wrong output from 401.bzip2 when -mcpu=power9 and FDO are specified.
Differential Revision: https://reviews.llvm.org/D32880
llvm-svn: 302834
This patch is the first in a series of patches to provide code gen for
doing compares in GPRs when the compare result is required in a GPR.
It adds the infrastructure to select GPR sequences for i1->i32 and i1->i64
extensions. This first patch handles equality comparison on i32 operands with
the result sign or zero extended.
Differential Revision: https://reviews.llvm.org/D31847
llvm-svn: 302810
Using arguments with the inalloca attribute creates problems for verification
of the machine representation. This attribute instructs the backend that the
argument is prepared on the stack prior to the CALLSEQ_START..CALLSEQ_END
sequence (see http://llvm.org/docs/InAlloca.htm for details). The frame size
stored in CALLSEQ_START in this case does not count the size of this
argument. However, CALLSEQ_END still keeps the total frame size, as the caller can
be responsible for cleanup of the entire frame. So CALLSEQ_START and
CALLSEQ_END keep different frame sizes, and the difference is treated by
MachineVerifier as a stack error. Currently there is no way to distinguish
this case from actual errors.
This patch adds additional argument to CALLSEQ_START and its
target-specific counterparts to keep size of stack that is set up prior to
the call frame sequence. This argument allows MachineVerifier to calculate
actual frame size associated with frame setup instruction and correctly
process the case of inalloca arguments.
The changes made by the patch are:
- Frame setup instructions get a second mandatory argument. This
affects all targets that use frame pseudo instructions and touched many
files, although the changes are uniform.
- Accesses to frame properties are implemented through dedicated methods
rather than calls to getOperand(N).getImm(). For X86 and ARM such a
replacement was made previously.
- Changes that reflect the appearance of the additional argument of the frame
setup instruction. These involve proper instruction initialization and
methods that access instruction arguments.
- MachineVerifier retrieves the frame size using a method that reports the sum
of the frame parts initialized inside the frame instruction pair and outside it.
The patch implements approach proposed by Quentin Colombet in
https://bugs.llvm.org/show_bug.cgi?id=27481#c1.
It fixes 9 tests failed with machine verifier enabled and listed
in PR27481.
Differential Revision: https://reviews.llvm.org/D32394
llvm-svn: 302527
This happened on the PPC32/SVR4 path and was discovered when building
FreeBSD on PPC32. It was a typo-class error in the frame lowering code.
This fixes PR26519.
llvm-svn: 302183
Summary:
This is the corresponding llvm change to D28037 to ensure no performance
regression.
Reviewers: bogner, kbarton, hfinkel, iteratee, echristo
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D28329
llvm-svn: 301990
Fixes PR30730.
This is a re-commit of a pulled commit. The commit was pulled because some
software projects contained uses of Altivec vectors that violated alignment
requirements. Known issues have now been fixed.
Committing on behalf of Lei Huang.
Differential Revision: https://reviews.llvm.org/D26861
llvm-svn: 301892
Summary:
In some cases LLVM (especially the SLP vectorizer) will create vectors
that are 256 bytes (or larger). Given that this is intentional[0] and is
likely to become more common, this patch updates the StackMap binary
format to deal with the spill locations for said vectors.
This change also bumps the stack map version from 2 to 3.
[0]: https://reviews.llvm.org/D32533#738350
Reviewers: reames, kavon, skatkov, javed.absar
Subscribers: mcrosier, nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D32629
llvm-svn: 301615
When functions are terminated by unreachable instructions, the last
instruction might trigger a CFI instruction to be generated. However,
emitting it would be be illegal since the function (and thus the FDE
the CFI is in) has already ended with the previous instruction.
Darwin's dwarfdump --verify --eh-frame complains about this and the
specification supports this.
Relevant bits from the DWARF 5 standard (6.4 Call Frame Information):
"[The] address_range [field in an FDE]: The number of bytes of
program instructions described by this entry."
"Row creation instructions: [...]
The new location value is always greater than the current one."
The first quotation implies that a CFI cannot describe a target
address outside of the enclosing FDE's range.
rdar://problem/26244988
Differential Revision: https://reviews.llvm.org/D32246
llvm-svn: 301219
This allows forming more 'not' ops, so we get improvements for ISAs that have and-not.
Follow-up to:
https://reviews.llvm.org/rL300725
llvm-svn: 300763
Check the legality of ISD::[US]MULO to see whether
Intrinsic::[us]mul_with_overflow will legalize into a function call (and, thus,
will use the CTR register). Fixes PR32485.
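A hedged illustration of the check: if a multiply-with-overflow intrinsic legalizes into a runtime-library call (for example, assuming i128 [US]MULO is not legal on the target), a loop containing it must not be converted to a CTR loop.
declare { i128, i1 } @llvm.umul.with.overflow.i128(i128, i128)

define i1 @will_overflow(i128 %a, i128 %b) {
  %res = call { i128, i1 } @llvm.umul.with.overflow.i128(i128 %a, i128 %b)
  %ovf = extractvalue { i128, i1 } %res, 1
  ret i1 %ovf
}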
Patch by Tim Neumann!
Differential Revision: https://reviews.llvm.org/D31790
llvm-svn: 299910
The new codepath has been in the tree for years, and there isn't any
reason to use two codepaths here.
Differential Revision: https://reviews.llvm.org/D30596
llvm-svn: 299723
This is a generic combine enabled via target hook to reduce icmp logic as discussed in:
https://bugs.llvm.org/show_bug.cgi?id=32401
It's likely that other targets will want to enable this hook for scalar transforms,
and there are probably other patterns that can use bitwise logic to reduce comparisons.
Note that we are missing an IR canonicalization for these patterns, and we will probably
prefer the pair-of-compares form in IR (shorter, more likely to fold).
Differential Revision: https://reviews.llvm.org/D31483
llvm-svn: 299542
The code already allowed vector types in via "isInteger" (which might want
a more specific name), so use splat-friendly constant predicates to match
those types.
llvm-svn: 299304
(and (setlt X, 0), (setlt Y, 0)) --> (setlt (and X, Y), 0)
We have 7 similar folds, but this one got away. The fact that the
x86 test with a branch didn't change is probably a separate bug. We
may also be missing this and the related folds in instcombine.
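The equivalent fold written in IR terms (a sketch, not part of this commit):
%a = icmp slt i32 %x, 0
%b = icmp slt i32 %y, 0
%r = and i1 %a, %b
=>
%xy = and i32 %x, %y
%r = icmp slt i32 %xy, 0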
llvm-svn: 299252
These are the same tests added for x86 with r299238,
but PPC doesn't specify all branches as cheap, so we
see different patterns in tests with branches.
llvm-svn: 299244
Now, as an alternative to the TargetOptions.AllowFPOpFusion global flag, FMUL->FADD
can also use the per-operation FMF to allow fusion.
The idea here is not to port everything to the new scheme (e.g. fused
multiply-and-sub will be ported later) but to make this work all the way from
clang.
The transformation is conditionalized on *both* the FADD and the FMUL having
the FMF contract flag.
Differential Revision: https://reviews.llvm.org/D31169
llvm-svn: 299096
In PPCBoolRetToInt, bool values are changed to the i32 type. On ppc64 this may introduce an extra zero extension for the return value. This patch changes the integer type to i64 to avoid the zero extension on ppc64.
This fixes PR32442.
Differential Revision: https://reviews.llvm.org/D31407
llvm-svn: 298955
Summary:
For the following CFG:
A->B
B->C
A->C
If there is another edge B->D, then ABC should not be considered a triangle.
Reviewers: davidxl, iteratee
Reviewed By: iteratee
Subscribers: nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D31310
llvm-svn: 298661
Summary: Add tests for all atomic operations for powerpc64le, so that all changes can be easily examined.
Reviewers: kbarton, hfinkel, echristo
Subscribers: mehdi_amini, nemanjai, llvm-commits
Differential Revision: https://reviews.llvm.org/D31285
llvm-svn: 298614
I had adjusted the test case before when testing a chain of length 2, and then
reverted it with rL296845 when I switched to 3 triangles. After running
benchmarks and examining generated code at length 2 I forgot to put the test
back.
llvm-svn: 298000
mfvrd and mffprd are both aliases of mfvsrd.
This patch enables correct parsing of the aliases, but we still emit a mfvsrd.
Committing on behalf of brunoalr (Bruno Rosa).
Differential Revision: https://reviews.llvm.org/D29177
llvm-svn: 297849
Recommitting with compile time improvements.
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as it separates the non-interfering loads/stores from the
store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads, but then promote them to float operations, which
require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 297695
After inspection, it's UB in our code base. Someone cast a var-arg
function pointer to a non-var-arg one. :/
Re-commit r296771 to continue testing on the patch.
Sorry for the trouble!
llvm-svn: 297256
This reverts commit r296771.
We found some widespread test failures internally. I'm working on a
testcase. Politely revert the patch in the meantime. :)
llvm-svn: 297124
select Cond, C +/- 1, C --> add(ext Cond), C -- with a target hook.
This is part of the ongoing process to obsolete D24480. The motivation is to
canonicalize to select IR in InstCombine whenever possible, so we need to have a way to
undo that easily in codegen.
PowerPC is an obvious winner for this kind of transform because it has fast and complete
bit-twiddling abilities but generally lousy conditional execution perf (although this might
have changed in recent implementations).
x86 also sees some wins, but the effect is limited because these transforms already mostly
exist in its target-specific combineSelectOfTwoConstants(). The fact that we see any x86
changes just shows that that code is a mess of special-case holes. We may be able to remove
some of that logic now.
My guess is that other targets will want to enable this hook for most cases. The likely
follow-ups would be to add value type and/or the constants themselves as parameters for the
hook. As the tests in select_const.ll show, we can transform any select-of-constants to
math/logic, but the general transform for any 2 constants needs one more instruction
(multiply or 'and').
ARM is one target that I think may not want this for most cases. I see infinite loops there
because it wants to use selects to enable conditionally executed instructions.
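As a sketch of the transform in IR terms (illustrative; the hook itself is applied in the DAG):
%sel = select i1 %cond, i32 41, i32 40
=>
%ext = zext i1 %cond to i32
%sel = add i32 %ext, 40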
Differential Revision: https://reviews.llvm.org/D30537
llvm-svn: 296977
This patch causes compile times for some patterns to explode. I have
a (large, unreduced) test case that slows down by more than 20x and
several test cases slow down by 2x. I'm sending some of the test cases
directly to Nirav and following up with more details in the review log,
but this should unblock anyone else hitting this.
llvm-svn: 296862
For chains of triangles with small join blocks that can be tail duplicated, a
simple calculation of probabilities is insufficient. Tail duplication
can be profitable in 3 different ways for these cases:
1) The post-dominators marked 50% are actually taken 56% (This shrinks with
longer chains)
2) The chains are statically correlated. Branch probabilities have a very
U-shaped distribution.
[http://nrs.harvard.edu/urn-3:HUL.InstRepos:24015805]
If the branches in a chain are likely to be from the same side of the
distribution as their predecessor, but are independent at runtime, this
transformation is profitable. (Because the cost of being wrong is a small
fixed cost, unlike the standard triangle layout where the cost of being
wrong scales with the # of triangles.)
3) The chains are dynamically correlated. If the probability that a previous
branch was taken positively influences whether the next branch will be
taken, this transformation is also profitable.
We believe that 2 and 3 are common enough to justify the small margin in 1.
The code pre-scans a function's CFG to identify this pattern and marks the edges
so that the standard layout algorithm can use the computed results.
llvm-svn: 296845
This patch fixes pr32063.
Current code in PPCTargetLowering::PerformDAGCombine can transform
bswap
store
into a single PPCISD::STBRX instruction, but it doesn't consider the case where the operand size of the bswap may be larger than the store size. When that occurs, we need two modifications:
1. For the last operand of PPCISD::STBRX, we should not use DAG.getValueType(N->getOperand(1).getValueType()); instead we should use cast<StoreSDNode>(N)->getMemoryVT().
2. Before PPCISD::STBRX, we need to shift the original operand of the bswap to the right side.
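A hedged sketch of the problematic pattern (not the patch's test case): the bswap operates on i32, but only i16 is stored, so the store's memory VT is narrower than the bswap operand.
declare i32 @llvm.bswap.i32(i32)

define void @truncating_bswap_store(i32 %v, i16* %p) {
  %b = call i32 @llvm.bswap.i32(i32 %v)
  %t = trunc i32 %b to i16
  store i16 %t, i16* %p, align 2
  ret void
}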
Differential Revision: https://reviews.llvm.org/D30362
llvm-svn: 296811
This patch reduces the stack frame size by not allocating the parameter area if
it is not required. In the current implementation LowerFormalArguments_64SVR4
already handles the parameter area, but LowerCall_64SVR4 does not
(when calculating the stack frame size). What this patch does is make
LowerCall_64SVR4 consistent with LowerFormalArguments_64SVR4.
Committing on behalf of Hiroshi Inoue.
Differential Revision: https://reviews.llvm.org/D29881
llvm-svn: 296771
This is part of the ongoing attempt to improve select codegen for all targets and select
canonicalization in IR (see D24480 for more background). The transform is a subset of what
is done in InstCombine's FoldOpIntoSelect().
I first noticed a regression in the x86 avx512-insert-extract.ll tests with a patch that
hopes to convert more selects to basic math ops. This appears to be a general missing DAG
transform though, so I added tests for all standard binops in rL296621
(PowerPC was chosen semi-randomly; it has scripted FileCheck support, but so do ARM and x86).
The poor output for "sel_constants_shl_constant" is tracked with:
https://bugs.llvm.org/show_bug.cgi?id=32105
Differential Revision: https://reviews.llvm.org/D30502
llvm-svn: 296699
This patch adds a MachineSSA pass that coalesces blocks that branch
on the same condition.
Committing on behalf of Lei Huang.
Differential Revision: https://reviews.llvm.org/D28249
llvm-svn: 296670
Resubmit r295336 after the bug with non-zero offset patterns on BE targets is fixed (r296336).
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patterns.
Reviewed By: filcab
Differential Revision: https://reviews.llvm.org/D29591
llvm-svn: 296651
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as it separates the non-interfering loads/stores from the
store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads, but then promote them to float operations, which
require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296476
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we emit in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about 100 constants to the stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296416
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
a simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as it separates the non-interfering loads/stores from the
store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads, but then promote them to float operations, which
require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296252
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we emit in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about 100 constants to the stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296149
Provide a 64-bit pattern to use SUBFIC for subtracting from a 16-bit immediate.
The corresponding pattern already exists for 32-bit integers.
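A minimal sketch (illustrative, not the patch's test case) of code that can now use subfic in 64-bit mode:
define i64 @sub_from_imm(i64 %x) {
  %r = sub i64 100, %x     ; 100 fits in a signed 16-bit immediate
  ret i64 %r
}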
Committing on behalf of Hiroshi Inoue.
Differential Revision: https://reviews.llvm.org/D29387
llvm-svn: 296144
Emit clrrdi (extended mnemonic for rldicr) for AND-ing with masks that
clear bits from the right hand side.
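A minimal sketch (illustrative, not the patch's test case) of the kind of mask this covers; an AND that clears only low-order bits can be emitted as clrrdi:
define i64 @clear_low4(i64 %x) {
  %r = and i64 %x, -16     ; mask 0xFFFFFFFFFFFFFFF0 clears the low 4 bits
  ret i64 %r
}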
Committing on behalf of Hiroshi Inoue.
Differential Revision: https://reviews.llvm.org/D29388
llvm-svn: 296143