//===- CodeGenPrepare.cpp - Prepare a function for code generation --------===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This pass munges the code in the input function to better prepare it for
// SelectionDAG-based code generation. This works around limitations in its
// basic-block-at-a-time approach. It should eventually be removed.
//
//===----------------------------------------------------------------------===//

#include "llvm/CodeGen/Passes.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
[CodeGenPrepare] Move extractelement close to store if they can be combined.
This patch adds an optimization in CodeGenPrepare to move an extractelement
right before a store when the target can combine them.
The optimization may promote any scalar operations to vector operations along
the way to make that possible.
** Context **
Some targets use different register files for vector and scalar operations.
This means that transitioning from one domain to another may incur a copy from
one register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR
move is 20 cycles.
** Motivating Example **
Let us consider an example:
define void @foo(<2 x i32>* %addr1, i32* %dest) {
  %in1 = load <2 x i32>* %addr1, align 8
  %extract = extractelement <2 x i32> %in1, i32 1
  %out = or i32 %extract, 1
  store i32 %out, i32* %dest, align 4
  ret void
}
As it is, this IR generates the following assembly on armv7:
  vldr d16, [r0]              @ vector load
  vmov.32 r0, d16[1]          @ cross-register-file copy: 20 cycles
  orr r0, r0, #1              @ scalar bitwise or
  str r0, [r1]                @ scalar store
  bx lr
Whereas we could generate much faster code:
  vldr d16, [r0]              @ vector load
  vorr.i32 d16, #0x1          @ vector bitwise or
  vst1.32 {d16[1]}, [r1:32]   @ vector extract + store
  bx lr
Half of the computation made in the vector is useless, but this lets us get
rid of the expensive cross-register-file copy.
** Proposed Solution **
To avoid this cross-register-copy penalty, we promote the scalar operations to
vector operations. The penalty will be removed if we manage to promote the whole
chain of computation in the vector domain.
Currently, we do that only when the chain of computation ends with a store and
the target is able to combine an extract with a store.
Stores are the most likely candidates, because other instructions produce values
that would need to be promoted and so, extracted at some point[1]. Moreover,
it is customary for targets to feature stores that perform a vector extract (see
AArch64 and X86 for instance).
The proposed implementation relies on the TargetTransformInfo to decide whether
or not it is beneficial to promote a chain of computation in the vector domain.
Unfortunately, this interface is rather inaccurate for this level of detail and
although this optimization may be beneficial for X86 and AArch64, the inaccuracy
will lead to the optimization being too aggressive.
Basically in TargetTransformInfo, everything that is legal has a cost of 1,
whereas, even if a vector type is legal, usually a vector operation is slightly
more expensive than its scalar counterpart. That will lead to too many
promotions that may not be counterbalanced by the saving of the
cross-register-file copy. For instance, on AArch64 this penalty is just 4
cycles.
For now, the optimization is just enabled for ARM prior to v8, since those
processors have a larger penalty on cross-register-file copies, and the scope is
limited to basic blocks. Because of these two factors, we limit the effects of
the inaccuracy. Indeed, I did not want to build up a fancy cost model with block
frequency and everything on top of that.
[1] We can imagine targets that can combine an extractelement with instructions
other than just stores. If we want to go in that direction, the current
interfaces must be augmented and, moreover, I think this becomes a global isel
problem.
Differential Revision: http://reviews.llvm.org/D5921
<rdar://problem/14170854>
llvm-svn: 220978
2014-11-01 01:52:53 +08:00
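The trade-off described in the commit message above can be sketched in plain C++. This is only an illustration under stated assumptions: `ChainCost`, `isProfitableToPromote`, `extractThenOr`, and `orThenExtract` are hypothetical names, the cost fields are made-up inputs rather than values queried from TargetTransformInfo, and the real pass works on LLVM IR rather than on arrays.

```cpp
#include <array>
#include <cstdint>

// Hypothetical cost summary for one extract -> scalar ops -> store chain.
// The fields are illustrative and are not part of any LLVM interface.
struct ChainCost {
  unsigned NumOps;            // operations between the extract and the store
  unsigned ScalarOpCost;      // cost of one operation in the scalar domain
  unsigned VectorOpCost;      // cost of the same operation on a whole vector
  unsigned CrossFileCopyCost; // vector-to-GPR move, e.g. 20 on Cortex-A8
};

// Promote only if staying in the vector domain is strictly cheaper than
// paying for the cross-register-file copy plus the scalar operations.
bool isProfitableToPromote(const ChainCost &C) {
  unsigned ScalarPath = C.CrossFileCopyCost + C.NumOps * C.ScalarOpCost;
  unsigned VectorPath = C.NumOps * C.VectorOpCost;
  return VectorPath < ScalarPath;
}

// Original form: extract lane 1, then OR in the scalar domain.
uint32_t extractThenOr(const std::array<uint32_t, 2> &V) { return V[1] | 1u; }

// Promoted form: OR every lane (lane 0 is wasted work), then extract lane 1.
uint32_t orThenExtract(std::array<uint32_t, 2> V) {
  for (uint32_t &Lane : V)
    Lane |= 1u;
  return V[1];
}
```

With a 20-cycle copy, a single promoted operation pays off even if the vector form costs twice the scalar one. With a 4-cycle copy and a longer chain it may not, which matches the reason given for limiting the optimization to ARM prior to v8.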
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GetElementPtrTypeIterator.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InlineAsm.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
[CodeGenPrepare] Split branch conditions into multiple conditional branches.
This optimization transforms code like:
bb1:
  %0 = icmp ne i32 %a, 0
  %1 = icmp ne i32 %b, 0
  %or.cond = or i1 %0, %1
  br i1 %or.cond, label %TrueBB, label %FalseBB
into multiple branch instructions like:
bb1:
  %0 = icmp ne i32 %a, 0
  br i1 %0, label %TrueBB, label %bb2
bb2:
  %1 = icmp ne i32 %b, 0
  br i1 %1, label %TrueBB, label %FalseBB
This optimization is already performed by SelectionDAG, but not by FastISel.
FastISel cannot perform this optimization, because it cannot generate new
MachineBasicBlocks.
Performing this optimization at CodeGenPrepare time makes it available to both
SelectionDAG and FastISel, and the implementation in SelectionDAG could be
removed. There are currently a few differences in codegen for X86 and PPC, so
this commit only enables it for FastISel.
Reviewed by Jim Grosbach
This fixes rdar://problem/19034919.
llvm-svn: 223786
2014-12-10 00:36:13 +08:00
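The branch splitting described in the commit message above can be modelled with two small C++ functions that return which block control reaches. This is a hedged sketch rather than the pass itself: `Dest`, `branchOnOr`, and `branchSplit` are hypothetical names chosen to mirror the labels in the IR.

```cpp
// Labels mirror the basic-block names used in the IR example.
enum class Dest { TrueBB, FalseBB };

// Before: both compares are evaluated, OR-ed, and a single branch is taken.
Dest branchOnOr(int A, int B) {
  bool Cond0 = A != 0;
  bool Cond1 = B != 0;
  bool OrCond = Cond0 | Cond1;
  return OrCond ? Dest::TrueBB : Dest::FalseBB;
}

// After: bb1 branches on the first compare, and bb2 is reached only when
// the first compare is false.
Dest branchSplit(int A, int B) {
  if (A != 0) // bb1
    return Dest::TrueBB;
  if (B != 0) // bb2
    return Dest::TrueBB;
  return Dest::FalseBB;
}
```

Both functions pick the same destination for every input; the split form simply skips the second compare when the first one already decides the outcome.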
#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Statepoint.h"
#include "llvm/IR/ValueHandle.h"
#include "llvm/IR/ValueMap.h"
#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetLowering.h"
#include "llvm/Target/TargetSubtargetInfo.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/BuildLibCalls.h"
#include "llvm/Transforms/Utils/BypassSlowDivision.h"
#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/SimplifyLibCalls.h"
using namespace llvm;
using namespace llvm::PatternMatch;

#define DEBUG_TYPE "codegenprepare"

STATISTIC(NumBlocksElim, "Number of blocks eliminated");
STATISTIC(NumPHIsElim, "Number of trivial PHIs eliminated");
STATISTIC(NumGEPsElim, "Number of GEPs converted to casts");
STATISTIC(NumCmpUses, "Number of uses of Cmp expressions replaced with uses of "
                      "sunken Cmps");
STATISTIC(NumCastUses, "Number of uses of Cast expressions replaced with uses "
                       "of sunken Casts");
STATISTIC(NumMemoryInsts, "Number of memory instructions whose address "
                          "computations were sunk");
STATISTIC(NumExtsMoved, "Number of [s|z]ext instructions combined with loads");
STATISTIC(NumExtUses, "Number of uses of [s|z]ext instructions optimized");
STATISTIC(NumRetsDup, "Number of return instructions duplicated");
STATISTIC(NumDbgValueMoved, "Number of debug value instructions moved");
CodeGenPrepare: Add a transform to turn selects into branches in some cases.
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
  ucomisd (%rdi), %xmm0
  cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic: it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-order architectures are
unlikely to benefit as well; those are included in the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
The test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the microarchitecture used. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case, and to him and Evan Cheng for
providing me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
2012-05-05 20:49:22 +08:00
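The conservative heuristic described in the commit message above can be written down as a small predicate. This is a minimal sketch assuming a simplified view of the inputs: `SelectSummary` and `shouldTurnIntoBranch` are hypothetical names, and the two boolean parameters stand in for the -Os setting and the "predictableSelectIsExpensive" target flag mentioned in the message.

```cpp
// Hypothetical summary of one select candidate; not an LLVM data structure.
struct SelectSummary {
  bool HasOneUse;              // the condition feeds only this select
  bool HasDirectMemoryOperand; // the compare reads one operand from memory
};

// Returns true when the select should be rewritten as a branch.
bool shouldTurnIntoBranch(const SelectSummary &S, bool OptimizeForSize,
                          bool PredictableSelectIsExpensive) {
  // cmovs usually save code size, so the transform is skipped at -Os.
  if (OptimizeForSize)
    return false;
  // Targets that do not set the flag (e.g. in-order cores) are left alone.
  if (!PredictableSelectIsExpensive)
    return false;
  // Only the slow load+compare pattern from the message is converted.
  return S.HasOneUse && S.HasDirectMemoryOperand;
}
```

The predicate deliberately ignores branch probability, in line with the message's note that BPI did not support select instructions at the time.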
STATISTIC(NumSelectsExpanded, "Number of selects turned into branches");
STATISTIC(NumAndCmpsMoved, "Number of and/cmp's pushed into branches");
STATISTIC(NumStoreExtractExposed, "Number of store(extractelement) exposed");

static cl::opt<bool> DisableBranchOpts(
    "disable-cgp-branch-opts", cl::Hidden, cl::init(false),
    cl::desc("Disable branch optimizations in CodeGenPrepare"));

static cl::opt<bool>
    DisableGCOpts("disable-cgp-gc-opts", cl::Hidden, cl::init(false),
                  cl::desc("Disable GC optimizations in CodeGenPrepare"));

static cl::opt<bool> DisableSelectToBranch(
    "disable-cgp-select2branch", cl::Hidden, cl::init(false),
    cl::desc("Disable select to branch conversion."));

static cl::opt<bool> AddrSinkUsingGEPs(
    "addr-sink-using-gep", cl::Hidden, cl::init(false),
    cl::desc("Address sinking in CGP using GEPs."));

static cl::opt<bool> EnableAndCmpSinking(
    "enable-andcmp-sinking", cl::Hidden, cl::init(true),
    cl::desc("Enable sinking and/cmp into branches."));

static cl::opt<bool> DisableStoreExtract(
    "disable-cgp-store-extract", cl::Hidden, cl::init(false),
    cl::desc("Disable store(extract) optimizations in CodeGenPrepare"));

static cl::opt<bool> StressStoreExtract(
    "stress-cgp-store-extract", cl::Hidden, cl::init(false),
    cl::desc("Stress test store(extract) optimizations in CodeGenPrepare"));

[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it does
not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
  %ld = load i8* %addr1
  %zextld = zext i8 %ld to i32
  %ld2 = load i32* %addr2
  %add = add nsw i32 %ld2, %zextld
  %sextadd = sext i32 %add to i64
  %zexta = zext i8 %a to i32
  %addza = add nsw i32 %zexta, %zextld
  %sextaddza = sext i32 %addza to i64
  %addb = add nsw i32 %b, %zextld
  %sextaddb = sext i32 %addb to i64
  call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
  ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
  movzbl (%rdi), %eax    # zero-extended load
  movl (%rsi), %esi      # plain load
  addl %eax, %esi        # 32-bit add
  movslq %esi, %rdi      # sign extend the result of add
  movzbl %dl, %edx       # zero extend the first argument
  addl %eax, %edx        # 32-bit add
  movslq %edx, %rsi      # sign extend the result of add
  addl %eax, %ecx        # 32-bit add
  movslq %ecx, %rdx      # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
  movzbl (%rdi), %eax    # zero-extended load
  movslq (%rsi), %rdi    # sign-extended load
  addq %rax, %rdi        # 64-bit add
  movzbl %dl, %esi       # zero extend the first argument
  addq %rax, %rsi        # 64-bit add
  movslq %ecx, %rdx      # sign extend the second argument
  addq %rax, %rdx        # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them into the
loads at the beginning of the chain of computation by promoting the whole chain
of computation to the extended type. The promotion is done if and only if we do
not introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
  ext(promotableInst1(...(promotableInstN(load))))
  =>
  promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression besides noise.
- Improvements:
  CINT2006/473.astar: ~2%
  Benchmarks/PAQ8p: ~2%
  Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
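The transformation ext(promotableInst(load)) => promotedInst(ext(load)) from the commit message above relies on the two orders of operations producing the same value. The sketch below mirrors the %sextadd computation from the example in C++; `extendAfterAdd` and `addAfterExtend` are hypothetical names, and the equivalence is only claimed when the 32-bit add does not overflow, which is what the `nsw` flag in the IR asserts.

```cpp
#include <cstdint>

// Original order: 32-bit add, then sign-extend the result to 64 bits.
// The IR add carries nsw, so callers must likewise avoid signed overflow.
int64_t extendAfterAdd(uint8_t Ld, int32_t Ld2) {
  int32_t ZExtLd = static_cast<int32_t>(Ld); // zext i8 -> i32
  int32_t Add = Ld2 + ZExtLd;                // add nsw i32
  return static_cast<int64_t>(Add);          // sext i32 -> i64
}

// Promoted order: extend both operands first, then do a 64-bit add.
// Each extension now sits directly on a loaded value, so instruction
// selection can fold it into an extending load.
int64_t addAfterExtend(uint8_t Ld, int32_t Ld2) {
  int64_t ExtLd = static_cast<int64_t>(Ld);   // zext i8 -> i64
  int64_t ExtLd2 = static_cast<int64_t>(Ld2); // sext i32 -> i64
  return ExtLd2 + ExtLd;                      // add i64
}
```

For inputs where the narrow add stays in range, both functions return the same 64-bit result, which is why the promotion does not change program behaviour.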
static cl::opt<bool> DisableExtLdPromotion(
    "disable-cgp-ext-ld-promotion", cl::Hidden, cl::init(false),
    cl::desc("Disable ext(promotable(ld)) -> promoted(ext(ld)) optimization in "
             "CodeGenPrepare"));

static cl::opt<bool> StressExtLdPromotion(
    "stress-cgp-ext-ld-promotion", cl::Hidden, cl::init(false),
    cl::desc("Stress test ext(promotable(ld)) -> promoted(ext(ld)) "
             "optimization in CodeGenPrepare"));

namespace {
typedef SmallPtrSet<Instruction *, 16> SetOfInstrs;
typedef PointerIntPair<Type *, 1, bool> TypeIsSExt;
typedef DenseMap<Instruction *, TypeIsSExt> InstrToOrigTy;
class TypePromotionTransaction;

class CodeGenPrepare : public FunctionPass {
  const TargetMachine *TM;
  const TargetLowering *TLI;
[CodeGenPrepare] Move extractelement close to store if they can be combined.
This patch adds an optimization in CodeGenPrepare to move an extractelement
right before a store when the target can combine them.
The optimization may promote any scalar operations to vector operations in the
way to make that possible.
** Context **
Some targets use different register files for both vector and scalar operations.
This means that transitioning from one domain to another may incur copy from one
register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR
move is 20 cycles.
** Motivating Example **
Let us consider an example:
define void @foo(<2 x i32>* %addr1, i32* %dest) {
%in1 = load <2 x i32>* %addr1, align 8
%extract = extractelement <2 x i32> %in1, i32 1
%out = or i32 %extract, 1
store i32 %out, i32* %dest, align 4
ret void
}
As it is, this IR generates the following assembly on armv7:
vldr d16, [r0] @vector load
vmov.32 r0, d16[1] @ cross-register-file copy: 20 cycles
orr r0, r0, #1 @ scalar bitwise or
str r0, [r1] @ scalar store
bx lr
Whereas we could generate much faster code:
vldr d16, [r0] @ vector load
vorr.i32 d16, #0x1 @ vector bitwise or
vst1.32 {d16[1]}, [r1:32] @ vector extract + store
bx lr
Half of the computation made in the vector is useless, but this allows to get
rid of the expensive cross-register-file copy.
** Proposed Solution **
To avoid this cross-register-copy penalty, we promote the scalar operations to
vector operations. The penalty will be removed if we manage to promote the whole
chain of computation in the vector domain.
Currently, we do that only when the chain of computation ends by a store and the
target is able to combine an extract with a store.
Stores are the most likely candidates, because other instructions produce values
that would need to be promoted and so, extracted as some point[1]. Moreover,
this is customary that targets feature stores that perform a vector extract (see
AArch64 and X86 for instance).
The proposed implementation relies on the TargetTransformInfo to decide whether
or not it is beneficial to promote a chain of computation in the vector domain.
Unfortunately, this interface is rather inaccurate for this level of detail and
although this optimization may be beneficial for X86 and AArch64, the inaccuracy
will lead to the optimization being too aggressive.
Basically in TargetTransformInfo, everything that is legal has a cost of 1,
whereas, even if a vector type is legal, usually a vector operation is slightly
more expensive than its scalar counterpart. That will lead to too many
promotions that may not be counterbalanced by the savings from avoiding the
cross-register-file copy. For instance, on AArch64 this penalty is just 4
cycles.
For now, the optimization is just enabled for ARM prior to v8, since those
processors have a larger penalty on cross-register-file copies, and the scope is
limited to basic blocks. Because of these two factors, we limit the effects of
the inaccuracy. Indeed, I did not want to build up a fancy cost model with block
frequency and everything on top of that.
[1] We can imagine targets that can combine an extractelement with other
instructions than just stores. If we want to go into that direction, the current
interfaces must be augmented and, moreover, I think this becomes a global isel
problem.
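The profitability question discussed above can be sketched as a simple cost
comparison. This is only an illustration, not the code in this patch: the
OpCost struct and shouldPromoteChain function are hypothetical names, and the
costs are plain inputs rather than values queried from TargetTransformInfo.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-operation costs; illustrative inputs only, not values
// returned by TargetTransformInfo.
struct OpCost {
  unsigned Scalar; // cost of the operation in the scalar domain
  unsigned Vector; // cost of the same operation in the vector domain
};

// Promote the chain only if running every operation in the vector domain is
// cheaper than running them as scalars and paying the cross-register-file copy.
bool shouldPromoteChain(const std::vector<OpCost> &Chain, unsigned CopyCost) {
  unsigned ScalarTotal = CopyCost;
  unsigned VectorTotal = 0;
  for (const OpCost &Op : Chain) {
    ScalarTotal += Op.Scalar;
    VectorTotal += Op.Vector;
  }
  return VectorTotal < ScalarTotal;
}
```

With a 20-cycle copy a single slightly more expensive vector operation is an
easy win, whereas with a 4-cycle copy a chain of four such operations is not.
If every legal operation is reported with a cost of 1, the vector total never
exceeds the scalar total, which is the over-aggressiveness described above.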
Differential Revision: http://reviews.llvm.org/D5921
<rdar://problem/14170854>
llvm-svn: 220978
2014-11-01 01:52:53 +08:00
|
|
|
const TargetTransformInfo *TTI;
|
2011-12-01 11:08:23 +08:00
|
|
|
const TargetLibraryInfo *TLInfo;
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// As we scan instructions optimizing them, this is the next instruction
|
|
|
|
/// to optimize. Transforms that can invalidate this should update it.
|
2011-01-15 15:14:54 +08:00
|
|
|
BasicBlock::iterator CurInstIterator;
|
2008-12-20 02:03:11 +08:00
|
|
|
|
2011-03-21 09:19:09 +08:00
|
|
|
/// Keeps track of non-local addresses that have been sunk into a block.
|
|
|
|
/// This allows us to avoid inserting duplicate code for blocks with
|
|
|
|
/// multiple load/stores of the same address.
|
2013-05-08 17:00:10 +08:00
|
|
|
ValueMap<Value*, Value*> SunkAddrs;
|
2011-01-06 08:42:50 +08:00
|
|
|
|
2015-06-18 04:44:32 +08:00
|
|
|
/// Keeps track of all instructions inserted for the current function.
|
|
|
|
SetOfInstrs InsertedInsts;
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Keeps track of the original types of the related instructions before their
|
|
|
|
/// promotion for the current function.
|
|
|
|
InstrToOrigTy PromotedInsts;
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// True if CFG is modified in any way.
|
2011-03-24 23:35:25 +08:00
|
|
|
bool ModifiedDT;
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// True if optimizing for size.
|
CodeGenPrepare: Add a transform to turn selects into branches in some cases.
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic: it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-order architectures are
unlikely to benefit either; those are covered by the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
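The conservative heuristic described above can be sketched as a small
predicate. This is a standalone illustration under stated assumptions: the
SelectInfo struct and its field names are hypothetical and do not correspond to
the data structures used by the pass.

```cpp
#include <cassert>

// Hypothetical summary of a select; the field names are illustrative only.
struct SelectInfo {
  bool HasOneUse;        // the value feeding the select has a single use
  bool HasMemoryOperand; // the compare reads one operand directly from memory
};

// Returns true when the select should be rewritten as an explicit branch.
bool shouldTurnSelectIntoBranch(const SelectInfo &SI, bool OptSize,
                                bool PredictableSelectIsExpensive) {
  // cmovs usually save code size, so keep them when optimizing for size.
  if (OptSize)
    return false;
  // Targets that do not set the flag (e.g. in-order cores) are left alone.
  if (!PredictableSelectIsExpensive)
    return false;
  return SI.HasOneUse && SI.HasMemoryOperand;
}
```

The predicate deliberately ignores branch predictability, since the message
above notes that branch probability info is not available for selects.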
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case, and to him and Evan Cheng for
providing me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
2012-05-05 20:49:22 +08:00
|
|
|
bool OptSize;
|
|
|
|
|
2015-07-08 02:45:17 +08:00
|
|
|
/// DataLayout for the Function being processed.
|
|
|
|
const DataLayout *DL;
|
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
public:
|
2007-05-06 21:37:16 +08:00
|
|
|
static char ID; // Pass identification, replacement for typeid
|
2014-04-14 08:51:57 +08:00
|
|
|
explicit CodeGenPrepare(const TargetMachine *TM = nullptr)
|
2015-07-08 02:45:17 +08:00
|
|
|
: FunctionPass(ID), TM(TM), TLI(nullptr), TTI(nullptr), DL(nullptr) {
|
2010-10-20 01:21:58 +08:00
|
|
|
initializeCodeGenPreparePass(*PassRegistry::getPassRegistry());
|
|
|
|
}
|
2014-03-07 17:26:03 +08:00
|
|
|
bool runOnFunction(Function &F) override;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2014-03-07 17:26:03 +08:00
|
|
|
const char *getPassName() const override { return "CodeGen Prepare"; }
|
2012-12-21 09:48:14 +08:00
|
|
|
|
2014-03-07 17:26:03 +08:00
|
|
|
void getAnalysisUsage(AnalysisUsage &AU) const override {
|
2014-01-13 21:07:17 +08:00
|
|
|
AU.addPreserved<DominatorTreeWrapperPass>();
|
2015-01-15 18:41:28 +08:00
|
|
|
AU.addRequired<TargetLibraryInfoWrapperPass>();
|
[PM] Change the core design of the TTI analysis to use a polymorphic
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting it into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because the chaining-based delegation meant that nothing checked for
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is a *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
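For readers unfamiliar with the pattern, type erasure in C++ can be sketched as
follows. This is a generic, minimal illustration and not the actual TTI
implementation: CostInfo, getAddCost, DummyCosts and SlowTargetCosts are
hypothetical names, and the wrapper here is move-only for brevity.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// A non-template wrapper that can hold any type providing getAddCost().
class CostInfo {
  // Internal abstract interface; clients never see it.
  struct Concept {
    virtual ~Concept() = default;
    virtual unsigned getAddCost() const = 0;
  };

  // Adapts a concrete implementation type T to the internal interface.
  template <typename T> struct Model final : Concept {
    explicit Model(T Value) : Held(std::move(Value)) {}
    unsigned getAddCost() const override { return Held.getAddCost(); }
    T Held;
  };

  std::unique_ptr<Concept> Erased;

public:
  template <typename T>
  explicit CostInfo(T Value)
      : Erased(std::make_unique<Model<T>>(std::move(Value))) {}

  // Forwarding method: every erased operation needs one of these, which is
  // where the verbosity comes from.
  unsigned getAddCost() const { return Erased->getAddCost(); }
};

// Two unrelated implementations; neither inherits from a common base class.
struct DummyCosts {
  unsigned getAddCost() const { return 1; }
};
struct SlowTargetCosts {
  unsigned getAddCost() const { return 4; }
};
```

A client holds a CostInfo by value and calls getAddCost() without knowing
which implementation is behind it, for example CostInfo(SlowTargetCosts{}).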
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
llvm-svn: 227669
2015-01-31 11:43:40 +08:00
|
|
|
AU.addRequired<TargetTransformInfoWrapperPass>();
|
2009-09-16 17:26:52 +08:00
|
|
|
}
|
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
private:
|
2015-09-22 07:03:16 +08:00
|
|
|
bool eliminateFallThrough(Function &F);
|
|
|
|
bool eliminateMostlyEmptyBlocks(Function &F);
|
|
|
|
bool canMergeBlocks(const BasicBlock *BB, const BasicBlock *DestBB) const;
|
|
|
|
void eliminateMostlyEmptyBlock(BasicBlock *BB);
|
|
|
|
bool optimizeBlock(BasicBlock &BB, bool& ModifiedDT);
|
|
|
|
bool optimizeInst(Instruction *I, bool& ModifiedDT);
|
|
|
|
bool optimizeMemoryInst(Instruction *I, Value *Addr,
|
2015-06-05 00:17:38 +08:00
|
|
|
Type *AccessTy, unsigned AS);
|
2015-09-22 07:03:16 +08:00
|
|
|
bool optimizeInlineAsmInst(CallInst *CS);
|
|
|
|
bool optimizeCallInst(CallInst *CI, bool& ModifiedDT);
|
|
|
|
bool moveExtToFormExtLoad(Instruction *&I);
|
|
|
|
bool optimizeExtUses(Instruction *I);
|
|
|
|
bool optimizeSelectInst(SelectInst *SI);
|
|
|
|
bool optimizeShuffleVectorInst(ShuffleVectorInst *SI);
|
|
|
|
bool optimizeExtractElementInst(Instruction *Inst);
|
|
|
|
bool dupRetToEnableTailCallOpts(BasicBlock *BB);
|
|
|
|
bool placeDbgValues(Function &F);
|
2014-03-29 16:22:29 +08:00
|
|
|
bool sinkAndCmp(Function &F);
|
2015-09-22 07:03:16 +08:00
|
|
|
bool extLdPromotion(TypePromotionTransaction &TPT, LoadInst *&LI,
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it
does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
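The rewrite above relies on the extension commuting with the promoted
operation. A minimal sketch of that equivalence for a sign extension over an
add, assuming the 32-bit add does not overflow (the nsw flag in the IR); the
function names are hypothetical and the code is plain C++, not LLVM code:

```cpp
#include <cassert>
#include <cstdint>

// ext(add(load, B)): 32-bit add, then sign-extend the result.
// Assumes *Addr + B does not overflow 32 bits (the "nsw" case).
int64_t extendAfterAdd(const int32_t *Addr, int32_t B) {
  int32_t Sum = *Addr + B;
  return static_cast<int64_t>(Sum);
}

// add(ext(load), ext(B)): the extension now sits directly on the load, where
// the target can fold it into a sign-extending load instruction.
int64_t extendAtLoad(const int32_t *Addr, int32_t B) {
  int64_t Loaded = static_cast<int64_t>(*Addr);
  return Loaded + static_cast<int64_t>(B);
}
```

Both functions return the same value whenever the 32-bit add does not
overflow. Note that the second form extends B as well, which is only free when
B is a constant or is itself foldable; this is why the promotion is done only
if it does not introduce new extensions.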
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
Instruction *&Inst,
|
|
|
|
const SmallVectorImpl<Instruction *> &Exts,
|
2015-03-11 05:48:15 +08:00
|
|
|
unsigned CreatedInstCost);
|
[CodeGenPrepare] Split branch conditions into multiple conditional branches.
This optimization transforms code like:
bb1:
%0 = icmp ne i32 %a, 0
%1 = icmp ne i32 %b, 0
%or.cond = or i1 %0, %1
br i1 %or.cond, label %TrueBB, label %FalseBB
into multiple branch instructions like:
bb1:
%0 = icmp ne i32 %a, 0
br i1 %0, label %TrueBB, label %bb2
bb2:
%1 = icmp ne i32 %b, 0
br i1 %1, label %TrueBB, label %FalseBB
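That the two control-flow shapes above reach the same destination can be
checked with a small standalone model. This is only a sketch: the Dest enum and
both function names are hypothetical, and the code models the IR in plain C++
rather than using any LLVM API.

```cpp
#include <cassert>

enum class Dest { TrueBB, FalseBB };

// Original shape: both compares are evaluated, combined with a bitwise or,
// and a single conditional branch picks the destination.
Dest combinedBranch(int A, int B) {
  bool OrCond = (A != 0) | (B != 0);
  return OrCond ? Dest::TrueBB : Dest::FalseBB;
}

// Split shape: bb1 tests A and either jumps to TrueBB or falls to bb2,
// which tests B.
Dest splitBranch(int A, int B) {
  if (A != 0) // bb1
    return Dest::TrueBB;
  if (B != 0) // bb2
    return Dest::TrueBB;
  return Dest::FalseBB;
}
```

The split form skips the second compare whenever the first one is already
true, at the price of an extra basic block.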
This optimization is already performed by SelectionDAG, but not by FastISel.
FastISel cannot perform this optimization, because it cannot generate new
MachineBasicBlocks.
Performing this optimization at CodeGenPrepare time makes it available to both
SelectionDAG and FastISel, and the implementation in SelectionDAG could be
removed. There are currently a few differences in codegen for X86 and PPC, so
this commit only enables it for FastISel.
Reviewed by Jim Grosbach
This fixes rdar://problem/19034919.
llvm-svn: 223786
2014-12-10 00:36:13 +08:00
|
|
|
bool splitBranchCondition(Function &F);
|
2015-01-15 07:27:07 +08:00
|
|
|
bool simplifyOffsetableRelocate(Instruction &I);
|
2015-09-16 02:32:14 +08:00
|
|
|
void stripInvariantGroupMetadata(Instruction &I);
|
2007-03-31 12:06:36 +08:00
|
|
|
};
|
2015-06-23 17:49:53 +08:00
|
|
|
}
|
2007-05-02 05:15:47 +08:00
|
|
|
|
2007-05-03 09:11:54 +08:00
|
|
|
char CodeGenPrepare::ID = 0;
|
2014-06-11 15:04:37 +08:00
|
|
|
INITIALIZE_TM_PASS(CodeGenPrepare, "codegenprepare",
|
|
|
|
"Optimize for code generation", false, false)
|
2007-03-31 12:06:36 +08:00
|
|
|
|
2013-06-20 05:07:11 +08:00
|
|
|
FunctionPass *llvm::createCodeGenPreparePass(const TargetMachine *TM) {
|
|
|
|
return new CodeGenPrepare(TM);
|
2007-03-31 12:06:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
bool CodeGenPrepare::runOnFunction(Function &F) {
|
2014-04-01 01:43:35 +08:00
|
|
|
if (skipOptnoneFunction(F))
|
|
|
|
return false;
|
|
|
|
|
2015-07-08 02:45:17 +08:00
|
|
|
DL = &F.getParent()->getDataLayout();
|
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
bool EverMadeChange = false;
|
2014-02-07 05:44:56 +08:00
|
|
|
// Clear per-function information.
|
2015-06-18 04:44:32 +08:00
|
|
|
InsertedInsts.clear();
|
2014-02-07 05:44:56 +08:00
|
|
|
PromotedInsts.clear();
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2011-03-24 23:35:25 +08:00
|
|
|
ModifiedDT = false;
|
2014-08-05 05:25:23 +08:00
|
|
|
if (TM)
|
2015-01-27 09:01:38 +08:00
|
|
|
TLI = TM->getSubtargetImpl(F)->getTargetLowering();
|
2015-01-15 18:41:28 +08:00
|
|
|
TLInfo = &getAnalysis<TargetLibraryInfoWrapperPass>().getTLI();
|
2015-02-01 20:01:35 +08:00
|
|
|
TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
|
2015-08-12 03:39:36 +08:00
|
|
|
OptSize = F.optForSize();
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2012-09-05 02:22:17 +08:00
|
|
|
/// This optimization identifies DIV instructions that can be
|
|
|
|
/// profitably bypassed and carried out with a shorter, faster divide.
|
2013-03-05 02:13:57 +08:00
|
|
|
if (!OptSize && TLI && TLI->isSlowDivBypassed()) {
|
2012-10-05 05:33:40 +08:00
|
|
|
const DenseMap<unsigned int, unsigned int> &BypassWidths =
|
|
|
|
TLI->getBypassSlowDivWidths();
|
2012-09-15 05:25:34 +08:00
|
|
|
for (Function::iterator I = F.begin(); I != F.end(); I++)
|
2012-10-05 05:33:40 +08:00
|
|
|
EverMadeChange |= bypassSlowDivision(F, I, BypassWidths);
|
2012-09-05 02:22:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Eliminate blocks that contain only PHI nodes and an
|
2007-04-02 09:35:34 +08:00
|
|
|
// unconditional branch.
|
2015-09-22 07:03:16 +08:00
|
|
|
EverMadeChange |= eliminateMostlyEmptyBlocks(F);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2011-08-18 08:50:51 +08:00
|
|
|
// If llvm.dbg.value is far away from the value, then iSel may not be able to
|
2012-07-24 18:51:42 +08:00
|
|
|
// handle it properly. iSel will drop llvm.dbg.value if it cannot
|
2011-08-18 08:50:51 +08:00
|
|
|
// find a node corresponding to the value.
|
2015-09-22 07:03:16 +08:00
|
|
|
EverMadeChange |= placeDbgValues(F);
|
2011-08-18 08:50:51 +08:00
|
|
|
|
2014-03-29 16:22:29 +08:00
|
|
|
// If there is a mask, compare against zero, and branch that can be combined
|
|
|
|
// into a single target instruction, push the mask and compare into branch
|
|
|
|
// users. Do this before optimizeBlock -> optimizeInst ->
|
|
|
|
// OptimizeCmpExpression, which perturbs the pattern being searched for.
|
2014-12-10 00:36:13 +08:00
|
|
|
if (!DisableBranchOpts) {
|
2014-03-29 16:22:29 +08:00
|
|
|
EverMadeChange |= sinkAndCmp(F);
|
2014-12-10 00:36:13 +08:00
|
|
|
EverMadeChange |= splitBranchCondition(F);
|
|
|
|
}
|
2014-03-29 16:22:29 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
bool MadeChange = true;
|
2007-03-31 12:06:36 +08:00
|
|
|
while (MadeChange) {
|
|
|
|
MadeChange = false;
|
2012-09-19 15:48:16 +08:00
|
|
|
for (Function::iterator I = F.begin(); I != F.end(); ) {
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *BB = &*I++;
|
2014-12-28 16:54:45 +08:00
|
|
|
bool ModifiedDTOnIteration = false;
|
2015-09-22 07:03:16 +08:00
|
|
|
MadeChange |= optimizeBlock(*BB, ModifiedDTOnIteration);
|
2015-01-15 07:27:07 +08:00
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
// Restart BB iteration if the dominator tree of the Function was changed
|
|
|
|
if (ModifiedDTOnIteration)
|
|
|
|
break;
|
2011-03-21 09:19:09 +08:00
|
|
|
}
|
2007-03-31 12:06:36 +08:00
|
|
|
EverMadeChange |= MadeChange;
|
|
|
|
}
|
2011-01-06 08:42:50 +08:00
|
|
|
|
|
|
|
SunkAddrs.clear();
|
|
|
|
|
2011-03-12 05:52:04 +08:00
|
|
|
if (!DisableBranchOpts) {
|
|
|
|
MadeChange = false;
|
2012-03-04 18:46:01 +08:00
|
|
|
SmallPtrSet<BasicBlock*, 8> WorkList;
|
2015-01-09 04:44:33 +08:00
|
|
|
for (BasicBlock &BB : F) {
|
|
|
|
SmallVector<BasicBlock *, 2> Successors(succ_begin(&BB), succ_end(&BB));
|
|
|
|
MadeChange |= ConstantFoldTerminator(&BB, true);
|
2012-03-04 18:46:01 +08:00
|
|
|
if (!MadeChange) continue;
|
|
|
|
|
|
|
|
for (SmallVectorImpl<BasicBlock*>::iterator
|
|
|
|
II = Successors.begin(), IE = Successors.end(); II != IE; ++II)
|
|
|
|
if (pred_begin(*II) == pred_end(*II))
|
|
|
|
WorkList.insert(*II);
|
|
|
|
}
|
|
|
|
|
2012-11-29 07:23:48 +08:00
|
|
|
// Delete the dead blocks and any of their dead successors.
|
2012-12-06 08:30:20 +08:00
|
|
|
MadeChange |= !WorkList.empty();
|
2012-11-29 07:23:48 +08:00
|
|
|
while (!WorkList.empty()) {
|
|
|
|
BasicBlock *BB = *WorkList.begin();
|
|
|
|
WorkList.erase(BB);
|
|
|
|
SmallVector<BasicBlock*, 2> Successors(succ_begin(BB), succ_end(BB));
|
|
|
|
|
|
|
|
DeleteDeadBlock(BB);
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2012-11-29 07:23:48 +08:00
|
|
|
for (SmallVectorImpl<BasicBlock*>::iterator
|
|
|
|
II = Successors.begin(), IE = Successors.end(); II != IE; ++II)
|
|
|
|
if (pred_begin(*II) == pred_end(*II))
|
|
|
|
WorkList.insert(*II);
|
|
|
|
}
|
2011-03-12 05:52:04 +08:00
|
|
|
|
2012-08-14 13:19:07 +08:00
|
|
|
// Merge pairs of basic blocks with unconditional branches, connected by
|
|
|
|
// a single edge.
|
|
|
|
if (EverMadeChange || MadeChange)
|
2015-09-22 07:03:16 +08:00
|
|
|
MadeChange |= eliminateFallThrough(F);
|
2012-08-14 13:19:07 +08:00
|
|
|
|
2011-03-12 05:52:04 +08:00
|
|
|
EverMadeChange |= MadeChange;
|
|
|
|
}
|
|
|
|
|
2015-01-15 07:27:07 +08:00
|
|
|
if (!DisableGCOpts) {
|
|
|
|
SmallVector<Instruction *, 2> Statepoints;
|
|
|
|
for (BasicBlock &BB : F)
|
|
|
|
for (Instruction &I : BB)
|
|
|
|
if (isStatepoint(I))
|
|
|
|
Statepoints.push_back(&I);
|
|
|
|
for (auto &I : Statepoints)
|
|
|
|
EverMadeChange |= simplifyOffsetableRelocate(*I);
|
|
|
|
}
|
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
return EverMadeChange;
|
|
|
|
}
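The dead-block cleanup near the end of runOnFunction is a standard worklist
algorithm. It can be sketched on a toy graph as follows. This is an
illustration only: the CFG alias, hasPredecessor and deleteDeadBlocks are
hypothetical names, and the map-based graph is a stand-in for llvm::Function.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy CFG: block id -> ids of its successors.
using CFG = std::map<int, std::set<int>>;

// True if any remaining block branches to Block.
static bool hasPredecessor(const CFG &G, int Block) {
  for (const auto &Entry : G)
    if (Entry.second.count(Block))
      return true;
  return false;
}

// Deletes every block in WorkList, then every successor that is left without
// predecessors, until no such block remains. Returns the deleted block ids.
std::set<int> deleteDeadBlocks(CFG &G, std::set<int> WorkList) {
  std::set<int> Deleted;
  while (!WorkList.empty()) {
    int BB = *WorkList.begin();
    WorkList.erase(WorkList.begin());
    auto It = G.find(BB);
    if (It == G.end())
      continue; // already deleted
    // Remember the successors before the block goes away.
    std::vector<int> Successors(It->second.begin(), It->second.end());
    G.erase(It);
    Deleted.insert(BB);
    for (int Succ : Successors)
      if (G.count(Succ) && !hasPredecessor(G, Succ))
        WorkList.insert(Succ);
  }
  return Deleted;
}
```

As in the function above, successors are collected before the block is
deleted, because the edges are no longer available afterwards.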
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Merge basic blocks which are connected by a single edge, where one of the
|
|
|
|
/// basic blocks has a single successor pointing to the other basic block,
|
|
|
|
/// which has a single predecessor.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::eliminateFallThrough(Function &F) {
|
2012-08-14 13:19:07 +08:00
|
|
|
bool Changed = false;
|
|
|
|
// Scan all of the blocks in the function, except for the entry block.
|
2014-03-02 20:27:27 +08:00
|
|
|
for (Function::iterator I = std::next(F.begin()), E = F.end(); I != E;) {
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *BB = &*I++;
|
2012-08-14 13:19:07 +08:00
|
|
|
// If the destination block has a single pred, then this is a trivial
|
|
|
|
// edge, just collapse it.
|
|
|
|
BasicBlock *SinglePred = BB->getSinglePredecessor();
|
|
|
|
|
2012-09-29 07:58:57 +08:00
|
|
|
// Don't merge if BB's address is taken.
|
|
|
|
if (!SinglePred || SinglePred == BB || BB->hasAddressTaken()) continue;
|
2012-08-14 13:19:07 +08:00
|
|
|
|
|
|
|
BranchInst *Term = dyn_cast<BranchInst>(SinglePred->getTerminator());
|
|
|
|
if (Term && !Term->isConditional()) {
|
|
|
|
Changed = true;
|
2012-08-21 13:55:22 +08:00
|
|
|
DEBUG(dbgs() << "To merge:\n"<< *SinglePred << "\n\n\n");
|
2012-08-14 13:19:07 +08:00
|
|
|
// Remember if SinglePred was the entry block of the function.
|
|
|
|
// If so, we will need to move BB back to the entry position.
|
|
|
|
bool isEntry = SinglePred == &SinglePred->getParent()->getEntryBlock();
|
2015-03-19 07:17:28 +08:00
|
|
|
MergeBasicBlockIntoOnlyPred(BB, nullptr);
|
2012-08-14 13:19:07 +08:00
|
|
|
|
|
|
|
if (isEntry && BB != &BB->getParent()->getEntryBlock())
|
|
|
|
BB->moveBefore(&BB->getParent()->getEntryBlock());
|
|
|
|
|
|
|
|
// We have erased a block. Update the iterator.
|
2015-10-10 02:44:40 +08:00
|
|
|
I = BB->getIterator();
|
2012-08-14 13:19:07 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return Changed;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Eliminate blocks that contain only PHI nodes, debug info directives, and an
|
|
|
|
/// unconditional branch. Passes before isel (e.g. LSR/loopsimplify) often split
|
|
|
|
/// edges in ways that are non-optimal for isel. Start by eliminating these
|
|
|
|
/// blocks so we can split them the way we want them.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::eliminateMostlyEmptyBlocks(Function &F) {
|
2007-04-02 09:35:34 +08:00
|
|
|
bool MadeChange = false;
|
|
|
|
// Note that this intentionally skips the entry block.
|
2014-03-02 20:27:27 +08:00
|
|
|
for (Function::iterator I = std::next(F.begin()), E = F.end(); I != E;) {
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *BB = &*I++;
|
2007-04-02 09:35:34 +08:00
|
|
|
|
|
|
|
// If this block doesn't end with an uncond branch, ignore it.
|
|
|
|
BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
|
|
|
|
if (!BI || !BI->isUnconditional())
|
|
|
|
continue;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2009-03-27 09:13:37 +08:00
|
|
|
// If the instruction before the branch (skipping debug info) isn't a phi
|
|
|
|
// node, then other stuff is happening here.
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock::iterator BBI = BI->getIterator();
|
2007-04-02 09:35:34 +08:00
|
|
|
if (BBI != BB->begin()) {
|
|
|
|
--BBI;
|
2009-03-27 09:13:37 +08:00
|
|
|
while (isa<DbgInfoIntrinsic>(BBI)) {
|
|
|
|
if (BBI == BB->begin())
|
|
|
|
break;
|
|
|
|
--BBI;
|
|
|
|
}
|
|
|
|
if (!isa<DbgInfoIntrinsic>(BBI) && !isa<PHINode>(BBI))
|
|
|
|
continue;
|
2007-04-02 09:35:34 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// Do not break infinite loops.
|
|
|
|
BasicBlock *DestBB = BI->getSuccessor(0);
|
|
|
|
if (DestBB == BB)
|
|
|
|
continue;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
if (!canMergeBlocks(BB, DestBB))
|
2007-04-02 09:35:34 +08:00
|
|
|
continue;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
eliminateMostlyEmptyBlock(BB);
|
2007-04-02 09:35:34 +08:00
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
return MadeChange;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Return true if we can merge BB into DestBB if there is a single
|
|
|
|
/// unconditional branch between them, and BB contains no other non-phi
|
2007-04-02 09:35:34 +08:00
|
|
|
/// instructions.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::canMergeBlocks(const BasicBlock *BB,
|
2007-04-02 09:35:34 +08:00
|
|
|
const BasicBlock *DestBB) const {
|
|
|
|
// We only want to eliminate blocks whose phi nodes are used by phi nodes in
|
|
|
|
// the successor. If there are more complex conditions (e.g. preheaders),
|
|
|
|
// don't mess around with them.
|
|
|
|
BasicBlock::const_iterator BBI = BB->begin();
|
|
|
|
while (const PHINode *PN = dyn_cast<PHINode>(BBI++)) {
|
2014-03-09 11:16:01 +08:00
|
|
|
for (const User *U : PN->users()) {
|
|
|
|
const Instruction *UI = cast<Instruction>(U);
|
|
|
|
if (UI->getParent() != DestBB || !isa<PHINode>(UI))
|
2007-04-02 09:35:34 +08:00
|
|
|
return false;
|
2008-09-24 13:32:41 +08:00
|
|
|
// If User is inside DestBB block and it is a PHINode then check
|
|
|
|
// incoming value. If incoming value is not from BB then this is
|
2007-04-25 08:37:04 +08:00
|
|
|
// a complex condition (e.g. preheaders) we want to avoid here.
|
2014-03-09 11:16:01 +08:00
|
|
|
if (UI->getParent() == DestBB) {
|
|
|
|
if (const PHINode *UPN = dyn_cast<PHINode>(UI))
|
2007-04-25 08:37:04 +08:00
|
|
|
for (unsigned I = 0, E = UPN->getNumIncomingValues(); I != E; ++I) {
|
|
|
|
Instruction *Insn = dyn_cast<Instruction>(UPN->getIncomingValue(I));
|
|
|
|
if (Insn && Insn->getParent() == BB &&
|
|
|
|
Insn->getParent() != UPN->getIncomingBlock(I))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
2007-04-02 09:35:34 +08:00
|
|
|
}
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// If BB and DestBB contain any common predecessors, then the phi nodes in BB
|
|
|
|
// and DestBB may have conflicting incoming values for the block. If so, we
|
|
|
|
// can't merge the block.
|
|
|
|
const PHINode *DestBBPN = dyn_cast<PHINode>(DestBB->begin());
|
|
|
|
if (!DestBBPN) return true; // no conflict.
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// Collect the preds of BB.
|
2007-11-07 06:07:40 +08:00
|
|
|
SmallPtrSet<const BasicBlock*, 16> BBPreds;
|
2007-04-02 09:35:34 +08:00
|
|
|
if (const PHINode *BBPN = dyn_cast<PHINode>(BB->begin())) {
|
|
|
|
// It is faster to get preds from a PHI than with pred_iterator.
|
|
|
|
for (unsigned i = 0, e = BBPN->getNumIncomingValues(); i != e; ++i)
|
|
|
|
BBPreds.insert(BBPN->getIncomingBlock(i));
|
|
|
|
} else {
|
|
|
|
BBPreds.insert(pred_begin(BB), pred_end(BB));
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// Walk the preds of DestBB.
|
|
|
|
for (unsigned i = 0, e = DestBBPN->getNumIncomingValues(); i != e; ++i) {
|
|
|
|
BasicBlock *Pred = DestBBPN->getIncomingBlock(i);
|
|
|
|
if (BBPreds.count(Pred)) { // Common predecessor?
|
|
|
|
BBI = DestBB->begin();
|
|
|
|
while (const PHINode *PN = dyn_cast<PHINode>(BBI++)) {
|
|
|
|
const Value *V1 = PN->getIncomingValueForBlock(Pred);
|
|
|
|
const Value *V2 = PN->getIncomingValueForBlock(BB);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// If V2 is a phi node in BB, look up what the mapped value will be.
|
|
|
|
if (const PHINode *V2PN = dyn_cast<PHINode>(V2))
|
|
|
|
if (V2PN->getParent() == BB)
|
|
|
|
V2 = V2PN->getIncomingValueForBlock(Pred);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// If there is a conflict, bail out.
|
|
|
|
if (V1 != V2) return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
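// Illustrative sketch only, not part of this pass: the common-predecessor
// check above, modeled with standard containers. Every name below
// (Phi, phisAllowMerge, the string block and value names) is hypothetical.
// A phi is assumed to be reducible to a map from incoming block to incoming
// value, and every phi in the destination is assumed to have an entry for BB.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One phi node, reduced to a map from incoming block name to value name.
using Phi = std::map<std::string, std::string>;

// Returns true when no phi in the destination block would receive two
// different values from a predecessor shared by BB and the destination.
// BBPhis maps the name of each phi defined in BB to that phi's contents.
bool phisAllowMerge(const std::string &BB,
                    const std::set<std::string> &BBPreds,
                    const std::map<std::string, Phi> &BBPhis,
                    const std::vector<Phi> &DestPhis) {
  for (const Phi &PN : DestPhis) {
    assert(PN.count(BB) && "destination phi has no entry for BB");
    for (const auto &In : PN) {
      const std::string &Pred = In.first;
      if (!BBPreds.count(Pred))
        continue; // Not a common predecessor.
      const std::string &V1 = In.second; // Value arriving directly from Pred.
      std::string V2 = PN.at(BB);        // Value arriving through BB.
      // If V2 is itself a phi in BB, look up what it maps to for Pred.
      auto It = BBPhis.find(V2);
      if (It != BBPhis.end())
        V2 = It->second.at(Pred);
      if (V1 != V2)
        return false; // Conflict: merging would need two values for one edge.
    }
  }
  return true;
}
```

// The sketch compares value names; the real check compares Value pointers.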
|
|
|
|
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Eliminate a basic block that has only phi's and an unconditional branch in
|
|
|
|
/// it.
|
2015-09-22 07:03:16 +08:00
|
|
|
void CodeGenPrepare::eliminateMostlyEmptyBlock(BasicBlock *BB) {
|
2007-04-02 09:35:34 +08:00
|
|
|
BranchInst *BI = cast<BranchInst>(BB->getTerminator());
|
|
|
|
BasicBlock *DestBB = BI->getSuccessor(0);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "MERGING MOSTLY EMPTY BLOCKS - BEFORE:\n" << *BB << *DestBB);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// If the destination block has a single pred, then this is a trivial edge,
|
|
|
|
// just collapse it.
|
2008-11-28 03:29:14 +08:00
|
|
|
if (BasicBlock *SinglePred = DestBB->getSinglePredecessor()) {
|
2008-11-29 03:54:49 +08:00
|
|
|
if (SinglePred != DestBB) {
|
|
|
|
// Remember if SinglePred was the entry block of the function. If so, we
|
|
|
|
// will need to move BB back to the entry position.
|
|
|
|
bool isEntry = SinglePred == &SinglePred->getParent()->getEntryBlock();
|
2015-03-19 07:17:28 +08:00
|
|
|
MergeBasicBlockIntoOnlyPred(DestBB, nullptr);
|
2008-11-29 03:54:49 +08:00
|
|
|
|
|
|
|
if (isEntry && BB != &BB->getParent()->getEntryBlock())
|
|
|
|
BB->moveBefore(&BB->getParent()->getEntryBlock());
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "AFTER:\n" << *DestBB << "\n\n\n");
|
2008-11-29 03:54:49 +08:00
|
|
|
return;
|
|
|
|
}
|
2007-04-02 09:35:34 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// Otherwise, we have multiple predecessors of BB. Update the PHIs in DestBB
|
|
|
|
// to handle the new incoming edges it is about to have.
|
|
|
|
PHINode *PN;
|
|
|
|
for (BasicBlock::iterator BBI = DestBB->begin();
|
|
|
|
(PN = dyn_cast<PHINode>(BBI)); ++BBI) {
|
|
|
|
// Remove the incoming value for BB, and remember it.
|
|
|
|
Value *InVal = PN->removeIncomingValue(BB, false);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// Two options: either the InVal is a phi node defined in BB or it is some
|
|
|
|
// value that dominates BB.
|
|
|
|
PHINode *InValPhi = dyn_cast<PHINode>(InVal);
|
|
|
|
if (InValPhi && InValPhi->getParent() == BB) {
|
|
|
|
// Add all of the input values of the input PHI as inputs of this phi.
|
|
|
|
for (unsigned i = 0, e = InValPhi->getNumIncomingValues(); i != e; ++i)
|
|
|
|
PN->addIncoming(InValPhi->getIncomingValue(i),
|
|
|
|
InValPhi->getIncomingBlock(i));
|
|
|
|
} else {
|
|
|
|
// Otherwise, add one instance of the dominating value for each edge that
|
|
|
|
// we will be adding.
|
|
|
|
if (PHINode *BBPN = dyn_cast<PHINode>(BB->begin())) {
|
|
|
|
for (unsigned i = 0, e = BBPN->getNumIncomingValues(); i != e; ++i)
|
|
|
|
PN->addIncoming(InVal, BBPN->getIncomingBlock(i));
|
|
|
|
} else {
|
2014-07-22 01:06:51 +08:00
|
|
|
for (pred_iterator PI = pred_begin(BB), E = pred_end(BB); PI != E; ++PI)
|
|
|
|
PN->addIncoming(InVal, *PI);
|
2007-04-02 09:35:34 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-02 09:35:34 +08:00
|
|
|
// The PHIs are now updated, change everything that refers to BB to use
|
|
|
|
// DestBB and remove BB.
|
|
|
|
BB->replaceAllUsesWith(DestBB);
|
|
|
|
BB->eraseFromParent();
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumBlocksElim;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "AFTER:\n" << *DestBB << "\n\n\n");
|
2007-04-02 09:35:34 +08:00
|
|
|
}
|
|
|
|
|
2015-01-15 07:27:07 +08:00
|
|
|
// Computes a map of base pointer relocation instructions to corresponding
|
|
|
|
// derived pointer relocation instructions given a vector of all relocate calls
|
|
|
|
static void computeBaseDerivedRelocateMap(
|
|
|
|
const SmallVectorImpl<User *> &AllRelocateCalls,
|
|
|
|
DenseMap<IntrinsicInst *, SmallVector<IntrinsicInst *, 2>> &
|
|
|
|
RelocateInstMap) {
|
|
|
|
// Collect information in two maps: one primarily for locating the base object
|
|
|
|
// while filling the second map; the second map is the final structure holding
|
|
|
|
// a mapping between Base and corresponding Derived relocate calls
|
|
|
|
DenseMap<std::pair<unsigned, unsigned>, IntrinsicInst *> RelocateIdxMap;
|
|
|
|
for (auto &U : AllRelocateCalls) {
|
|
|
|
GCRelocateOperands ThisRelocate(U);
|
|
|
|
IntrinsicInst *I = cast<IntrinsicInst>(U);
|
2015-05-06 10:36:26 +08:00
|
|
|
auto K = std::make_pair(ThisRelocate.getBasePtrIndex(),
|
|
|
|
ThisRelocate.getDerivedPtrIndex());
|
2015-01-15 07:27:07 +08:00
|
|
|
RelocateIdxMap.insert(std::make_pair(K, I));
|
|
|
|
}
|
|
|
|
for (auto &Item : RelocateIdxMap) {
|
|
|
|
std::pair<unsigned, unsigned> Key = Item.first;
|
|
|
|
if (Key.first == Key.second)
|
|
|
|
// Base relocation: nothing to insert
|
|
|
|
continue;
|
|
|
|
|
|
|
|
IntrinsicInst *I = Item.second;
|
|
|
|
auto BaseKey = std::make_pair(Key.first, Key.first);
|
2015-02-27 10:24:16 +08:00
|
|
|
|
|
|
|
// We're iterating over RelocateIdxMap so we cannot modify it.
|
|
|
|
auto MaybeBase = RelocateIdxMap.find(BaseKey);
|
|
|
|
if (MaybeBase == RelocateIdxMap.end())
|
2015-01-15 07:27:07 +08:00
|
|
|
// TODO: We might want to insert a new base object relocate and gep off
|
|
|
|
// that, if there are enough derived object relocates.
|
|
|
|
continue;
|
2015-02-27 10:24:16 +08:00
|
|
|
|
|
|
|
RelocateInstMap[MaybeBase->second].push_back(I);
|
2015-01-15 07:27:07 +08:00
|
|
|
}
|
|
|
|
}
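// Illustrative sketch only, not part of this pass: the two-map grouping used
// above, written against the standard library. The Relocate struct and
// groupDerivedByBase are hypothetical; a relocate call is assumed to be
// reducible to a name plus its (base index, derived index) pair.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a gc.relocate call.
struct Relocate {
  std::string Name;
  unsigned BaseIdx;
  unsigned DerivedIdx;
};

// Maps each base relocate (BaseIdx == DerivedIdx) to the derived relocates
// that share its base index. Derived relocates whose base was never
// relocated on its own are skipped.
std::map<std::string, std::vector<std::string>>
groupDerivedByBase(const std::vector<Relocate> &All) {
  // First map: (base index, derived index) -> relocate, used for lookup.
  std::map<std::pair<unsigned, unsigned>, std::string> ByIdx;
  for (const Relocate &R : All)
    ByIdx.emplace(std::make_pair(R.BaseIdx, R.DerivedIdx), R.Name);

  // Second map: the final base -> derived grouping.
  std::map<std::string, std::vector<std::string>> Result;
  for (const auto &Item : ByIdx) {
    const std::pair<unsigned, unsigned> &Key = Item.first;
    if (Key.first == Key.second)
      continue; // A base relocation: nothing to insert.
    auto MaybeBase = ByIdx.find(std::make_pair(Key.first, Key.first));
    if (MaybeBase == ByIdx.end())
      continue; // No relocate of the base itself.
    Result[MaybeBase->second].push_back(Item.second);
  }
  return Result;
}
```

// Building the index map first means the second loop only reads ByIdx, so
// iterating over it while looking up base keys is safe.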
|
|
|
|
|
|
|
|
// Accepts a GEP and extracts the operands into a vector provided they're all
|
|
|
|
// small integer constants
|
|
|
|
static bool getGEPSmallConstantIntOffsetV(GetElementPtrInst *GEP,
|
|
|
|
SmallVectorImpl<Value *> &OffsetV) {
|
|
|
|
for (unsigned i = 1; i < GEP->getNumOperands(); i++) {
|
|
|
|
// Only accept small constant integer operands
|
|
|
|
auto Op = dyn_cast<ConstantInt>(GEP->getOperand(i));
|
|
|
|
if (!Op || Op->getZExtValue() > 20)
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (unsigned i = 1; i < GEP->getNumOperands(); i++)
|
|
|
|
OffsetV.push_back(GEP->getOperand(i));
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Takes a RelocatedBase (base pointer relocation instruction) and Targets to
|
|
|
|
// replace, computes a replacement, and applies it.
|
|
|
|
static bool
|
|
|
|
simplifyRelocatesOffABase(IntrinsicInst *RelocatedBase,
|
|
|
|
const SmallVectorImpl<IntrinsicInst *> &Targets) {
|
|
|
|
bool MadeChange = false;
|
|
|
|
for (auto &ToReplace : Targets) {
|
|
|
|
GCRelocateOperands MasterRelocate(RelocatedBase);
|
|
|
|
GCRelocateOperands ThisRelocate(ToReplace);
|
|
|
|
|
2015-05-06 10:36:26 +08:00
|
|
|
assert(ThisRelocate.getBasePtrIndex() == MasterRelocate.getBasePtrIndex() &&
|
2015-01-15 07:27:07 +08:00
|
|
|
"Not relocating a derived object of the original base object");
|
2015-05-06 10:36:26 +08:00
|
|
|
if (ThisRelocate.getBasePtrIndex() == ThisRelocate.getDerivedPtrIndex()) {
|
2015-01-15 07:27:07 +08:00
|
|
|
// A duplicate relocate call. TODO: coalesce duplicates.
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2015-05-06 10:36:26 +08:00
|
|
|
Value *Base = ThisRelocate.getBasePtr();
|
|
|
|
auto Derived = dyn_cast<GetElementPtrInst>(ThisRelocate.getDerivedPtr());
|
2015-01-15 07:27:07 +08:00
|
|
|
if (!Derived || Derived->getPointerOperand() != Base)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
SmallVector<Value *, 2> OffsetV;
|
|
|
|
if (!getGEPSmallConstantIntOffsetV(Derived, OffsetV))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
// Create a Builder and replace the target callsite with a gep
|
2015-05-12 07:47:30 +08:00
|
|
|
assert(RelocatedBase->getNextNode() && "Should always have one since it's not a terminator");
|
|
|
|
|
|
|
|
// Insert after RelocatedBase
|
|
|
|
IRBuilder<> Builder(RelocatedBase->getNextNode());
|
2015-01-15 07:27:07 +08:00
|
|
|
Builder.SetCurrentDebugLocation(ToReplace->getDebugLoc());
|
[RewriteStatepointsForGC] Fix a bug on creating gc_relocate for pointer to vector of pointers
Summary:
In RewriteStatepointsForGC pass, we create a gc_relocate intrinsic for
each relocated pointer, and the gc_relocate has the same type with the
pointer. During the creation of gc_relocate intrinsic, llvm requires to
mangle its type. However, llvm does not support mangling of all possible
types. RewriteStatepointsForGC will hit an assertion failure when it
tries to create a gc_relocate for pointer to vector of pointers because
mangling for vector of pointers is not supported.
This patch changes the way RewriteStatepointsForGC pass creates
gc_relocate. For each relocated pointer, we erase the type of pointers
and create an unified gc_relocate of type i8 addrspace(1)*. Then a
bitcast is inserted to convert the gc_relocate to the correct type. In
this way, gc_relocate does not need to deal with different types of
pointers and the unsupported type mangling is no longer a problem. This
change would also ease further merge when LLVM erases types of pointers
and introduces an unified pointer type.
Some minor changes are also introduced to gc_relocate related part in
InstCombineCalls, CodeGenPrepare, and Verifier accordingly.
Patch by Chen Li!
Reviewers: reames, AndyAyers, sanjoy
Reviewed By: sanjoy
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9592
llvm-svn: 237009
2015-05-12 02:49:34 +08:00
|
|
|
|
|
|
|
// If gc_relocate does not match the actual type, cast it to the right type.
|
|
|
|
// In theory, there must be a bitcast after gc_relocate if the type does not
|
|
|
|
// match, and we should reuse it to get the derived pointer. But there can be
|
|
|
|
// cases like this:
|
|
|
|
// bb1:
|
|
|
|
// ...
|
|
|
|
// %g1 = call coldcc i8 addrspace(1)* @llvm.experimental.gc.relocate.p1i8(...)
|
|
|
|
// br label %merge
|
|
|
|
//
|
|
|
|
// bb2:
|
|
|
|
// ...
|
|
|
|
// %g2 = call coldcc i8 addrspace(1)* @llvm.experimental.gc.relocate.p1i8(...)
|
|
|
|
// br label %merge
|
|
|
|
//
|
|
|
|
// merge:
|
|
|
|
// %p1 = phi i8 addrspace(1)* [ %g1, %bb1 ], [ %g2, %bb2 ]
|
|
|
|
// %cast = bitcast i8 addrspace(1)* %p1 to i32 addrspace(1)*
|
|
|
|
//
|
|
|
|
// In this case, we cannot find the bitcast any more, so we insert a new bitcast
|
|
|
|
// whether or not one already exists. In this way, we can handle all cases, and
|
|
|
|
// the extra bitcast should be optimized away in later passes.
|
|
|
|
Instruction *ActualRelocatedBase = RelocatedBase;
|
|
|
|
if (RelocatedBase->getType() != Base->getType()) {
|
|
|
|
ActualRelocatedBase =
|
|
|
|
cast<Instruction>(Builder.CreateBitCast(RelocatedBase, Base->getType()));
|
|
|
|
}
|
2015-03-25 06:38:16 +08:00
|
|
|
Value *Replacement = Builder.CreateGEP(
|
2015-05-12 02:49:34 +08:00
|
|
|
Derived->getSourceElementType(), ActualRelocatedBase, makeArrayRef(OffsetV));
|
2015-01-15 07:27:07 +08:00
|
|
|
Instruction *ReplacementInst = cast<Instruction>(Replacement);
|
|
|
|
Replacement->takeName(ToReplace);
|
2015-05-12 02:49:34 +08:00
|
|
|
// If the newly generated derived pointer's type does not match the original derived
|
|
|
|
// pointer's type, cast the new derived pointer to match it. Same reasoning as above.
|
|
|
|
Instruction *ActualReplacement = ReplacementInst;
|
|
|
|
if (ReplacementInst->getType() != ToReplace->getType()) {
|
|
|
|
ActualReplacement =
|
|
|
|
cast<Instruction>(Builder.CreateBitCast(ReplacementInst, ToReplace->getType()));
|
|
|
|
}
|
|
|
|
ToReplace->replaceAllUsesWith(ActualReplacement);
|
2015-01-15 07:27:07 +08:00
|
|
|
ToReplace->eraseFromParent();
|
|
|
|
|
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
return MadeChange;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Turns this:
|
|
|
|
//
|
|
|
|
// %base = ...
|
|
|
|
// %ptr = gep %base + 15
|
|
|
|
// %tok = statepoint (%fun, i32 0, i32 0, i32 0, %base, %ptr)
|
|
|
|
// %base' = relocate(%tok, i32 4, i32 4)
|
|
|
|
// %ptr' = relocate(%tok, i32 4, i32 5)
|
|
|
|
// %val = load %ptr'
|
|
|
|
//
|
|
|
|
// into this:
|
|
|
|
//
|
|
|
|
// %base = ...
|
|
|
|
// %ptr = gep %base + 15
|
|
|
|
// %tok = statepoint (%fun, i32 0, i32 0, i32 0, %base, %ptr)
|
|
|
|
// %base' = gc.relocate(%tok, i32 4, i32 4)
|
|
|
|
// %ptr' = gep %base' + 15
|
|
|
|
// %val = load %ptr'
|
|
|
|
bool CodeGenPrepare::simplifyOffsetableRelocate(Instruction &I) {
|
|
|
|
bool MadeChange = false;
|
|
|
|
SmallVector<User *, 2> AllRelocateCalls;
|
|
|
|
|
|
|
|
for (auto *U : I.users())
|
|
|
|
if (isGCRelocate(dyn_cast<Instruction>(U)))
|
|
|
|
// Collect all the relocate calls associated with a statepoint
|
|
|
|
AllRelocateCalls.push_back(U);
|
|
|
|
|
|
|
|
// We need at least one base pointer relocation + one derived pointer
|
|
|
|
// relocation to mangle
|
|
|
|
if (AllRelocateCalls.size() < 2)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// RelocateInstMap is a mapping from the base relocate instruction to the
|
|
|
|
// corresponding derived relocate instructions
|
|
|
|
DenseMap<IntrinsicInst *, SmallVector<IntrinsicInst *, 2>> RelocateInstMap;
|
|
|
|
computeBaseDerivedRelocateMap(AllRelocateCalls, RelocateInstMap);
|
|
|
|
if (RelocateInstMap.empty())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
for (auto &Item : RelocateInstMap)
|
|
|
|
// Item.first is the RelocatedBase to offset against
|
|
|
|
// Item.second is the vector of Targets to replace
|
|
|
|
MadeChange |= simplifyRelocatesOffABase(Item.first, Item.second);
|
|
|
|
return MadeChange;
|
|
|
|
}
|
|
|
|
|
2014-03-13 21:36:25 +08:00
|
|
|
/// SinkCast - Sink the specified cast instruction into its user blocks
|
|
|
|
static bool SinkCast(CastInst *CI) {
|
2007-03-31 12:06:36 +08:00
|
|
|
BasicBlock *DefBB = CI->getParent();
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
/// InsertedCasts - Only insert a cast in each block once.
|
2007-06-13 00:50:17 +08:00
|
|
|
DenseMap<BasicBlock*, CastInst*> InsertedCasts;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
bool MadeChange = false;
|
2014-03-09 11:16:01 +08:00
|
|
|
for (Value::user_iterator UI = CI->user_begin(), E = CI->user_end();
|
2007-03-31 12:06:36 +08:00
|
|
|
UI != E; ) {
|
|
|
|
Use &TheUse = UI.getUse();
|
|
|
|
Instruction *User = cast<Instruction>(*UI);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// Figure out which BB this cast is used in. For PHI's this is the
|
|
|
|
// appropriate predecessor block.
|
|
|
|
BasicBlock *UserBB = User->getParent();
|
|
|
|
if (PHINode *PN = dyn_cast<PHINode>(User)) {
|
2014-03-09 11:16:01 +08:00
|
|
|
UserBB = PN->getIncomingBlock(TheUse);
|
2007-03-31 12:06:36 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// Preincrement use iterator so we don't invalidate it.
|
|
|
|
++UI;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// If this user is in the same block as the cast, don't change the cast.
|
|
|
|
if (UserBB == DefBB) continue;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// If we have already inserted a cast into this block, use it.
|
|
|
|
CastInst *&InsertedCast = InsertedCasts[UserBB];
|
|
|
|
|
|
|
|
if (!InsertedCast) {
|
2011-08-17 04:45:24 +08:00
|
|
|
BasicBlock::iterator InsertPt = UserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != UserBB->end());
|
|
|
|
InsertedCast = CastInst::Create(CI->getOpcode(), CI->getOperand(0),
|
|
|
|
CI->getType(), "", &*InsertPt);
|
2007-03-31 12:06:36 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// Replace a use of the cast with a use of the new cast.
|
2007-03-31 12:06:36 +08:00
|
|
|
TheUse = InsertedCast;
|
2015-04-11 06:25:36 +08:00
|
|
|
MadeChange = true;
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumCastUses;
|
2007-03-31 12:06:36 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// If we removed all uses, nuke the cast.
|
2008-01-21 00:51:46 +08:00
|
|
|
if (CI->use_empty()) {
|
2007-03-31 12:06:36 +08:00
|
|
|
CI->eraseFromParent();
|
2008-01-21 00:51:46 +08:00
|
|
|
MadeChange = true;
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
return MadeChange;
|
|
|
|
}
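// Illustrative sketch only, not part of this pass: the "insert at most one
// copy per user block" bookkeeping that SinkCast does with InsertedCasts.
// countSunkCasts is a hypothetical helper and blocks are assumed to be plain
// integer ids; it only counts the copies that sinking would create.

```cpp
#include <cassert>
#include <set>
#include <vector>

// UserBBs holds, for each use of the cast, the id of the block that use
// lives in (for a phi use, the corresponding incoming block).
unsigned countSunkCasts(int DefBB, const std::vector<int> &UserBBs) {
  std::set<int> BlocksWithCopy; // Blocks that already received a copy.
  unsigned NumCreated = 0;
  for (int UserBB : UserBBs) {
    if (UserBB == DefBB)
      continue; // Uses in the defining block keep the original cast.
    // insert() reports whether the block was new; only then is a copy made.
    if (BlocksWithCopy.insert(UserBB).second)
      ++NumCreated;
  }
  return NumCreated;
}
```

// For uses in blocks {0, 1, 1, 2} with the cast defined in block 0, the
// sketch creates two copies: one for block 1 and one for block 2.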
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// If the specified cast instruction is a noop copy (e.g. it's casting from
|
|
|
|
/// one pointer type to another, i32->i8 on PPC), sink it into user blocks to
|
|
|
|
/// reduce the number of virtual registers that must be created and coalesced.
|
2014-03-13 21:36:25 +08:00
|
|
|
///
|
|
|
|
/// Return true if any changes are made.
|
|
|
|
///
|
2015-07-09 10:09:04 +08:00
|
|
|
static bool OptimizeNoopCopyExpression(CastInst *CI, const TargetLowering &TLI,
|
|
|
|
const DataLayout &DL) {
|
2014-03-13 21:36:25 +08:00
|
|
|
// If this is a noop copy,
|
2015-07-09 10:09:04 +08:00
|
|
|
EVT SrcVT = TLI.getValueType(DL, CI->getOperand(0)->getType());
|
|
|
|
EVT DstVT = TLI.getValueType(DL, CI->getType());
|
2014-03-13 21:36:25 +08:00
|
|
|
|
|
|
|
// This is an fp<->int conversion?
|
|
|
|
if (SrcVT.isInteger() != DstVT.isInteger())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// If this is an extension, it will be a zero or sign extension, which
|
|
|
|
// isn't a noop.
|
|
|
|
if (SrcVT.bitsLT(DstVT)) return false;
|
|
|
|
|
|
|
|
// If these values will be promoted, find out what they will be promoted
|
|
|
|
// to. This helps us consider truncates on PPC as noop copies when they
|
|
|
|
// are.
|
|
|
|
if (TLI.getTypeAction(CI->getContext(), SrcVT) ==
|
|
|
|
TargetLowering::TypePromoteInteger)
|
|
|
|
SrcVT = TLI.getTypeToTransformTo(CI->getContext(), SrcVT);
|
|
|
|
if (TLI.getTypeAction(CI->getContext(), DstVT) ==
|
|
|
|
TargetLowering::TypePromoteInteger)
|
|
|
|
DstVT = TLI.getTypeToTransformTo(CI->getContext(), DstVT);
|
|
|
|
|
|
|
|
// If, after promotion, these are the same types, this is a noop copy.
|
|
|
|
if (SrcVT != DstVT)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return SinkCast(CI);
|
|
|
|
}
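// Illustrative sketch only, not part of this pass: the noop-copy test above
// with a toy type model. SimpleVT, promote, and isNoopCopy are hypothetical,
// and the promotion rule (integers narrower than the register width are
// widened to it) is an assumption standing in for the target's
// getTypeAction / getTypeToTransformTo answers.

```cpp
#include <cassert>

// Hypothetical value type: integer or floating point, plus a bit width.
struct SimpleVT {
  bool IsInteger;
  unsigned Bits;
};

// Assumed promotion rule: widen narrow integers to the register width.
SimpleVT promote(SimpleVT VT, unsigned RegBits) {
  if (VT.IsInteger && VT.Bits < RegBits)
    VT.Bits = RegBits;
  return VT;
}

bool isNoopCopy(SimpleVT Src, SimpleVT Dst, unsigned RegBits) {
  assert(RegBits > 0 && "register width must be positive");
  if (Src.IsInteger != Dst.IsInteger)
    return false; // fp<->int conversion.
  if (Src.Bits < Dst.Bits)
    return false; // An extension is a zero or sign extension, not a noop.
  Src = promote(Src, RegBits);
  Dst = promote(Dst, RegBits);
  // Same type after promotion means the cast costs nothing.
  return Src.Bits == Dst.Bits;
}
```

// Under this model an i32 -> i8 truncate is a noop with 32-bit registers,
// because both sides end up as i32; an i64 -> i32 truncate is not.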
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Try to combine CI into a call to the llvm.uadd.with.overflow intrinsic if
|
|
|
|
/// possible.
|
2015-04-11 05:07:09 +08:00
|
|
|
///
|
|
|
|
/// Return true if any changes were made.
|
|
|
|
static bool CombineUAddWithOverflow(CmpInst *CI) {
|
|
|
|
Value *A, *B;
|
|
|
|
Instruction *AddI;
|
|
|
|
if (!match(CI,
|
|
|
|
m_UAddWithOverflow(m_Value(A), m_Value(B), m_Instruction(AddI))))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
Type *Ty = AddI->getType();
|
|
|
|
if (!isa<IntegerType>(Ty))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// We don't want to move around uses of condition values this late, so we
|
|
|
|
// check if it is legal to create the call to the intrinsic in the basic
|
|
|
|
// block containing the icmp:
|
|
|
|
|
|
|
|
if (AddI->getParent() != CI->getParent() && !AddI->hasOneUse())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
// Someday m_UAddWithOverflow may get smarter, but this is a safe assumption
|
|
|
|
// for now:
|
|
|
|
if (AddI->hasOneUse())
|
|
|
|
assert(*AddI->user_begin() == CI && "expected!");
|
|
|
|
#endif
|
|
|
|
|
|
|
|
Module *M = CI->getParent()->getParent()->getParent();
|
|
|
|
Value *F = Intrinsic::getDeclaration(M, Intrinsic::uadd_with_overflow, Ty);
|
|
|
|
|
|
|
|
auto *InsertPt = AddI->hasOneUse() ? CI : AddI;
|
|
|
|
|
|
|
|
auto *UAddWithOverflow =
|
|
|
|
CallInst::Create(F, {A, B}, "uadd.overflow", InsertPt);
|
|
|
|
auto *UAdd = ExtractValueInst::Create(UAddWithOverflow, 0, "uadd", InsertPt);
|
|
|
|
auto *Overflow =
|
|
|
|
ExtractValueInst::Create(UAddWithOverflow, 1, "overflow", InsertPt);
|
|
|
|
|
|
|
|
CI->replaceAllUsesWith(Overflow);
|
|
|
|
AddI->replaceAllUsesWith(UAdd);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
AddI->eraseFromParent();
|
|
|
|
return true;
|
|
|
|
}
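// Illustrative sketch only, not part of this pass: the source-level idiom
// that an unsigned add plus compare corresponds to, and what the combined
// intrinsic computes. uaddWithOverflow is a hypothetical 32-bit stand-in for
// llvm.uadd.with.overflow; it returns the wrapped sum and the overflow bit.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

std::pair<std::uint32_t, bool> uaddWithOverflow(std::uint32_t A,
                                                std::uint32_t B) {
  std::uint32_t Sum = A + B; // Unsigned addition wraps modulo 2^32.
  bool Overflow = Sum < A;   // The sum is smaller than an operand only on wrap.
  return {Sum, Overflow};
}
```

// The pair mirrors the two extractvalue results created above: element 0
// replaces the add, element 1 replaces the compare.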
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Sink the given CmpInst into user blocks to reduce the number of virtual
|
|
|
|
/// registers that must be created and coalesced. This is a clear win except on
|
|
|
|
/// targets with multiple condition code registers (PowerPC), where it might
|
|
|
|
/// lose; some adjustment may be wanted there.
|
2007-06-13 00:50:17 +08:00
|
|
|
///
|
|
|
|
/// Return true if any changes are made.
|
2015-04-11 05:07:09 +08:00
|
|
|
static bool SinkCmpExpression(CmpInst *CI) {
|
2007-06-13 00:50:17 +08:00
|
|
|
BasicBlock *DefBB = CI->getParent();
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Only insert a cmp in each block once.
|
2007-06-13 00:50:17 +08:00
|
|
|
DenseMap<BasicBlock*, CmpInst*> InsertedCmps;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
bool MadeChange = false;
|
2014-03-09 11:16:01 +08:00
|
|
|
for (Value::user_iterator UI = CI->user_begin(), E = CI->user_end();
|
2007-06-13 00:50:17 +08:00
|
|
|
UI != E; ) {
|
|
|
|
Use &TheUse = UI.getUse();
|
|
|
|
Instruction *User = cast<Instruction>(*UI);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// Preincrement use iterator so we don't invalidate it.
|
|
|
|
++UI;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// Don't bother for PHI nodes.
|
|
|
|
if (isa<PHINode>(User))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
// Figure out which BB this cmp is used in.
|
|
|
|
BasicBlock *UserBB = User->getParent();
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// If this user is in the same block as the cmp, don't change the cmp.
|
|
|
|
if (UserBB == DefBB) continue;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// If we have already inserted a cmp into this block, use it.
|
|
|
|
CmpInst *&InsertedCmp = InsertedCmps[UserBB];
|
|
|
|
|
|
|
|
if (!InsertedCmp) {
|
2011-08-17 04:45:24 +08:00
|
|
|
BasicBlock::iterator InsertPt = UserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != UserBB->end());
|
2008-09-24 13:32:41 +08:00
|
|
|
InsertedCmp =
|
2015-10-10 02:44:40 +08:00
|
|
|
CmpInst::Create(CI->getOpcode(), CI->getPredicate(),
|
|
|
|
CI->getOperand(0), CI->getOperand(1), "", &*InsertPt);
|
2007-06-13 00:50:17 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// Replace a use of the cmp with a use of the new cmp.
|
|
|
|
TheUse = InsertedCmp;
|
2015-04-11 06:25:36 +08:00
|
|
|
MadeChange = true;
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumCmpUses;
|
2007-06-13 00:50:17 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
// If we removed all uses, nuke the cmp.
|
2015-04-11 06:25:36 +08:00
|
|
|
if (CI->use_empty()) {
|
2007-06-13 00:50:17 +08:00
|
|
|
CI->eraseFromParent();
|
2015-04-11 06:25:36 +08:00
|
|
|
MadeChange = true;
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-06-13 00:50:17 +08:00
|
|
|
return MadeChange;
|
|
|
|
}
|
|
|
|
|
2015-04-11 05:07:09 +08:00
|
|
|
static bool OptimizeCmpExpression(CmpInst *CI) {
|
|
|
|
if (SinkCmpExpression(CI))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (CombineUAddWithOverflow(CI))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Check if the candidates could be combined with a shift instruction, which
|
|
|
|
/// includes:
|
2014-04-22 03:34:27 +08:00
|
|
|
/// 1. Truncate instruction
|
|
|
|
/// 2. And instruction and the imm is a mask of the low bits:
|
|
|
|
/// imm & (imm+1) == 0
|
2014-04-27 22:54:59 +08:00
|
|
|
static bool isExtractBitsCandidateUse(Instruction *User) {
|
2014-04-22 03:34:27 +08:00
|
|
|
if (!isa<TruncInst>(User)) {
|
|
|
|
if (User->getOpcode() != Instruction::And ||
|
|
|
|
!isa<ConstantInt>(User->getOperand(1)))
|
|
|
|
return false;
|
|
|
|
|
2014-04-22 09:20:34 +08:00
|
|
|
const APInt &Cimm = cast<ConstantInt>(User->getOperand(1))->getValue();
|
2014-04-22 03:34:27 +08:00
|
|
|
|
2014-04-22 09:20:34 +08:00
|
|
|
if ((Cimm & (Cimm + 1)).getBoolValue())
|
2014-04-22 03:34:27 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
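// Illustrative sketch only, not part of this pass: the low-bit-mask test
// imm & (imm+1) == 0 on a plain 64-bit integer. isLowBitMask and
// checkLowBitMaskExamples are hypothetical helpers; the pass itself performs
// the check on an APInt.

```cpp
#include <cassert>
#include <cstdint>

// True when Imm has the form 0...01...1 (zero included). Adding 1 to such a
// value carries through the whole run of ones, so no set bit is shared
// between Imm and Imm + 1. Unsigned wraparound covers the all-ones value.
bool isLowBitMask(std::uint64_t Imm) {
  return (Imm & (Imm + 1)) == 0;
}

// A few spot checks that can be called from a test driver.
void checkLowBitMaskExamples() {
  assert(isLowBitMask(0xFF));  // 0b11111111
  assert(!isLowBitMask(0xF0)); // The ones do not start at bit 0.
  assert(!isLowBitMask(0x5));  // The ones are not contiguous.
}
```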
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Sink both shift and truncate instruction to the use of truncate's BB.
|
2014-04-27 22:54:59 +08:00
|
|
|
static bool
|
2014-04-22 03:34:27 +08:00
|
|
|
SinkShiftAndTruncate(BinaryOperator *ShiftI, Instruction *User, ConstantInt *CI,
|
|
|
|
DenseMap<BasicBlock *, BinaryOperator *> &InsertedShifts,
|
2015-07-09 10:09:04 +08:00
|
|
|
const TargetLowering &TLI, const DataLayout &DL) {
|
2014-04-22 03:34:27 +08:00
|
|
|
BasicBlock *UserBB = User->getParent();
|
|
|
|
DenseMap<BasicBlock *, CastInst *> InsertedTruncs;
|
|
|
|
TruncInst *TruncI = cast<TruncInst>(User);
|
|
|
|
bool MadeChange = false;
|
|
|
|
|
|
|
|
for (Value::user_iterator TruncUI = TruncI->user_begin(),
|
|
|
|
TruncE = TruncI->user_end();
|
|
|
|
TruncUI != TruncE;) {
|
|
|
|
|
|
|
|
Use &TruncTheUse = TruncUI.getUse();
|
|
|
|
Instruction *TruncUser = cast<Instruction>(*TruncUI);
|
|
|
|
// Preincrement use iterator so we don't invalidate it.
|
|
|
|
|
|
|
|
++TruncUI;
|
|
|
|
|
|
|
|
int ISDOpcode = TLI.InstructionOpcodeToISD(TruncUser->getOpcode());
|
|
|
|
if (!ISDOpcode)
|
|
|
|
continue;
|
|
|
|
|
2014-07-29 18:20:22 +08:00
|
|
|
// If the use is actually a legal node, there will not be an
|
|
|
|
// implicit truncate.
|
|
|
|
// FIXME: always querying the result type is just an
|
|
|
|
// approximation; some nodes' legality is determined by the
|
|
|
|
// operand or other means. There's no good way to find out though.
|
2014-11-13 06:16:55 +08:00
|
|
|
if (TLI.isOperationLegalOrCustom(
|
2015-07-09 10:09:04 +08:00
|
|
|
ISDOpcode, TLI.getValueType(DL, TruncUser->getType(), true)))
|
2014-04-22 03:34:27 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
// Don't bother for PHI nodes.
|
|
|
|
if (isa<PHINode>(TruncUser))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
BasicBlock *TruncUserBB = TruncUser->getParent();
|
|
|
|
|
|
|
|
if (UserBB == TruncUserBB)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
BinaryOperator *&InsertedShift = InsertedShifts[TruncUserBB];
|
|
|
|
CastInst *&InsertedTrunc = InsertedTruncs[TruncUserBB];
|
|
|
|
|
|
|
|
if (!InsertedShift && !InsertedTrunc) {
|
|
|
|
BasicBlock::iterator InsertPt = TruncUserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != TruncUserBB->end());
|
2014-04-22 03:34:27 +08:00
|
|
|
// Sink the shift
|
|
|
|
if (ShiftI->getOpcode() == Instruction::AShr)
|
2015-10-10 02:44:40 +08:00
|
|
|
InsertedShift = BinaryOperator::CreateAShr(ShiftI->getOperand(0), CI,
|
|
|
|
"", &*InsertPt);
|
2014-04-22 03:34:27 +08:00
|
|
|
else
|
2015-10-10 02:44:40 +08:00
|
|
|
InsertedShift = BinaryOperator::CreateLShr(ShiftI->getOperand(0), CI,
|
|
|
|
"", &*InsertPt);
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
// Sink the trunc
|
|
|
|
BasicBlock::iterator TruncInsertPt = TruncUserBB->getFirstInsertionPt();
|
|
|
|
TruncInsertPt++;
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(TruncInsertPt != TruncUserBB->end());
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
InsertedTrunc = CastInst::Create(TruncI->getOpcode(), InsertedShift,
|
2015-10-10 02:44:40 +08:00
|
|
|
TruncI->getType(), "", &*TruncInsertPt);
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
MadeChange = true;
|
|
|
|
|
|
|
|
TruncTheUse = InsertedTrunc;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return MadeChange;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Sink the shift *right* instruction into user blocks if the uses could
|
|
|
|
/// potentially be combined with this shift instruction and generate BitExtract
|
|
|
|
/// instruction. It will only be applied if the architecture supports BitExtract
|
|
|
|
/// instruction. Here is an example:
|
2014-04-22 03:34:27 +08:00
|
|
|
/// BB1:
|
|
|
|
/// %x.extract.shift = lshr i64 %arg1, 32
|
|
|
|
/// BB2:
|
|
|
|
/// %x.extract.trunc = trunc i64 %x.extract.shift to i16
|
|
|
|
/// ==>
|
|
|
|
///
|
|
|
|
/// BB2:
|
|
|
|
/// %x.extract.shift.1 = lshr i64 %arg1, 32
|
|
|
|
/// %x.extract.trunc = trunc i64 %x.extract.shift.1 to i16
|
|
|
|
///
|
|
|
|
/// CodeGen will recognize the pattern in BB2 and generate BitExtract
|
|
|
|
/// instruction.
|
|
|
|
/// Return true if any changes are made.
|
|
|
|
static bool OptimizeExtractBits(BinaryOperator *ShiftI, ConstantInt *CI,
|
2015-07-09 10:09:04 +08:00
|
|
|
const TargetLowering &TLI,
|
|
|
|
const DataLayout &DL) {
|
2014-04-22 03:34:27 +08:00
|
|
|
BasicBlock *DefBB = ShiftI->getParent();
|
|
|
|
|
|
|
|
/// Only insert instructions in each block once.
|
|
|
|
DenseMap<BasicBlock *, BinaryOperator *> InsertedShifts;
|
|
|
|
|
2015-07-09 10:09:04 +08:00
|
|
|
bool shiftIsLegal = TLI.isTypeLegal(TLI.getValueType(DL, ShiftI->getType()));
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
bool MadeChange = false;
|
|
|
|
for (Value::user_iterator UI = ShiftI->user_begin(), E = ShiftI->user_end();
|
|
|
|
UI != E;) {
|
|
|
|
Use &TheUse = UI.getUse();
|
|
|
|
Instruction *User = cast<Instruction>(*UI);
|
|
|
|
// Preincrement use iterator so we don't invalidate it.
|
|
|
|
++UI;
|
|
|
|
|
|
|
|
// Don't bother for PHI nodes.
|
|
|
|
if (isa<PHINode>(User))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!isExtractBitsCandidateUse(User))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
BasicBlock *UserBB = User->getParent();
|
|
|
|
|
|
|
|
if (UserBB == DefBB) {
|
|
|
|
      // If the shift and truncate instructions are in the same BB, the use of
|
|
|
|
// the truncate(TruncUse) may still introduce another truncate if not
|
|
|
|
// legal. In this case, we would like to sink both shift and truncate
|
|
|
|
// instruction to the BB of TruncUse.
|
|
|
|
// for example:
|
|
|
|
// BB1:
|
|
|
|
// i64 shift.result = lshr i64 opnd, imm
|
|
|
|
// trunc.result = trunc shift.result to i16
|
|
|
|
//
|
|
|
|
// BB2:
|
|
|
|
// ----> We will have an implicit truncate here if the architecture does
|
|
|
|
// not have i16 compare.
|
|
|
|
// cmp i16 trunc.result, opnd2
|
|
|
|
//
|
|
|
|
if (isa<TruncInst>(User) && shiftIsLegal
|
|
|
|
          // If the type of the truncate is legal, no truncate will be
|
|
|
|
// introduced in other basic blocks.
|
2015-07-09 10:09:04 +08:00
|
|
|
&&
|
|
|
|
(!TLI.isTypeLegal(TLI.getValueType(DL, User->getType()))))
|
2014-04-22 03:34:27 +08:00
|
|
|
        MadeChange |=
|
2015-07-09 10:09:04 +08:00
|
|
|
SinkShiftAndTruncate(ShiftI, User, CI, InsertedShifts, TLI, DL);
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
// If we have already inserted a shift into this block, use it.
|
|
|
|
BinaryOperator *&InsertedShift = InsertedShifts[UserBB];
|
|
|
|
|
|
|
|
if (!InsertedShift) {
|
|
|
|
BasicBlock::iterator InsertPt = UserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != UserBB->end());
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
if (ShiftI->getOpcode() == Instruction::AShr)
|
2015-10-10 02:44:40 +08:00
|
|
|
InsertedShift = BinaryOperator::CreateAShr(ShiftI->getOperand(0), CI,
|
|
|
|
"", &*InsertPt);
|
2014-04-22 03:34:27 +08:00
|
|
|
else
|
2015-10-10 02:44:40 +08:00
|
|
|
InsertedShift = BinaryOperator::CreateLShr(ShiftI->getOperand(0), CI,
|
|
|
|
"", &*InsertPt);
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Replace a use of the shift with a use of the new shift.
|
|
|
|
TheUse = InsertedShift;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If we removed all uses, nuke the shift.
|
|
|
|
if (ShiftI->use_empty())
|
|
|
|
ShiftI->eraseFromParent();
|
|
|
|
|
|
|
|
return MadeChange;
|
|
|
|
}
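// Illustration only, not part of the pass: a minimal standalone sketch of why
// the shift is sunk next to its truncate. A logical shift right followed by a
// truncate selects one contiguous bit field, which is the pattern a target
// with a BitExtract instruction can match when both sit in the same block.
// The namespace and helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
// Models `lshr i64 %arg1, Shift` followed by `trunc i64 ... to i16`.
inline uint16_t shiftThenTrunc(uint64_t Arg, unsigned Shift) {
  assert(Shift < 64 && "shift amount must be smaller than the bit width");
  return static_cast<uint16_t>(Arg >> Shift);
}

// The same value written as one bit-field extract: take Width bits starting
// at bit Lsb. This is a software model, not a specific target instruction.
inline uint64_t bitExtract(uint64_t Arg, unsigned Lsb, unsigned Width) {
  assert(Lsb < 64 && Width > 0 && Width < 64 && "field must fit in 64 bits");
  return (Arg >> Lsb) & ((uint64_t{1} << Width) - 1);
}
} // namespace sketch
```

// With Shift = 32 and a 16-bit truncate, both helpers select bits [32, 48).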
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
// Translate a masked load intrinsic like
|
2014-12-28 16:54:45 +08:00
|
|
|
// <16 x i32> @llvm.masked.load(<16 x i32>* %addr, i32 align,
|
|
|
|
// <16 x i1> %mask, <16 x i32> %passthru)
|
2015-08-09 02:27:36 +08:00
|
|
|
// to a chain of basic blocks that load the elements one by one if
|
2014-12-28 16:54:45 +08:00
|
|
|
// the appropriate mask bit is set
|
|
|
|
//
|
|
|
|
// %1 = bitcast i8* %addr to i32*
|
|
|
|
// %2 = extractelement <16 x i1> %mask, i32 0
|
|
|
|
// %3 = icmp eq i1 %2, true
|
|
|
|
// br i1 %3, label %cond.load, label %else
|
|
|
|
//
|
|
|
|
//cond.load: ; preds = %0
|
|
|
|
// %4 = getelementptr i32* %1, i32 0
|
|
|
|
// %5 = load i32* %4
|
|
|
|
// %6 = insertelement <16 x i32> undef, i32 %5, i32 0
|
|
|
|
// br label %else
|
|
|
|
//
|
|
|
|
//else: ; preds = %0, %cond.load
|
|
|
|
// %res.phi.else = phi <16 x i32> [ %6, %cond.load ], [ undef, %0 ]
|
|
|
|
// %7 = extractelement <16 x i1> %mask, i32 1
|
|
|
|
// %8 = icmp eq i1 %7, true
|
|
|
|
// br i1 %8, label %cond.load1, label %else2
|
|
|
|
//
|
|
|
|
//cond.load1: ; preds = %else
|
|
|
|
// %9 = getelementptr i32* %1, i32 1
|
|
|
|
// %10 = load i32* %9
|
|
|
|
// %11 = insertelement <16 x i32> %res.phi.else, i32 %10, i32 1
|
|
|
|
// br label %else2
|
|
|
|
//
|
|
|
|
//else2: ; preds = %else, %cond.load1
|
|
|
|
// %res.phi.else3 = phi <16 x i32> [ %11, %cond.load1 ], [ %res.phi.else, %else ]
|
|
|
|
// %12 = extractelement <16 x i1> %mask, i32 2
|
|
|
|
// %13 = icmp eq i1 %12, true
|
|
|
|
// br i1 %13, label %cond.load4, label %else5
|
|
|
|
//
|
|
|
|
static void ScalarizeMaskedLoad(CallInst *CI) {
|
|
|
|
Value *Ptr = CI->getArgOperand(0);
|
2015-10-21 19:50:54 +08:00
|
|
|
Value *Alignment = CI->getArgOperand(1);
|
2014-12-28 16:54:45 +08:00
|
|
|
Value *Mask = CI->getArgOperand(2);
|
2015-10-21 19:50:54 +08:00
|
|
|
Value *Src0 = CI->getArgOperand(3);
|
2014-12-28 16:54:45 +08:00
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
|
|
|
|
VectorType *VecType = dyn_cast<VectorType>(CI->getType());
|
2014-12-28 16:54:45 +08:00
|
|
|
assert(VecType && "Unexpected return type of masked load intrinsic");
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
Type *EltTy = CI->getType()->getVectorElementType();
|
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
IRBuilder<> Builder(CI->getContext());
|
|
|
|
Instruction *InsertPt = CI;
|
|
|
|
BasicBlock *IfBlock = CI->getParent();
|
|
|
|
BasicBlock *CondBlock = nullptr;
|
|
|
|
BasicBlock *PrevIfBlock = CI->getParent();
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
Builder.SetInsertPoint(InsertPt);
|
2014-12-28 16:54:45 +08:00
|
|
|
Builder.SetCurrentDebugLocation(CI->getDebugLoc());
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
// Short-cut if the mask is all-true.
|
|
|
|
bool IsAllOnesMask = isa<Constant>(Mask) &&
|
|
|
|
cast<Constant>(Mask)->isAllOnesValue();
|
|
|
|
|
|
|
|
if (IsAllOnesMask) {
|
|
|
|
Value *NewI = Builder.CreateAlignedLoad(Ptr, AlignVal);
|
|
|
|
CI->replaceAllUsesWith(NewI);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Adjust alignment for the scalar instruction.
|
|
|
|
AlignVal = std::min(AlignVal, VecType->getScalarSizeInBits()/8);
|
2014-12-28 16:54:45 +08:00
|
|
|
  // Bitcast %addr from i8* to EltTy*
|
|
|
|
Type *NewPtrType =
|
|
|
|
EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());
|
|
|
|
Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);
|
2015-10-21 19:50:54 +08:00
|
|
|
unsigned VectorWidth = VecType->getNumElements();
|
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
Value *UndefVal = UndefValue::get(VecType);
|
|
|
|
|
|
|
|
// The result vector
|
|
|
|
Value *VResult = UndefVal;
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
if (isa<ConstantVector>(Mask)) {
|
|
|
|
for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
|
|
|
|
if (cast<ConstantVector>(Mask)->getOperand(Idx)->isNullValue())
|
|
|
|
continue;
|
|
|
|
Value *Gep =
|
|
|
|
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
|
|
|
|
LoadInst* Load = Builder.CreateAlignedLoad(Gep, AlignVal);
|
|
|
|
VResult = Builder.CreateInsertElement(VResult, Load,
|
|
|
|
Builder.getInt32(Idx));
|
|
|
|
}
|
|
|
|
Value *NewI = Builder.CreateSelect(Mask, VResult, Src0);
|
|
|
|
CI->replaceAllUsesWith(NewI);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
PHINode *Phi = nullptr;
|
|
|
|
Value *PrevPhi = UndefVal;
|
|
|
|
|
|
|
|
for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
|
|
|
|
|
|
|
|
// Fill the "else" block, created in the previous iteration
|
|
|
|
//
|
|
|
|
// %res.phi.else3 = phi <16 x i32> [ %11, %cond.load1 ], [ %res.phi.else, %else ]
|
|
|
|
// %mask_1 = extractelement <16 x i1> %mask, i32 Idx
|
|
|
|
// %to_load = icmp eq i1 %mask_1, true
|
|
|
|
// br i1 %to_load, label %cond.load, label %else
|
|
|
|
//
|
|
|
|
if (Idx > 0) {
|
|
|
|
Phi = Builder.CreatePHI(VecType, 2, "res.phi.else");
|
|
|
|
Phi->addIncoming(VResult, CondBlock);
|
|
|
|
Phi->addIncoming(PrevPhi, PrevIfBlock);
|
|
|
|
PrevPhi = Phi;
|
|
|
|
VResult = Phi;
|
|
|
|
}
|
|
|
|
|
|
|
|
Value *Predicate = Builder.CreateExtractElement(Mask, Builder.getInt32(Idx));
|
|
|
|
Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
|
|
|
|
ConstantInt::get(Predicate->getType(), 1));
|
|
|
|
|
|
|
|
// Create "cond" block
|
|
|
|
//
|
|
|
|
    // %EltAddr = getelementptr i32* %1, i32 Idx
|
|
|
|
// %Elt = load i32* %EltAddr
|
|
|
|
// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx
|
|
|
|
//
|
2015-10-10 02:44:40 +08:00
|
|
|
CondBlock = IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.load");
|
2014-12-28 16:54:45 +08:00
|
|
|
Builder.SetInsertPoint(InsertPt);
|
2015-04-04 05:33:42 +08:00
|
|
|
|
|
|
|
Value *Gep =
|
|
|
|
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
|
2015-10-21 19:50:54 +08:00
|
|
|
LoadInst* Load = Builder.CreateAlignedLoad(Gep, AlignVal);
|
2014-12-28 16:54:45 +08:00
|
|
|
VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx));
|
|
|
|
|
|
|
|
// Create "else" block, fill it in the next iteration
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *NewIfBlock =
|
|
|
|
CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");
|
2014-12-28 16:54:45 +08:00
|
|
|
Builder.SetInsertPoint(InsertPt);
|
|
|
|
Instruction *OldBr = IfBlock->getTerminator();
|
|
|
|
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
|
|
|
|
OldBr->eraseFromParent();
|
|
|
|
PrevIfBlock = IfBlock;
|
|
|
|
IfBlock = NewIfBlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
Phi = Builder.CreatePHI(VecType, 2, "res.phi.select");
|
|
|
|
Phi->addIncoming(VResult, CondBlock);
|
|
|
|
Phi->addIncoming(PrevPhi, PrevIfBlock);
|
|
|
|
Value *NewI = Builder.CreateSelect(Mask, Phi, Src0);
|
|
|
|
CI->replaceAllUsesWith(NewI);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
}
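// Illustration only, not part of the pass: a minimal scalar model of what the
// block chain built by ScalarizeMaskedLoad computes. Each lane is loaded only
// when its mask bit is set; the other lanes take the pass-through operand and
// their memory is never read. The names below are hypothetical, and `int`
// stands in for the vector element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {
inline std::vector<int> maskedLoad(const int *Addr,
                                   const std::vector<bool> &Mask,
                                   const std::vector<int> &PassThru) {
  assert(Mask.size() == PassThru.size() && "one mask bit per lane");
  std::vector<int> Result(PassThru);
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I])           // corresponds to the branch into "cond.load"
      Result[I] = Addr[I]; // scalar load followed by insertelement
  return Result;
}
} // namespace sketch
```

// The pass reaches the same result in two steps: it first builds the loaded
// lanes on top of undef and then selects between that vector and the
// pass-through operand using the mask.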
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
// Translate a masked store intrinsic, like
|
2014-12-28 16:54:45 +08:00
|
|
|
// void @llvm.masked.store(<16 x i32> %src, <16 x i32>* %addr, i32 align,
|
|
|
|
// <16 x i1> %mask)
|
|
|
|
// to a chain of basic blocks that store the elements one by one if
|
|
|
|
// the appropriate mask bit is set
|
|
|
|
//
|
|
|
|
// %1 = bitcast i8* %addr to i32*
|
|
|
|
// %2 = extractelement <16 x i1> %mask, i32 0
|
|
|
|
// %3 = icmp eq i1 %2, true
|
|
|
|
// br i1 %3, label %cond.store, label %else
|
|
|
|
//
|
|
|
|
// cond.store: ; preds = %0
|
|
|
|
// %4 = extractelement <16 x i32> %src, i32 0
|
|
|
|
// %5 = getelementptr i32* %1, i32 0
|
|
|
|
// store i32 %4, i32* %5
|
|
|
|
// br label %else
|
|
|
|
//
|
|
|
|
// else: ; preds = %0, %cond.store
|
|
|
|
// %6 = extractelement <16 x i1> %mask, i32 1
|
|
|
|
// %7 = icmp eq i1 %6, true
|
|
|
|
// br i1 %7, label %cond.store1, label %else2
|
|
|
|
//
|
|
|
|
// cond.store1: ; preds = %else
|
|
|
|
// %8 = extractelement <16 x i32> %src, i32 1
|
|
|
|
// %9 = getelementptr i32* %1, i32 1
|
|
|
|
// store i32 %8, i32* %9
|
|
|
|
// br label %else2
|
|
|
|
// . . .
|
|
|
|
static void ScalarizeMaskedStore(CallInst *CI) {
|
|
|
|
Value *Src = CI->getArgOperand(0);
|
2015-10-21 19:50:54 +08:00
|
|
|
Value *Ptr = CI->getArgOperand(1);
|
|
|
|
Value *Alignment = CI->getArgOperand(2);
|
2014-12-28 16:54:45 +08:00
|
|
|
Value *Mask = CI->getArgOperand(3);
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
|
2014-12-28 16:54:45 +08:00
|
|
|
VectorType *VecType = dyn_cast<VectorType>(Src->getType());
|
|
|
|
assert(VecType && "Unexpected data type in masked store intrinsic");
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
Type *EltTy = VecType->getElementType();
|
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
IRBuilder<> Builder(CI->getContext());
|
|
|
|
Instruction *InsertPt = CI;
|
|
|
|
BasicBlock *IfBlock = CI->getParent();
|
|
|
|
Builder.SetInsertPoint(InsertPt);
|
|
|
|
Builder.SetCurrentDebugLocation(CI->getDebugLoc());
|
|
|
|
|
2015-10-21 19:50:54 +08:00
|
|
|
// Short-cut if the mask is all-true.
|
|
|
|
bool IsAllOnesMask = isa<Constant>(Mask) &&
|
|
|
|
cast<Constant>(Mask)->isAllOnesValue();
|
|
|
|
|
|
|
|
if (IsAllOnesMask) {
|
|
|
|
Builder.CreateAlignedStore(Src, Ptr, AlignVal);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Adjust alignment for the scalar instruction.
|
|
|
|
  AlignVal = std::min(AlignVal, VecType->getScalarSizeInBits()/8);
|
2014-12-28 16:54:45 +08:00
|
|
|
  // Bitcast %addr from i8* to EltTy*
|
|
|
|
Type *NewPtrType =
|
|
|
|
EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());
|
|
|
|
Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);
|
|
|
|
unsigned VectorWidth = VecType->getNumElements();
|
2015-10-21 19:50:54 +08:00
|
|
|
|
|
|
|
if (isa<ConstantVector>(Mask)) {
|
|
|
|
for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
|
|
|
|
if (cast<ConstantVector>(Mask)->getOperand(Idx)->isNullValue())
|
|
|
|
continue;
|
|
|
|
Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx));
|
|
|
|
Value *Gep =
|
|
|
|
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
|
|
|
|
Builder.CreateAlignedStore(OneElt, Gep, AlignVal);
|
|
|
|
}
|
|
|
|
CI->eraseFromParent();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
|
|
|
|
|
|
|
|
// Fill the "else" block, created in the previous iteration
|
|
|
|
//
|
|
|
|
// %mask_1 = extractelement <16 x i1> %mask, i32 Idx
|
|
|
|
// %to_store = icmp eq i1 %mask_1, true
|
2015-10-21 19:50:54 +08:00
|
|
|
// br i1 %to_store, label %cond.store, label %else
|
2014-12-28 16:54:45 +08:00
|
|
|
//
|
|
|
|
Value *Predicate = Builder.CreateExtractElement(Mask, Builder.getInt32(Idx));
|
|
|
|
Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
|
|
|
|
ConstantInt::get(Predicate->getType(), 1));
|
|
|
|
|
|
|
|
// Create "cond" block
|
|
|
|
//
|
|
|
|
// %OneElt = extractelement <16 x i32> %Src, i32 Idx
|
|
|
|
    // %EltAddr = getelementptr i32* %1, i32 Idx
|
|
|
|
    // store i32 %OneElt, i32* %EltAddr
|
|
|
|
//
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *CondBlock =
|
|
|
|
IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.store");
|
2014-12-28 16:54:45 +08:00
|
|
|
Builder.SetInsertPoint(InsertPt);
|
2015-10-10 02:44:40 +08:00
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx));
|
2015-04-04 05:33:42 +08:00
|
|
|
Value *Gep =
|
|
|
|
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
|
2015-10-21 19:50:54 +08:00
|
|
|
Builder.CreateAlignedStore(OneElt, Gep, AlignVal);
|
2014-12-28 16:54:45 +08:00
|
|
|
|
|
|
|
// Create "else" block, fill it in the next iteration
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *NewIfBlock =
|
|
|
|
CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");
|
2014-12-28 16:54:45 +08:00
|
|
|
Builder.SetInsertPoint(InsertPt);
|
|
|
|
Instruction *OldBr = IfBlock->getTerminator();
|
|
|
|
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
|
|
|
|
OldBr->eraseFromParent();
|
|
|
|
IfBlock = NewIfBlock;
|
|
|
|
}
|
|
|
|
CI->eraseFromParent();
|
|
|
|
}
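// Illustration only, not part of the pass: a minimal scalar model of the
// semantics that ScalarizeMaskedStore preserves. A lane is written only when
// its mask bit is set; memory behind a cleared mask bit is left untouched.
// The names below are hypothetical, and `int` stands in for the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {
inline void maskedStore(const std::vector<int> &Src, int *Addr,
                        const std::vector<bool> &Mask) {
  assert(Src.size() == Mask.size() && "one mask bit per lane");
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I])        // corresponds to the branch into "cond.store"
      Addr[I] = Src[I]; // extractelement followed by a scalar store
}
} // namespace sketch
```

// Unlike the load case, no PHI nodes or final select are needed, because a
// store produces no value to merge across the conditional blocks.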
|
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeCallInst(CallInst *CI, bool& ModifiedDT) {
|
2011-01-15 15:14:54 +08:00
|
|
|
BasicBlock *BB = CI->getParent();
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:14:54 +08:00
|
|
|
// Lower inline assembly if we can.
|
|
|
|
  // If we found an inline asm expression, and if the target knows how to
|
|
|
|
// lower it to normal LLVM code, do so now.
|
|
|
|
if (TLI && isa<InlineAsm>(CI->getCalledValue())) {
|
|
|
|
if (TLI->ExpandInlineAsm(CI)) {
|
|
|
|
// Avoid invalidating the iterator.
|
|
|
|
CurInstIterator = BB->begin();
|
|
|
|
// Avoid processing instructions out of order, which could cause
|
|
|
|
// reuse before a value is defined.
|
|
|
|
SunkAddrs.clear();
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
// Sink address computing for memory operands into the block.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (optimizeInlineAsmInst(CI))
|
2011-01-15 15:14:54 +08:00
|
|
|
return true;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2015-03-18 20:01:59 +08:00
|
|
|
// Align the pointer arguments to this call if the target thinks it's a good
|
|
|
|
// idea
|
|
|
|
unsigned MinSize, PrefAlign;
|
2015-07-08 02:45:17 +08:00
|
|
|
if (TLI && TLI->shouldAlignPointerArgs(CI, MinSize, PrefAlign)) {
|
2015-03-18 20:01:59 +08:00
|
|
|
for (auto &Arg : CI->arg_operands()) {
|
|
|
|
// We want to align both objects whose address is used directly and
|
|
|
|
// objects whose address is used in casts and GEPs, though it only makes
|
|
|
|
// sense for GEPs if the offset is a multiple of the desired alignment and
|
|
|
|
// if size - offset meets the size threshold.
|
|
|
|
if (!Arg->getType()->isPointerTy())
|
|
|
|
continue;
|
2015-07-08 02:45:17 +08:00
|
|
|
APInt Offset(DL->getPointerSizeInBits(
|
|
|
|
cast<PointerType>(Arg->getType())->getAddressSpace()),
|
|
|
|
0);
|
|
|
|
Value *Val = Arg->stripAndAccumulateInBoundsConstantOffsets(*DL, Offset);
|
2015-03-18 20:01:59 +08:00
|
|
|
uint64_t Offset2 = Offset.getLimitedValue();
|
2015-04-13 18:47:39 +08:00
|
|
|
if ((Offset2 & (PrefAlign-1)) != 0)
|
|
|
|
continue;
|
2015-03-18 20:01:59 +08:00
|
|
|
AllocaInst *AI;
|
2015-07-08 02:45:17 +08:00
|
|
|
if ((AI = dyn_cast<AllocaInst>(Val)) && AI->getAlignment() < PrefAlign &&
|
|
|
|
DL->getTypeAllocSize(AI->getAllocatedType()) >= MinSize + Offset2)
|
2015-03-18 20:01:59 +08:00
|
|
|
AI->setAlignment(PrefAlign);
|
2015-04-13 18:47:39 +08:00
|
|
|
// Global variables can only be aligned if they are defined in this
|
|
|
|
// object (i.e. they are uniquely initialized in this object), and
|
|
|
|
// over-aligning global variables that have an explicit section is
|
|
|
|
// forbidden.
|
|
|
|
GlobalVariable *GV;
|
2015-07-08 02:45:17 +08:00
|
|
|
if ((GV = dyn_cast<GlobalVariable>(Val)) && GV->hasUniqueInitializer() &&
|
|
|
|
!GV->hasSection() && GV->getAlignment() < PrefAlign &&
|
|
|
|
DL->getTypeAllocSize(GV->getType()->getElementType()) >=
|
|
|
|
MinSize + Offset2)
|
2015-04-13 18:47:39 +08:00
|
|
|
GV->setAlignment(PrefAlign);
|
2015-03-18 20:01:59 +08:00
|
|
|
}
|
|
|
|
// If this is a memcpy (or similar) then we may be able to improve the
|
|
|
|
// alignment
|
|
|
|
if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(CI)) {
|
2015-07-08 02:45:17 +08:00
|
|
|
unsigned Align = getKnownAlignment(MI->getDest(), *DL);
|
2015-03-18 20:01:59 +08:00
|
|
|
if (MemTransferInst *MTI = dyn_cast<MemTransferInst>(MI))
|
2015-07-08 02:45:17 +08:00
|
|
|
Align = std::min(Align, getKnownAlignment(MTI->getSource(), *DL));
|
2015-03-18 20:01:59 +08:00
|
|
|
if (Align > MI->getAlignment())
|
|
|
|
MI->setAlignment(ConstantInt::get(MI->getAlignmentType(), Align));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-03-11 10:41:03 +08:00
|
|
|
IntrinsicInst *II = dyn_cast<IntrinsicInst>(CI);
|
2014-12-28 16:54:45 +08:00
|
|
|
if (II) {
|
|
|
|
switch (II->getIntrinsicID()) {
|
|
|
|
default: break;
|
|
|
|
case Intrinsic::objectsize: {
|
|
|
|
// Lower all uses of llvm.objectsize.*
|
|
|
|
bool Min = (cast<ConstantInt>(II->getArgOperand(1))->getZExtValue() == 1);
|
|
|
|
Type *ReturnTy = CI->getType();
|
|
|
|
Constant *RetVal = ConstantInt::get(ReturnTy, Min ? 0 : -1ULL);
|
|
|
|
|
|
|
|
// Substituting this can cause recursive simplifications, which can
|
|
|
|
// invalidate our iterator. Use a WeakVH to hold onto it in case this
|
|
|
|
// happens.
|
2015-10-10 02:44:40 +08:00
|
|
|
WeakVH IterHandle(&*CurInstIterator);
|
2014-12-28 16:54:45 +08:00
|
|
|
|
|
|
|
replaceAndRecursivelySimplify(CI, RetVal,
|
2015-03-19 07:17:28 +08:00
|
|
|
TLInfo, nullptr);
|
2011-01-15 15:25:29 +08:00
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
// If the iterator instruction was recursively deleted, start over at the
|
|
|
|
// start of the block.
|
2015-10-10 02:44:40 +08:00
|
|
|
if (IterHandle != CurInstIterator.getNodePtrUnchecked()) {
|
2014-12-28 16:54:45 +08:00
|
|
|
CurInstIterator = BB->begin();
|
|
|
|
SunkAddrs.clear();
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
case Intrinsic::masked_load: {
|
|
|
|
// Scalarize unsupported vector masked load
|
2015-10-19 15:43:38 +08:00
|
|
|
if (!TTI->isLegalMaskedLoad(CI->getType())) {
|
2014-12-28 16:54:45 +08:00
|
|
|
ScalarizeMaskedLoad(CI);
|
|
|
|
ModifiedDT = true;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
case Intrinsic::masked_store: {
|
2015-10-19 15:43:38 +08:00
|
|
|
if (!TTI->isLegalMaskedStore(CI->getArgOperand(0)->getType())) {
|
2014-12-28 16:54:45 +08:00
|
|
|
ScalarizeMaskedStore(CI);
|
|
|
|
ModifiedDT = true;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
2015-05-23 05:37:17 +08:00
|
|
|
case Intrinsic::aarch64_stlxr:
|
|
|
|
case Intrinsic::aarch64_stxr: {
|
|
|
|
ZExtInst *ExtVal = dyn_cast<ZExtInst>(CI->getArgOperand(0));
|
|
|
|
if (!ExtVal || !ExtVal->hasOneUse() ||
|
|
|
|
ExtVal->getParent() == CI->getParent())
|
|
|
|
return false;
|
|
|
|
// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.
|
|
|
|
ExtVal->moveBefore(CI);
|
2015-06-18 04:44:32 +08:00
|
|
|
// Mark this instruction as "inserted by CGP", so that other
|
|
|
|
// optimizations don't touch it.
|
|
|
|
InsertedInsts.insert(ExtVal);
|
2015-05-23 05:37:17 +08:00
|
|
|
return true;
|
|
|
|
}
|
2015-09-16 02:32:14 +08:00
|
|
|
case Intrinsic::invariant_group_barrier:
|
|
|
|
II->replaceAllUsesWith(II->getArgOperand(0));
|
|
|
|
II->eraseFromParent();
|
|
|
|
return true;
|
2011-01-19 04:53:04 +08:00
|
|
|
}
|
2010-03-11 10:41:03 +08:00
|
|
|
|
2014-12-28 16:54:45 +08:00
|
|
|
if (TLI) {
|
2015-06-05 00:17:38 +08:00
|
|
|
// Unknown address space.
|
|
|
|
// TODO: Target hook to pick which address space the intrinsic cares
|
|
|
|
// about?
|
|
|
|
unsigned AddrSpace = ~0u;
|
2014-12-28 16:54:45 +08:00
|
|
|
SmallVector<Value*, 2> PtrOps;
|
|
|
|
Type *AccessTy;
|
2015-06-05 00:17:38 +08:00
|
|
|
if (TLI->GetAddrModeArguments(II, PtrOps, AccessTy, AddrSpace))
|
2014-12-28 16:54:45 +08:00
|
|
|
while (!PtrOps.empty())
|
2015-09-22 07:03:16 +08:00
|
|
|
if (optimizeMemoryInst(II, PtrOps.pop_back_val(), AccessTy, AddrSpace))
|
2014-12-28 16:54:45 +08:00
|
|
|
return true;
|
|
|
|
}
|
2012-03-14 04:59:56 +08:00
|
|
|
}
|
|
|
|
|
2010-03-11 10:41:03 +08:00
|
|
|
// From here on out we're working with named functions.
|
2014-04-14 08:51:57 +08:00
|
|
|
if (!CI->getCalledFunction()) return false;
|
2011-05-27 05:51:06 +08:00
|
|
|
|
2010-03-12 17:27:41 +08:00
|
|
|
// Lower all default uses of _chk calls. This is very similar
|
|
|
|
// to what InstCombineCalls does, but here we are only lowering calls
|
2015-01-13 01:22:43 +08:00
|
|
|
// to fortified library functions (e.g. __memcpy_chk) that have the default
|
|
|
|
// "don't know" as the objectsize. Anything else should be left alone.
|
2015-03-10 10:37:25 +08:00
|
|
|
FortifiedLibCallSimplifier Simplifier(TLInfo, true);
|
2015-01-13 01:22:43 +08:00
|
|
|
if (Value *V = Simplifier.optimizeCall(CI)) {
|
|
|
|
CI->replaceAllUsesWith(V);
|
|
|
|
CI->eraseFromParent();
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
2010-03-11 10:41:03 +08:00
|
|
|
}
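// Illustration only, not part of the pass: a minimal sketch of the two
// arithmetic tests optimizeCallInst applies before raising the alignment of an
// alloca or global behind a pointer argument. It assumes the preferred
// alignment is a power of two, which is what the `Offset2 & (PrefAlign-1)`
// test relies on. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
// True when Offset is a multiple of Align. For a power-of-two Align,
// Align - 1 is a mask of exactly the low bits that must be zero.
inline bool isOffsetAligned(uint64_t Offset, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Offset & (Align - 1)) == 0;
}

// True when at least MinSize bytes of the object remain past Offset, that is,
// when the accessed tail of the object is large enough to be worth aligning.
inline bool meetsSizeThreshold(uint64_t AllocSize, uint64_t Offset,
                               uint64_t MinSize) {
  return AllocSize >= MinSize + Offset;
}
} // namespace sketch
```

// For example, with a preferred alignment of 16, an offset of 32 passes the
// first test and an offset of 24 does not.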
|
2011-01-15 15:25:29 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Look for opportunities to duplicate return instructions to the predecessor
|
|
|
|
/// to enable tail call optimizations. The case it is currently looking for is:
|
2012-09-13 20:34:29 +08:00
|
|
|
/// @code
|
2011-03-21 09:19:09 +08:00
|
|
|
/// bb0:
|
|
|
|
/// %tmp0 = tail call i32 @f0()
|
|
|
|
/// br label %return
|
|
|
|
/// bb1:
|
|
|
|
/// %tmp1 = tail call i32 @f1()
|
|
|
|
/// br label %return
|
|
|
|
/// bb2:
|
|
|
|
/// %tmp2 = tail call i32 @f2()
|
|
|
|
/// br label %return
|
|
|
|
/// return:
|
|
|
|
/// %retval = phi i32 [ %tmp0, %bb0 ], [ %tmp1, %bb1 ], [ %tmp2, %bb2 ]
|
|
|
|
/// ret i32 %retval
|
2012-09-13 20:34:29 +08:00
|
|
|
/// @endcode
|
2011-03-21 09:19:09 +08:00
|
|
|
///
|
|
|
|
/// =>
|
|
|
|
///
|
2012-09-13 20:34:29 +08:00
|
|
|
/// @code
|
2011-03-21 09:19:09 +08:00
|
|
|
/// bb0:
|
|
|
|
/// %tmp0 = tail call i32 @f0()
|
|
|
|
/// ret i32 %tmp0
|
|
|
|
/// bb1:
|
|
|
|
/// %tmp1 = tail call i32 @f1()
|
|
|
|
/// ret i32 %tmp1
|
|
|
|
/// bb2:
|
|
|
|
/// %tmp2 = tail call i32 @f2()
|
|
|
|
/// ret i32 %tmp2
|
2012-09-13 20:34:29 +08:00
|
|
|
/// @endcode
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::dupRetToEnableTailCallOpts(BasicBlock *BB) {
|
2011-03-24 12:51:51 +08:00
|
|
|
if (!TLI)
|
|
|
|
return false;
|
|
|
|
|
2012-11-24 03:17:06 +08:00
|
|
|
ReturnInst *RI = dyn_cast<ReturnInst>(BB->getTerminator());
|
|
|
|
if (!RI)
|
|
|
|
return false;
|
|
|
|
|
2014-04-14 08:51:57 +08:00
|
|
|
PHINode *PN = nullptr;
|
|
|
|
BitCastInst *BCI = nullptr;
|
2011-03-21 09:19:09 +08:00
|
|
|
Value *V = RI->getReturnValue();
|
2012-07-28 05:21:26 +08:00
|
|
|
if (V) {
|
|
|
|
BCI = dyn_cast<BitCastInst>(V);
|
|
|
|
if (BCI)
|
|
|
|
V = BCI->getOperand(0);
|
|
|
|
|
|
|
|
PN = dyn_cast<PHINode>(V);
|
|
|
|
if (!PN)
|
|
|
|
return false;
|
|
|
|
}
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:10 +08:00
|
|
|
if (PN && PN->getParent() != BB)
|
2011-03-24 12:52:07 +08:00
|
|
|
return false;
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:07 +08:00
|
|
|
// It's not safe to eliminate the sign / zero extension of the return value.
|
|
|
|
// See llvm::isInTailCallPosition().
|
|
|
|
const Function *F = BB->getParent();
|
2013-01-19 05:53:16 +08:00
|
|
|
AttributeSet CallerAttrs = F->getAttributes();
|
|
|
|
if (CallerAttrs.hasAttribute(AttributeSet::ReturnIndex, Attribute::ZExt) ||
|
|
|
|
CallerAttrs.hasAttribute(AttributeSet::ReturnIndex, Attribute::SExt))
|
2011-03-24 12:52:07 +08:00
|
|
|
return false;
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:10 +08:00
|
|
|
// Make sure there are no instructions between the PHI and return, or that the
|
|
|
|
// return is the first instruction in the block.
|
|
|
|
if (PN) {
|
|
|
|
BasicBlock::iterator BI = BB->begin();
|
|
|
|
do { ++BI; } while (isa<DbgInfoIntrinsic>(BI));
|
2012-07-28 05:21:26 +08:00
|
|
|
if (&*BI == BCI)
|
|
|
|
// Also skip over the bitcast.
|
|
|
|
++BI;
|
2011-03-24 12:52:10 +08:00
|
|
|
if (&*BI != RI)
|
|
|
|
return false;
|
|
|
|
} else {
|
2011-03-25 00:34:59 +08:00
|
|
|
BasicBlock::iterator BI = BB->begin();
|
|
|
|
while (isa<DbgInfoIntrinsic>(BI)) ++BI;
|
|
|
|
if (&*BI != RI)
|
2011-03-24 12:52:10 +08:00
|
|
|
return false;
|
|
|
|
}
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:07 +08:00
|
|
|
/// Only dup the ReturnInst if the CallInst is likely to be emitted as a tail
|
|
|
|
/// call.
|
|
|
|
SmallVector<CallInst*, 4> TailCalls;
|
2011-03-24 12:52:10 +08:00
|
|
|
if (PN) {
|
|
|
|
for (unsigned I = 0, E = PN->getNumIncomingValues(); I != E; ++I) {
|
|
|
|
CallInst *CI = dyn_cast<CallInst>(PN->getIncomingValue(I));
|
|
|
|
// Make sure the phi value is indeed produced by the tail call.
|
|
|
|
if (CI && CI->hasOneUse() && CI->getParent() == PN->getIncomingBlock(I) &&
|
|
|
|
TLI->mayBeEmittedAsTailCall(CI))
|
|
|
|
TailCalls.push_back(CI);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
SmallPtrSet<BasicBlock*, 4> VisitedBBs;
|
2014-07-22 01:06:51 +08:00
|
|
|
for (pred_iterator PI = pred_begin(BB), PE = pred_end(BB); PI != PE; ++PI) {
|
2014-11-19 15:49:26 +08:00
|
|
|
if (!VisitedBBs.insert(*PI).second)
|
2011-03-24 12:52:10 +08:00
|
|
|
continue;
|
|
|
|
|
2014-07-22 01:06:51 +08:00
|
|
|
BasicBlock::InstListType &InstList = (*PI)->getInstList();
|
2011-03-24 12:52:10 +08:00
|
|
|
BasicBlock::InstListType::reverse_iterator RI = InstList.rbegin();
|
|
|
|
BasicBlock::InstListType::reverse_iterator RE = InstList.rend();
|
2011-03-25 00:34:59 +08:00
|
|
|
do { ++RI; } while (RI != RE && isa<DbgInfoIntrinsic>(&*RI));
|
|
|
|
if (RI == RE)
|
2011-03-24 12:52:10 +08:00
|
|
|
continue;
|
2011-03-25 00:34:59 +08:00
|
|
|
|
2011-03-24 12:52:10 +08:00
|
|
|
CallInst *CI = dyn_cast<CallInst>(&*RI);
|
2011-03-24 23:54:11 +08:00
|
|
|
if (CI && CI->use_empty() && TLI->mayBeEmittedAsTailCall(CI))
|
2011-03-24 12:52:10 +08:00
|
|
|
TailCalls.push_back(CI);
|
|
|
|
}
|
2011-03-24 12:52:07 +08:00
|
|
|
}
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:07 +08:00
|
|
|
bool Changed = false;
|
|
|
|
for (unsigned i = 0, e = TailCalls.size(); i != e; ++i) {
|
|
|
|
CallInst *CI = TailCalls[i];
|
|
|
|
CallSite CS(CI);
|
|
|
|
|
|
|
|
// Conservatively require the attributes of the call to match those of the
|
|
|
|
// return. Ignore noalias because it doesn't affect the call sequence.
|
2013-01-19 05:53:16 +08:00
|
|
|
AttributeSet CalleeAttrs = CS.getAttributes();
|
|
|
|
if (AttrBuilder(CalleeAttrs, AttributeSet::ReturnIndex).
|
2012-12-19 15:18:57 +08:00
|
|
|
removeAttribute(Attribute::NoAlias) !=
|
2013-01-19 05:53:16 +08:00
|
|
|
        AttrBuilder(CallerAttrs, AttributeSet::ReturnIndex).
|
2012-12-19 15:18:57 +08:00
|
|
|
removeAttribute(Attribute::NoAlias))
|
2011-03-24 12:52:07 +08:00
|
|
|
continue;
|
2011-03-21 09:19:09 +08:00
|
|
|
|
2011-03-24 12:52:07 +08:00
|
|
|
// Make sure the call instruction is followed by an unconditional branch to
|
|
|
|
// the return block.
|
|
|
|
BasicBlock *CallBB = CI->getParent();
|
|
|
|
BranchInst *BI = dyn_cast<BranchInst>(CallBB->getTerminator());
|
|
|
|
if (!BI || !BI->isUnconditional() || BI->getSuccessor(0) != BB)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
// Duplicate the return into CallBB.
|
|
|
|
(void)FoldReturnIntoUncondBranch(RI, BB, CallBB);
|
2011-03-24 23:35:25 +08:00
|
|
|
ModifiedDT = Changed = true;
|
2011-03-24 12:52:07 +08:00
|
|
|
++NumRetsDup;
|
2011-03-21 09:19:09 +08:00
|
|
|
}
|
|
|
|
|
2011-03-24 12:52:07 +08:00
|
|
|
// If we eliminated all predecessors of the block, delete the block now.
|
2012-09-29 07:58:57 +08:00
|
|
|
if (Changed && !BB->hasAddressTaken() && pred_begin(BB) == pred_end(BB))
|
2011-03-24 12:52:07 +08:00
|
|
|
BB->eraseFromParent();
|
|
|
|
|
|
|
|
return Changed;
|
2011-03-21 09:19:09 +08:00
|
|
|
}
|
|
|
|
|
2008-11-25 15:09:13 +08:00
|
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
// Memory Optimization
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
namespace {
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// This is an extended version of TargetLowering::AddrMode
|
2013-01-05 10:09:22 +08:00
|
|
|
/// which holds actual Value*'s for register values.
|
2013-01-07 23:14:13 +08:00
|
|
|
struct ExtAddrMode : public TargetLowering::AddrMode {
|
2013-01-05 10:09:22 +08:00
|
|
|
Value *BaseReg;
|
|
|
|
Value *ScaledReg;
|
2014-04-14 08:51:57 +08:00
|
|
|
ExtAddrMode() : BaseReg(nullptr), ScaledReg(nullptr) {}
|
2013-01-05 10:09:22 +08:00
|
|
|
void print(raw_ostream &OS) const;
|
|
|
|
void dump() const;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
bool operator==(const ExtAddrMode& O) const {
|
|
|
|
return (BaseReg == O.BaseReg) && (ScaledReg == O.ScaledReg) &&
|
|
|
|
(BaseGV == O.BaseGV) && (BaseOffs == O.BaseOffs) &&
|
|
|
|
(HasBaseReg == O.HasBaseReg) && (Scale == O.Scale);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
2013-09-11 07:09:24 +08:00
|
|
|
#ifndef NDEBUG
|
|
|
|
static inline raw_ostream &operator<<(raw_ostream &OS, const ExtAddrMode &AM) {
|
|
|
|
AM.print(OS);
|
|
|
|
return OS;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
void ExtAddrMode::print(raw_ostream &OS) const {
|
|
|
|
bool NeedPlus = false;
|
|
|
|
OS << "[";
|
|
|
|
if (BaseGV) {
|
|
|
|
OS << (NeedPlus ? " + " : "")
|
|
|
|
<< "GV:";
|
2014-01-09 10:29:41 +08:00
|
|
|
BaseGV->printAsOperand(OS, /*PrintType=*/false);
|
2013-01-05 10:09:22 +08:00
|
|
|
NeedPlus = true;
|
|
|
|
}
|
|
|
|
|
2014-05-30 11:15:17 +08:00
|
|
|
if (BaseOffs) {
|
|
|
|
OS << (NeedPlus ? " + " : "")
|
|
|
|
<< BaseOffs;
|
|
|
|
NeedPlus = true;
|
|
|
|
}
|
2013-01-05 10:09:22 +08:00
|
|
|
|
|
|
|
if (BaseReg) {
|
|
|
|
OS << (NeedPlus ? " + " : "")
|
|
|
|
<< "Base:";
|
2014-01-09 10:29:41 +08:00
|
|
|
BaseReg->printAsOperand(OS, /*PrintType=*/false);
|
2013-01-05 10:09:22 +08:00
|
|
|
NeedPlus = true;
|
|
|
|
}
|
|
|
|
if (Scale) {
|
|
|
|
OS << (NeedPlus ? " + " : "")
|
|
|
|
<< Scale << "*";
|
2014-01-09 10:29:41 +08:00
|
|
|
ScaledReg->printAsOperand(OS, /*PrintType=*/false);
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
OS << ']';
|
|
|
|
}
|
|
|
|
|
|
|
|
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
|
|
|
|
void ExtAddrMode::dump() const {
|
|
|
|
print(dbgs());
|
|
|
|
dbgs() << '\n';
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2014-02-07 05:44:56 +08:00
|
|
|
/// \brief This class provides transaction-based operations on the IR.
|
|
|
|
/// Every change made through this class is recorded in the internal state and
|
|
|
|
/// can be undone (rollback) until commit is called.
|
|
|
|
class TypePromotionTransaction {
|
|
|
|
|
|
|
|
/// \brief This represents the common interface of the individual transaction.
|
|
|
|
/// Each class implements the logic for doing one specific modification on
|
|
|
|
/// the IR via the TypePromotionTransaction.
|
|
|
|
class TypePromotionAction {
|
|
|
|
protected:
|
|
|
|
/// The Instruction modified.
|
|
|
|
Instruction *Inst;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Constructor of the action.
|
|
|
|
/// The constructor performs the related action on the IR.
|
|
|
|
TypePromotionAction(Instruction *Inst) : Inst(Inst) {}
|
|
|
|
|
|
|
|
virtual ~TypePromotionAction() {}
|
|
|
|
|
|
|
|
/// \brief Undo the modification done by this action.
|
|
|
|
/// When this method returns, the IR is in the same state as it was
|
|
|
|
/// before this action was applied.
|
|
|
|
/// \pre Undoing the action works if and only if the IR is in the exact same
|
|
|
|
/// state as it was directly after this action was applied.
|
|
|
|
virtual void undo() = 0;
|
|
|
|
|
|
|
|
/// \brief Commit every change made by this action.
|
|
|
|
/// When the results on the IR of the action are to be kept, it is important
|
|
|
|
/// to call this function, otherwise hidden information may be kept forever.
|
|
|
|
virtual void commit() {
|
|
|
|
// Nothing to be done by default: most actions keep no hidden state.
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Utility to remember the position of an instruction.
|
|
|
|
class InsertionHandler {
|
|
|
|
/// Position of an instruction.
|
|
|
|
/// Either an instruction:
|
|
|
|
/// - Is the first in a basic block: BB is used.
|
|
|
|
/// - Has a previous instruction: PrevInst is used.
|
|
|
|
union {
|
|
|
|
Instruction *PrevInst;
|
|
|
|
BasicBlock *BB;
|
|
|
|
} Point;
|
|
|
|
/// Remember whether or not the instruction had a previous instruction.
|
|
|
|
bool HasPrevInstruction;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Record the position of \p Inst.
|
|
|
|
InsertionHandler(Instruction *Inst) {
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock::iterator It = Inst->getIterator();
|
2014-02-07 05:44:56 +08:00
|
|
|
HasPrevInstruction = (It != (Inst->getParent()->begin()));
|
|
|
|
if (HasPrevInstruction)
|
2015-10-10 02:44:40 +08:00
|
|
|
Point.PrevInst = &*--It;
|
2014-02-07 05:44:56 +08:00
|
|
|
else
|
|
|
|
Point.BB = Inst->getParent();
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Insert \p Inst at the recorded position.
|
|
|
|
void insert(Instruction *Inst) {
|
|
|
|
if (HasPrevInstruction) {
|
|
|
|
if (Inst->getParent())
|
|
|
|
Inst->removeFromParent();
|
|
|
|
Inst->insertAfter(Point.PrevInst);
|
|
|
|
} else {
|
2015-10-10 02:44:40 +08:00
|
|
|
Instruction *Position = &*Point.BB->getFirstInsertionPt();
|
2014-02-07 05:44:56 +08:00
|
|
|
if (Inst->getParent())
|
|
|
|
Inst->moveBefore(Position);
|
|
|
|
else
|
|
|
|
Inst->insertBefore(Position);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Move an instruction before another.
|
|
|
|
class InstructionMoveBefore : public TypePromotionAction {
|
|
|
|
/// Original position of the instruction.
|
|
|
|
InsertionHandler Position;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Move \p Inst before \p Before.
|
|
|
|
InstructionMoveBefore(Instruction *Inst, Instruction *Before)
|
|
|
|
: TypePromotionAction(Inst), Position(Inst) {
|
|
|
|
DEBUG(dbgs() << "Do: move: " << *Inst << "\nbefore: " << *Before << "\n");
|
|
|
|
Inst->moveBefore(Before);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Move the instruction back to its original position.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: moveBefore: " << *Inst << "\n");
|
|
|
|
Position.insert(Inst);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Set the operand of an instruction with a new value.
|
|
|
|
class OperandSetter : public TypePromotionAction {
|
|
|
|
/// Original operand of the instruction.
|
|
|
|
Value *Origin;
|
|
|
|
/// Index of the modified operand.
|
|
|
|
unsigned Idx;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Set \p Idx operand of \p Inst with \p NewVal.
|
|
|
|
OperandSetter(Instruction *Inst, unsigned Idx, Value *NewVal)
|
|
|
|
: TypePromotionAction(Inst), Idx(Idx) {
|
|
|
|
DEBUG(dbgs() << "Do: setOperand: " << Idx << "\n"
|
|
|
|
<< "for:" << *Inst << "\n"
|
|
|
|
<< "with:" << *NewVal << "\n");
|
|
|
|
Origin = Inst->getOperand(Idx);
|
|
|
|
Inst->setOperand(Idx, NewVal);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Restore the original value of the instruction.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: setOperand:" << Idx << "\n"
|
|
|
|
<< "for: " << *Inst << "\n"
|
|
|
|
<< "with: " << *Origin << "\n");
|
|
|
|
Inst->setOperand(Idx, Origin);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Hide the operands of an instruction.
|
|
|
|
/// Act as if this instruction was not using any of its operands.
|
|
|
|
class OperandsHider : public TypePromotionAction {
|
|
|
|
/// The list of original operands.
|
|
|
|
SmallVector<Value *, 4> OriginalValues;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Remove \p Inst from the uses of the operands of \p Inst.
|
|
|
|
OperandsHider(Instruction *Inst) : TypePromotionAction(Inst) {
|
|
|
|
DEBUG(dbgs() << "Do: OperandsHider: " << *Inst << "\n");
|
|
|
|
unsigned NumOpnds = Inst->getNumOperands();
|
|
|
|
OriginalValues.reserve(NumOpnds);
|
|
|
|
for (unsigned It = 0; It < NumOpnds; ++It) {
|
|
|
|
// Save the current operand.
|
|
|
|
Value *Val = Inst->getOperand(It);
|
|
|
|
OriginalValues.push_back(Val);
|
|
|
|
// Set a dummy one.
|
2015-10-10 02:01:03 +08:00
|
|
|
// We could use OperandSetter here, but that would imply an overhead
|
2014-02-07 05:44:56 +08:00
|
|
|
// that we are not willing to pay.
|
|
|
|
Inst->setOperand(It, UndefValue::get(Val->getType()));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Restore the original list of uses.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: OperandsHider: " << *Inst << "\n");
|
|
|
|
for (unsigned It = 0, EndIt = OriginalValues.size(); It != EndIt; ++It)
|
|
|
|
Inst->setOperand(It, OriginalValues[It]);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Build a truncate instruction.
|
|
|
|
class TruncBuilder : public TypePromotionAction {
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val;
|
2014-02-07 05:44:56 +08:00
|
|
|
public:
|
|
|
|
/// \brief Build a truncate instruction of \p Opnd producing a \p Ty
|
|
|
|
/// result.
|
|
|
|
/// trunc Opnd to Ty.
|
|
|
|
TruncBuilder(Instruction *Opnd, Type *Ty) : TypePromotionAction(Opnd) {
|
|
|
|
IRBuilder<> Builder(Opnd);
|
2014-09-17 06:36:07 +08:00
|
|
|
Val = Builder.CreateTrunc(Opnd, Ty, "promoted");
|
|
|
|
DEBUG(dbgs() << "Do: TruncBuilder: " << *Val << "\n");
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
/// \brief Get the built value.
|
|
|
|
Value *getBuiltValue() { return Val; }
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
/// \brief Remove the built instruction.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-09-17 06:36:07 +08:00
|
|
|
DEBUG(dbgs() << "Undo: TruncBuilder: " << *Val << "\n");
|
|
|
|
if (Instruction *IVal = dyn_cast<Instruction>(Val))
|
|
|
|
IVal->eraseFromParent();
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Build a sign extension instruction.
|
|
|
|
class SExtBuilder : public TypePromotionAction {
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val;
|
2014-02-07 05:44:56 +08:00
|
|
|
public:
|
|
|
|
/// \brief Build a sign extension instruction of \p Opnd producing a \p Ty
|
|
|
|
/// result.
|
|
|
|
/// sext Opnd to Ty.
|
|
|
|
SExtBuilder(Instruction *InsertPt, Value *Opnd, Type *Ty)
|
2014-09-17 06:36:07 +08:00
|
|
|
: TypePromotionAction(InsertPt) {
|
2014-02-07 05:44:56 +08:00
|
|
|
IRBuilder<> Builder(InsertPt);
|
2014-09-17 06:36:07 +08:00
|
|
|
Val = Builder.CreateSExt(Opnd, Ty, "promoted");
|
|
|
|
DEBUG(dbgs() << "Do: SExtBuilder: " << *Val << "\n");
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
/// \brief Get the built value.
|
|
|
|
Value *getBuiltValue() { return Val; }
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
/// \brief Remove the built instruction.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-09-17 06:36:07 +08:00
|
|
|
DEBUG(dbgs() << "Undo: SExtBuilder: " << *Val << "\n");
|
|
|
|
if (Instruction *IVal = dyn_cast<Instruction>(Val))
|
|
|
|
IVal->eraseFromParent();
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
};
|
|
|
|
|
2014-09-12 05:22:14 +08:00
|
|
|
/// \brief Build a zero extension instruction.
|
|
|
|
class ZExtBuilder : public TypePromotionAction {
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val;
|
2014-09-12 05:22:14 +08:00
|
|
|
public:
|
|
|
|
/// \brief Build a zero extension instruction of \p Opnd producing a \p Ty
|
|
|
|
/// result.
|
|
|
|
/// zext Opnd to Ty.
|
|
|
|
ZExtBuilder(Instruction *InsertPt, Value *Opnd, Type *Ty)
|
2014-09-17 06:36:07 +08:00
|
|
|
: TypePromotionAction(InsertPt) {
|
2014-09-12 05:22:14 +08:00
|
|
|
IRBuilder<> Builder(InsertPt);
|
2014-09-17 06:36:07 +08:00
|
|
|
Val = Builder.CreateZExt(Opnd, Ty, "promoted");
|
|
|
|
DEBUG(dbgs() << "Do: ZExtBuilder: " << *Val << "\n");
|
2014-09-12 05:22:14 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
/// \brief Get the built value.
|
|
|
|
Value *getBuiltValue() { return Val; }
|
2014-09-12 05:22:14 +08:00
|
|
|
|
|
|
|
/// \brief Remove the built instruction.
|
|
|
|
void undo() override {
|
2014-09-17 06:36:07 +08:00
|
|
|
DEBUG(dbgs() << "Undo: ZExtBuilder: " << *Val << "\n");
|
|
|
|
if (Instruction *IVal = dyn_cast<Instruction>(Val))
|
|
|
|
IVal->eraseFromParent();
|
2014-09-12 05:22:14 +08:00
|
|
|
}
|
|
|
|
};
|
|
|
|
|
2014-02-07 05:44:56 +08:00
|
|
|
/// \brief Mutate an instruction to another type.
|
|
|
|
class TypeMutator : public TypePromotionAction {
|
|
|
|
/// Record the original type.
|
|
|
|
Type *OrigTy;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Mutate the type of \p Inst into \p NewTy.
|
|
|
|
TypeMutator(Instruction *Inst, Type *NewTy)
|
|
|
|
: TypePromotionAction(Inst), OrigTy(Inst->getType()) {
|
|
|
|
DEBUG(dbgs() << "Do: MutateType: " << *Inst << " with " << *NewTy
|
|
|
|
<< "\n");
|
|
|
|
Inst->mutateType(NewTy);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Mutate the instruction back to its original type.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: MutateType: " << *Inst << " with " << *OrigTy
|
|
|
|
<< "\n");
|
|
|
|
Inst->mutateType(OrigTy);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Replace the uses of an instruction by another instruction.
|
|
|
|
class UsesReplacer : public TypePromotionAction {
|
|
|
|
/// Helper structure to keep track of the replaced uses.
|
|
|
|
struct InstructionAndIdx {
|
|
|
|
/// The user, i.e., an instruction that has the replaced instruction as operand.
|
|
|
|
Instruction *Inst;
|
|
|
|
/// The operand index at which Inst uses the replaced instruction.
|
|
|
|
unsigned Idx;
|
|
|
|
InstructionAndIdx(Instruction *Inst, unsigned Idx)
|
|
|
|
: Inst(Inst), Idx(Idx) {}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// Keep track of the original uses (pair Instruction, Index).
|
|
|
|
SmallVector<InstructionAndIdx, 4> OriginalUses;
|
|
|
|
typedef SmallVectorImpl<InstructionAndIdx>::iterator use_iterator;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Replace all the uses of \p Inst by \p New.
|
|
|
|
UsesReplacer(Instruction *Inst, Value *New) : TypePromotionAction(Inst) {
|
|
|
|
DEBUG(dbgs() << "Do: UsersReplacer: " << *Inst << " with " << *New
|
|
|
|
<< "\n");
|
|
|
|
// Record the original uses.
|
2014-03-09 11:16:01 +08:00
|
|
|
for (Use &U : Inst->uses()) {
|
|
|
|
Instruction *UserI = cast<Instruction>(U.getUser());
|
|
|
|
OriginalUses.push_back(InstructionAndIdx(UserI, U.getOperandNo()));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
// Now, we can replace the uses.
|
|
|
|
Inst->replaceAllUsesWith(New);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Reassign the original uses of Inst to Inst.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: UsersReplacer: " << *Inst << "\n");
|
|
|
|
for (use_iterator UseIt = OriginalUses.begin(),
|
|
|
|
EndIt = OriginalUses.end();
|
|
|
|
UseIt != EndIt; ++UseIt) {
|
|
|
|
UseIt->Inst->setOperand(UseIt->Idx, Inst);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
/// \brief Remove an instruction from the IR.
|
|
|
|
class InstructionRemover : public TypePromotionAction {
|
|
|
|
/// Original position of the instruction.
|
|
|
|
InsertionHandler Inserter;
|
|
|
|
/// Helper structure to hide all the links to the instruction. In other
|
|
|
|
/// words, this helps to act as if the instruction was removed.
|
|
|
|
OperandsHider Hider;
|
|
|
|
/// Keep track of the uses replaced, if any.
|
|
|
|
UsesReplacer *Replacer;
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// \brief Remove all references to \p Inst and optionally replace all its
|
|
|
|
/// uses with New.
|
2014-04-14 08:51:57 +08:00
|
|
|
/// \pre If !Inst->use_empty(), then New != nullptr
|
|
|
|
InstructionRemover(Instruction *Inst, Value *New = nullptr)
|
2014-02-07 05:44:56 +08:00
|
|
|
: TypePromotionAction(Inst), Inserter(Inst), Hider(Inst),
|
2014-04-14 08:51:57 +08:00
|
|
|
Replacer(nullptr) {
|
2014-02-07 05:44:56 +08:00
|
|
|
if (New)
|
|
|
|
Replacer = new UsesReplacer(Inst, New);
|
|
|
|
DEBUG(dbgs() << "Do: InstructionRemover: " << *Inst << "\n");
|
|
|
|
Inst->removeFromParent();
|
|
|
|
}
|
|
|
|
|
2015-04-11 10:11:45 +08:00
|
|
|
~InstructionRemover() override { delete Replacer; }
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
/// \brief Really remove the instruction.
|
2014-03-07 17:26:03 +08:00
|
|
|
void commit() override { delete Inst; }
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
/// \brief Resurrect the instruction and reassign it to the proper uses if
|
|
|
|
/// a new value was provided when this action was built.
|
2014-03-07 17:26:03 +08:00
|
|
|
void undo() override {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "Undo: InstructionRemover: " << *Inst << "\n");
|
|
|
|
Inserter.insert(Inst);
|
|
|
|
if (Replacer)
|
|
|
|
Replacer->undo();
|
|
|
|
Hider.undo();
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
public:
|
|
|
|
/// Restoration point.
|
|
|
|
/// The restoration point is a pointer to an action instead of an iterator
|
|
|
|
/// because the iterator may be invalidated but not the pointer.
|
|
|
|
typedef const TypePromotionAction *ConstRestorationPt;
|
|
|
|
/// Commit every change made in this transaction.
|
|
|
|
void commit();
|
|
|
|
/// Undo all the changes made after the given point.
|
|
|
|
void rollback(ConstRestorationPt Point);
|
|
|
|
/// Get the current restoration point.
|
|
|
|
ConstRestorationPt getRestorationPoint() const;
|
|
|
|
|
|
|
|
/// \name API for IR modification with state keeping to support rollback.
|
|
|
|
/// @{
|
|
|
|
/// Same as Instruction::setOperand.
|
|
|
|
void setOperand(Instruction *Inst, unsigned Idx, Value *NewVal);
|
|
|
|
/// Same as Instruction::eraseFromParent.
|
2014-04-14 08:51:57 +08:00
|
|
|
void eraseInstruction(Instruction *Inst, Value *NewVal = nullptr);
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Same as Value::replaceAllUsesWith.
|
|
|
|
void replaceAllUsesWith(Instruction *Inst, Value *New);
|
|
|
|
/// Same as Value::mutateType.
|
|
|
|
void mutateType(Instruction *Inst, Type *NewTy);
|
|
|
|
/// Same as IRBuilder::CreateTrunc.
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *createTrunc(Instruction *Opnd, Type *Ty);
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Same as IRBuilder::CreateSExt.
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *createSExt(Instruction *Inst, Value *Opnd, Type *Ty);
|
2014-09-12 05:22:14 +08:00
|
|
|
/// Same as IRBuilder::CreateZExt.
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *createZExt(Instruction *Inst, Value *Opnd, Type *Ty);
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Same as Instruction::moveBefore.
|
|
|
|
void moveBefore(Instruction *Inst, Instruction *Before);
|
|
|
|
/// @}
|
|
|
|
|
|
|
|
private:
|
|
|
|
/// The ordered list of actions made so far.
|
2014-04-15 14:17:44 +08:00
|
|
|
SmallVector<std::unique_ptr<TypePromotionAction>, 16> Actions;
|
|
|
|
typedef SmallVectorImpl<std::unique_ptr<TypePromotionAction>>::iterator CommitPt;
|
2014-02-07 05:44:56 +08:00
|
|
|
};
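// Illustrative sketch only, not part of this pass: the idea behind
// InsertionHandler (remember either the predecessor or "first in block", then
// reinsert there) shown on a std::list<int>. PositionMemo is a hypothetical
// name. Unlike the real handler, the "first" case below reinserts at the very
// front rather than at getFirstInsertionPt(), since a plain list has no PHI
// nodes to skip.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical analogue of InsertionHandler for a std::list<int>.
class PositionMemo {
  std::list<int> &Block;
  // True when the recorded element had a predecessor in Block.
  bool HasPrev;
  // Iterator to that predecessor; std::list iterators stay valid when other
  // elements are erased, which is what makes the later reinsertion safe.
  std::list<int>::iterator PrevIt;

public:
  // Record the position of the element designated by It.
  PositionMemo(std::list<int> &B, std::list<int>::iterator It)
      : Block(B), HasPrev(It != B.begin()), PrevIt(B.end()) {
    if (HasPrev)
      PrevIt = std::prev(It);
  }

  // Put Elt back at the recorded position.
  void reinsert(int Elt) {
    if (HasPrev)
      Block.insert(std::next(PrevIt), Elt);
    else
      Block.push_front(Elt);
  }
};
```

// The memo is only meaningful while the recorded predecessor is still in the
// list, which mirrors the precondition stated on TypePromotionAction::undo.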
|
|
|
|
|
|
|
|
void TypePromotionTransaction::setOperand(Instruction *Inst, unsigned Idx,
|
|
|
|
Value *NewVal) {
|
|
|
|
Actions.push_back(
|
2014-04-15 14:17:44 +08:00
|
|
|
make_unique<TypePromotionTransaction::OperandSetter>(Inst, Idx, NewVal));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void TypePromotionTransaction::eraseInstruction(Instruction *Inst,
|
|
|
|
Value *NewVal) {
|
|
|
|
Actions.push_back(
|
2014-04-15 14:17:44 +08:00
|
|
|
make_unique<TypePromotionTransaction::InstructionRemover>(Inst, NewVal));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void TypePromotionTransaction::replaceAllUsesWith(Instruction *Inst,
|
|
|
|
Value *New) {
|
2014-04-15 14:17:44 +08:00
|
|
|
Actions.push_back(make_unique<TypePromotionTransaction::UsesReplacer>(Inst, New));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void TypePromotionTransaction::mutateType(Instruction *Inst, Type *NewTy) {
|
2014-04-15 14:17:44 +08:00
|
|
|
Actions.push_back(make_unique<TypePromotionTransaction::TypeMutator>(Inst, NewTy));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *TypePromotionTransaction::createTrunc(Instruction *Opnd,
|
|
|
|
Type *Ty) {
|
2014-04-15 14:17:44 +08:00
|
|
|
std::unique_ptr<TruncBuilder> Ptr(new TruncBuilder(Opnd, Ty));
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val = Ptr->getBuiltValue();
|
2014-04-15 14:17:44 +08:00
|
|
|
Actions.push_back(std::move(Ptr));
|
2014-09-17 06:36:07 +08:00
|
|
|
return Val;
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *TypePromotionTransaction::createSExt(Instruction *Inst,
|
|
|
|
Value *Opnd, Type *Ty) {
|
2014-04-15 14:17:44 +08:00
|
|
|
std::unique_ptr<SExtBuilder> Ptr(new SExtBuilder(Inst, Opnd, Ty));
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val = Ptr->getBuiltValue();
|
2014-04-15 14:17:44 +08:00
|
|
|
Actions.push_back(std::move(Ptr));
|
2014-09-17 06:36:07 +08:00
|
|
|
return Val;
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *TypePromotionTransaction::createZExt(Instruction *Inst,
|
|
|
|
Value *Opnd, Type *Ty) {
|
2014-09-12 05:22:14 +08:00
|
|
|
std::unique_ptr<ZExtBuilder> Ptr(new ZExtBuilder(Inst, Opnd, Ty));
|
2014-09-17 06:36:07 +08:00
|
|
|
Value *Val = Ptr->getBuiltValue();
|
2014-09-12 05:22:14 +08:00
|
|
|
Actions.push_back(std::move(Ptr));
|
2014-09-17 06:36:07 +08:00
|
|
|
return Val;
|
2014-09-12 05:22:14 +08:00
|
|
|
}
|
|
|
|
|
2014-02-07 05:44:56 +08:00
|
|
|
void TypePromotionTransaction::moveBefore(Instruction *Inst,
|
|
|
|
Instruction *Before) {
|
|
|
|
Actions.push_back(
|
2014-04-15 14:17:44 +08:00
|
|
|
make_unique<TypePromotionTransaction::InstructionMoveBefore>(Inst, Before));
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
TypePromotionTransaction::ConstRestorationPt
|
|
|
|
TypePromotionTransaction::getRestorationPoint() const {
|
2014-04-15 14:17:44 +08:00
|
|
|
return !Actions.empty() ? Actions.back().get() : nullptr;
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void TypePromotionTransaction::commit() {
|
|
|
|
for (CommitPt It = Actions.begin(), EndIt = Actions.end(); It != EndIt;
|
2014-04-15 14:17:44 +08:00
|
|
|
++It)
|
2014-02-07 05:44:56 +08:00
|
|
|
(*It)->commit();
|
|
|
|
Actions.clear();
|
|
|
|
}
|
|
|
|
|
|
|
|
void TypePromotionTransaction::rollback(
|
|
|
|
TypePromotionTransaction::ConstRestorationPt Point) {
|
2014-04-15 14:17:44 +08:00
|
|
|
while (!Actions.empty() && Point != Actions.back().get()) {
|
|
|
|
std::unique_ptr<TypePromotionAction> Curr = Actions.pop_back_val();
|
2014-02-07 05:44:56 +08:00
|
|
|
Curr->undo();
|
|
|
|
}
|
|
|
|
}
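// Illustrative sketch only, not part of this pass: the commit/rollback
// protocol of TypePromotionTransaction reduced to edits on a std::vector<int>.
// MiniTransaction, SetSlot and set() are hypothetical names. As in the class
// above, each action applies its change in its constructor, the restoration
// point is a raw pointer to the last recorded action (nullptr when there is
// none), and rollback undoes actions in reverse order until that pointer is
// on top again.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

class MiniTransaction {
  // Common interface of the recorded actions.
  struct Action {
    virtual ~Action() = default;
    virtual void undo() = 0;
  };

  // Overwrite one slot and remember the previous value.
  struct SetSlot : Action {
    std::vector<int> &Data;
    std::size_t Idx;
    int Old;
    SetSlot(std::vector<int> &D, std::size_t I, int New)
        : Data(D), Idx(I), Old(D[I]) {
      Data[Idx] = New; // the constructor performs the change
    }
    void undo() override { Data[Idx] = Old; }
  };

  // The ordered list of actions made so far.
  std::vector<std::unique_ptr<Action>> Actions;

public:
  typedef const Action *RestorationPt;

  RestorationPt getRestorationPoint() const {
    return Actions.empty() ? nullptr : Actions.back().get();
  }

  void set(std::vector<int> &Data, std::size_t Idx, int New) {
    Actions.push_back(std::unique_ptr<Action>(new SetSlot(Data, Idx, New)));
  }

  // Undo every action recorded after Point, most recent first.
  void rollback(RestorationPt Point) {
    while (!Actions.empty() && Point != Actions.back().get()) {
      std::unique_ptr<Action> Curr = std::move(Actions.back());
      Actions.pop_back();
      Curr->undo();
    }
  }

  // Keep the changes and forget how to undo them.
  void commit() { Actions.clear(); }
};
```

// A pointer is used as the restoration point for the same reason given in the
// class comment: growing the action list may invalidate iterators, but not the
// addresses of the heap-allocated actions.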
|
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
/// \brief A helper class for matching addressing modes.
|
|
|
|
///
|
|
|
|
/// This encapsulates the logic for matching the target-legal addressing modes.
|
|
|
|
class AddressingModeMatcher {
|
|
|
|
SmallVectorImpl<Instruction*> &AddrModeInsts;
|
2015-02-27 06:38:34 +08:00
|
|
|
const TargetMachine &TM;
|
2013-01-05 10:09:22 +08:00
|
|
|
const TargetLowering &TLI;
|
2015-07-08 02:45:17 +08:00
|
|
|
const DataLayout &DL;
|
2013-01-05 10:09:22 +08:00
|
|
|
|
|
|
|
/// AccessTy/MemoryInst - This is the type for the access (e.g. double) and
|
|
|
|
/// the memory instruction that we're computing this address for.
|
|
|
|
Type *AccessTy;
|
2015-06-05 00:17:38 +08:00
|
|
|
unsigned AddrSpace;
|
2013-01-05 10:09:22 +08:00
|
|
|
Instruction *MemoryInst;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// This is the addressing mode that we're building up. This is
|
2013-01-05 10:09:22 +08:00
|
|
|
/// part of the return value of this addressing mode matching stuff.
|
|
|
|
ExtAddrMode &AddrMode;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-06-18 04:44:32 +08:00
|
|
|
/// The instructions inserted by other CodeGenPrepare optimizations.
|
|
|
|
const SetOfInstrs &InsertedInsts;
|
2014-02-07 05:44:56 +08:00
|
|
|
/// A map from the instructions to their type before promotion.
|
|
|
|
InstrToOrigTy &PromotedInsts;
|
|
|
|
/// The ongoing transaction where every action should be registered.
|
|
|
|
TypePromotionTransaction &TPT;
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// This is set to true when we should not do profitability checks.
|
|
|
|
/// When true, IsProfitableToFoldIntoAddressingMode always returns true.
|
2013-01-05 10:09:22 +08:00
|
|
|
bool IgnoreProfitability;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-02-27 06:38:34 +08:00
|
|
|
AddressingModeMatcher(SmallVectorImpl<Instruction *> &AMI,
|
2015-06-05 00:17:38 +08:00
|
|
|
const TargetMachine &TM, Type *AT, unsigned AS,
|
|
|
|
Instruction *MI, ExtAddrMode &AM,
|
2015-06-18 04:44:32 +08:00
|
|
|
const SetOfInstrs &InsertedInsts,
|
2014-02-07 05:44:56 +08:00
|
|
|
InstrToOrigTy &PromotedInsts,
|
|
|
|
TypePromotionTransaction &TPT)
|
2015-02-27 06:38:34 +08:00
|
|
|
: AddrModeInsts(AMI), TM(TM),
|
|
|
|
TLI(*TM.getSubtargetImpl(*MI->getParent()->getParent())
|
|
|
|
->getTargetLowering()),
|
2015-07-08 02:45:17 +08:00
|
|
|
DL(MI->getModule()->getDataLayout()), AccessTy(AT), AddrSpace(AS),
|
|
|
|
MemoryInst(MI), AddrMode(AM), InsertedInsts(InsertedInsts),
|
|
|
|
PromotedInsts(PromotedInsts), TPT(TPT) {
|
2013-01-05 10:09:22 +08:00
|
|
|
IgnoreProfitability = false;
|
|
|
|
}
|
|
|
|
public:
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Find the maximal addressing mode that a load/store of V can fold,
|
2013-01-05 10:09:22 +08:00
|
|
|
/// given an access type of AccessTy. This returns a list of involved
|
|
|
|
/// instructions in AddrModeInsts.
|
2015-06-18 04:44:32 +08:00
|
|
|
/// \p InsertedInsts The instructions inserted by other CodeGenPrepare
|
2014-02-07 05:44:56 +08:00
|
|
|
/// optimizations.
|
|
|
|
/// \p PromotedInsts maps the instructions to their type before promotion.
|
|
|
|
/// \p TPT The ongoing transaction where every action should be registered.
|
2015-06-05 00:17:38 +08:00
|
|
|
static ExtAddrMode Match(Value *V, Type *AccessTy, unsigned AS,
|
2013-01-05 10:09:22 +08:00
|
|
|
Instruction *MemoryInst,
|
|
|
|
SmallVectorImpl<Instruction*> &AddrModeInsts,
|
2015-02-27 06:38:34 +08:00
|
|
|
const TargetMachine &TM,
|
2015-06-18 04:44:32 +08:00
|
|
|
const SetOfInstrs &InsertedInsts,
|
2014-02-07 05:44:56 +08:00
|
|
|
InstrToOrigTy &PromotedInsts,
|
|
|
|
TypePromotionTransaction &TPT) {
|
2013-01-05 10:09:22 +08:00
|
|
|
ExtAddrMode Result;
|
|
|
|
|
2015-06-05 00:17:38 +08:00
|
|
|
bool Success = AddressingModeMatcher(AddrModeInsts, TM, AccessTy, AS,
|
2015-06-18 04:44:32 +08:00
|
|
|
MemoryInst, Result, InsertedInsts,
|
2015-09-22 07:03:16 +08:00
|
|
|
PromotedInsts, TPT).matchAddr(V, 0);
|
2013-01-05 10:09:22 +08:00
|
|
|
(void)Success; assert(Success && "Couldn't select *anything*?");
|
|
|
|
return Result;
|
|
|
|
}
|
|
|
|
private:
|
2015-09-22 07:03:16 +08:00
|
|
|
bool matchScaledValue(Value *ScaleReg, int64_t Scale, unsigned Depth);
|
|
|
|
bool matchAddr(Value *V, unsigned Depth);
|
|
|
|
bool matchOperationAddr(User *Operation, unsigned Opcode, unsigned Depth,
|
2014-04-14 08:51:57 +08:00
|
|
|
bool *MovedAway = nullptr);
|
2015-09-22 07:03:16 +08:00
|
|
|
bool isProfitableToFoldIntoAddressingMode(Instruction *I,
|
2013-01-05 10:09:22 +08:00
|
|
|
ExtAddrMode &AMBefore,
|
|
|
|
ExtAddrMode &AMAfter);
|
2015-09-22 07:03:16 +08:00
|
|
|
bool valueAlreadyLiveAtInst(Value *Val, Value *KnownLive1, Value *KnownLive2);
|
|
|
|
bool isPromotionProfitable(unsigned NewCost, unsigned OldCost,
|
2014-02-15 06:23:22 +08:00
|
|
|
Value *PromotedOperand) const;
|
2013-01-05 10:09:22 +08:00
|
|
|
};
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Try adding ScaleReg*Scale to the current addressing mode.
|
2013-01-05 10:09:22 +08:00
|
|
|
/// Return true and update AddrMode if this addr mode is legal for the target,
|
|
|
|
/// false if not.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool AddressingModeMatcher::matchScaledValue(Value *ScaleReg, int64_t Scale,
|
2013-01-05 10:09:22 +08:00
|
|
|
unsigned Depth) {
|
|
|
|
// If Scale is 1, then this is the same as adding ScaleReg to the addressing
|
|
|
|
// mode. Just process that directly.
|
|
|
|
if (Scale == 1)
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchAddr(ScaleReg, Depth);
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If the scale is 0, it takes nothing to add this.
|
|
|
|
if (Scale == 0)
|
|
|
|
return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If we already have a scale of this value, we can add to it, otherwise, we
|
|
|
|
// need an available scale field.
|
|
|
|
if (AddrMode.Scale != 0 && AddrMode.ScaledReg != ScaleReg)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
ExtAddrMode TestAddrMode = AddrMode;
|
|
|
|
|
|
|
|
// Add scale to turn X*4+X*3 -> X*7. This could also do things like
|
|
|
|
// [A+B + A*7] -> [B+A*8].
|
|
|
|
TestAddrMode.Scale += Scale;
|
|
|
|
TestAddrMode.ScaledReg = ScaleReg;
|
|
|
|
|
|
|
|
// If the new address isn't legal, bail out.
|
2015-07-09 10:09:40 +08:00
|
|
|
if (!TLI.isLegalAddressingMode(DL, TestAddrMode, AccessTy, AddrSpace))
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
// It was legal, so commit it.
|
|
|
|
AddrMode = TestAddrMode;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Okay, we decided that we can add ScaleReg*Scale to AddrMode. Check now
|
|
|
|
// to see if ScaleReg is actually X+C. If so, we can turn this into adding
|
|
|
|
// X*Scale + C*Scale to addr mode.
|
2014-04-14 08:51:57 +08:00
|
|
|
ConstantInt *CI = nullptr; Value *AddLHS = nullptr;
|
2013-01-05 10:09:22 +08:00
|
|
|
if (isa<Instruction>(ScaleReg) && // not a constant expr.
|
|
|
|
match(ScaleReg, m_Add(m_Value(AddLHS), m_ConstantInt(CI)))) {
|
|
|
|
TestAddrMode.ScaledReg = AddLHS;
|
|
|
|
TestAddrMode.BaseOffs += CI->getSExtValue()*TestAddrMode.Scale;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If this addressing mode is legal, commit it and remember that we folded
|
|
|
|
// this instruction.
|
2015-07-09 10:09:40 +08:00
|
|
|
if (TLI.isLegalAddressingMode(DL, TestAddrMode, AccessTy, AddrSpace)) {
|
2013-01-05 10:09:22 +08:00
|
|
|
AddrModeInsts.push_back(cast<Instruction>(ScaleReg));
|
|
|
|
AddrMode = TestAddrMode;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Otherwise, not (x+c)*scale, just return what we have.
|
|
|
|
return true;
|
|
|
|
}
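// Illustrative sketch only, not part of this pass: the scale-folding steps of
// matchScaledValue on a simplified mode. ScaledMode, LegalFn and addScaled are
// hypothetical names, registers are plain integer ids, and the target query
// TLI.isLegalAddressingMode is replaced by a caller-supplied predicate. The
// Scale == 1 shortcut (delegating to matchAddr) is left out. The pattern match
// on "ScaleReg is AddLHS + C" is modelled by the RegIsAdd/AddLHS/AddConst
// parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

struct ScaledMode {
  int64_t BaseOffs = 0;
  int64_t Scale = 0;
  int ScaledReg = -1; // hypothetical register id, -1 means "none"
};

typedef std::function<bool(const ScaledMode &)> LegalFn;

// Try adding Reg*Scale to AM. Returns true and updates AM if the result is
// legal according to IsLegal, false (leaving AM untouched) otherwise.
bool addScaled(ScaledMode &AM, int Reg, int64_t Scale, const LegalFn &IsLegal,
               bool RegIsAdd = false, int AddLHS = -1, int64_t AddConst = 0) {
  // A zero scale contributes nothing.
  if (Scale == 0)
    return true;
  // The single scale field can only be shared by the same register.
  if (AM.Scale != 0 && AM.ScaledReg != Reg)
    return false;

  // X*4 + X*3 -> X*7.
  ScaledMode Test = AM;
  Test.Scale += Scale;
  Test.ScaledReg = Reg;
  if (!IsLegal(Test))
    return false;
  AM = Test;

  // (X + C)*Scale -> X*Scale + C*Scale, kept only if it is still legal.
  if (RegIsAdd) {
    Test.ScaledReg = AddLHS;
    Test.BaseOffs += AddConst * Test.Scale;
    if (IsLegal(Test))
      AM = Test;
  }
  return true;
}
```

// Working on a copy (Test) and assigning it back only after the legality check
// is what lets a failed attempt leave the mode being built unchanged.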
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// This is a little filter, which returns true if an addressing computation
|
|
|
|
/// involving I might be folded into a load/store accessing it.
|
|
|
|
/// This doesn't need to be perfect, but needs to accept at least
|
2013-01-05 10:09:22 +08:00
|
|
|
/// the set of instructions that matchOperationAddr can.
|
|
|
|
static bool MightBeFoldableInst(Instruction *I) {
|
|
|
|
switch (I->getOpcode()) {
|
|
|
|
case Instruction::BitCast:
|
2014-05-22 08:02:52 +08:00
|
|
|
case Instruction::AddrSpaceCast:
|
2013-01-05 10:09:22 +08:00
|
|
|
// Don't touch identity bitcasts.
|
|
|
|
if (I->getType() == I->getOperand(0)->getType())
|
|
|
|
return false;
|
|
|
|
return I->getType()->isPointerTy() || I->getType()->isIntegerTy();
|
|
|
|
case Instruction::PtrToInt:
|
|
|
|
// PtrToInt is always a noop, as we know that the int type is pointer sized.
|
|
|
|
return true;
|
|
|
|
case Instruction::IntToPtr:
|
|
|
|
// We know the input is intptr_t, so this is foldable.
|
|
|
|
return true;
|
|
|
|
case Instruction::Add:
|
|
|
|
return true;
|
|
|
|
case Instruction::Mul:
|
|
|
|
case Instruction::Shl:
|
|
|
|
// Can only handle X*C and X << C.
|
|
|
|
return isa<ConstantInt>(I->getOperand(1));
|
|
|
|
case Instruction::GetElementPtr:
|
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make
sure it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence occurs frequently in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them into the
loads at the beginning of the chain of computation by promoting the whole chain
to the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
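The transformation above relies on the fact that, for an `add nsw`, extending the result gives the same value as extending the operands and adding in the wider type. The following is a minimal sketch of that equivalence in plain C++, not the pass itself; the function names `extAfterAdd` and `addAfterExt` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Original form: 32-bit add, then sign-extend the result to 64 bits.
// Mirrors:  %add = add nsw i32 %a, %b ;  %s = sext i32 %add to i64
// Assumption (the "nsw" flag): a + b does not overflow a signed 32-bit value.
std::int64_t extAfterAdd(std::int32_t a, std::int32_t b) {
  std::int64_t wide = static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
  assert(wide >= INT32_MIN && wide <= INT32_MAX && "nsw precondition violated");
  // Narrow to the 32-bit result, then sign-extend it back to 64 bits.
  return static_cast<std::int64_t>(static_cast<std::int32_t>(wide));
}

// Promoted form: sign-extend each operand first, then perform a 64-bit add.
// Mirrors:  %s = add nsw i64 (sext %a), (sext %b)
std::int64_t addAfterExt(std::int32_t a, std::int32_t b) {
  return static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
}
```

When one of the operands comes from a load, the extension in the promoted form sits directly on the load, which is the pattern instruction selection can fold into an extended load.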
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
/// \brief Check whether or not \p Val is a legal instruction for \p TLI.
|
|
|
|
/// \note \p Val is assumed to be the product of some type promotion.
|
|
|
|
/// Therefore if \p Val has an undefined state in \p TLI, this is assumed
|
|
|
|
/// to be legal, as the non-promoted value would have had the same state.
|
2015-07-09 10:09:04 +08:00
|
|
|
static bool isPromotedInstructionLegal(const TargetLowering &TLI,
|
|
|
|
const DataLayout &DL, Value *Val) {
|
2014-12-17 09:36:17 +08:00
|
|
|
Instruction *PromotedInst = dyn_cast<Instruction>(Val);
|
|
|
|
if (!PromotedInst)
|
|
|
|
return false;
|
|
|
|
int ISDOpcode = TLI.InstructionOpcodeToISD(PromotedInst->getOpcode());
|
|
|
|
// If the ISDOpcode is undefined, it was undefined before the promotion.
|
|
|
|
if (!ISDOpcode)
|
|
|
|
return true;
|
|
|
|
// Otherwise, check if the promoted instruction is legal or not.
|
|
|
|
return TLI.isOperationLegalOrCustom(
|
2015-07-09 10:09:04 +08:00
|
|
|
ISDOpcode, TLI.getValueType(DL, PromotedInst->getType()));
|
2014-12-17 09:36:17 +08:00
|
|
|
}
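The decision logic of `isPromotedInstructionLegal` can be sketched without LLVM. In the sketch below, `MockTargetInfo`, `IROpcode`, `ValueTy`, `isPromotedOpLegal` and `makeExampleTarget` are hypothetical stand-ins invented for illustration; only the control flow (an undefined ISD opcode is treated as legal, otherwise the target is asked about the promoted type) follows the function above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>

// Hypothetical stand-ins for LLVM's opcode and value-type enums.
enum class IROpcode { Add, Mul, Call };
enum class ValueTy { I32, I64 };

// Hypothetical mock of the two TargetLowering queries used by the check.
struct MockTargetInfo {
  std::map<IROpcode, int> IRToISD;                  // missing entry means "undefined" (0)
  std::set<std::pair<int, ValueTy>> LegalOrCustom;  // (ISD opcode, type) pairs

  int instructionOpcodeToISD(IROpcode Op) const {
    auto It = IRToISD.find(Op);
    return It == IRToISD.end() ? 0 : It->second;
  }
  bool isOperationLegalOrCustom(int ISDOpcode, ValueTy Ty) const {
    return LegalOrCustom.count(std::make_pair(ISDOpcode, Ty)) != 0;
  }
};

// Same shape as the check above: an undefined ISD opcode was already
// undefined before promotion, so it is assumed legal.
bool isPromotedOpLegal(const MockTargetInfo &TLI, IROpcode Op, ValueTy PromotedTy) {
  int ISDOpcode = TLI.instructionOpcodeToISD(Op);
  if (!ISDOpcode)
    return true;
  return TLI.isOperationLegalOrCustom(ISDOpcode, PromotedTy);
}

// Example target: add is legal on i32 and i64, mul only on i32,
// and call has no ISD opcode at all.
MockTargetInfo makeExampleTarget() {
  MockTargetInfo TLI;
  TLI.IRToISD[IROpcode::Add] = 1;
  TLI.IRToISD[IROpcode::Mul] = 2;
  TLI.LegalOrCustom.insert(std::make_pair(1, ValueTy::I32));
  TLI.LegalOrCustom.insert(std::make_pair(1, ValueTy::I64));
  TLI.LegalOrCustom.insert(std::make_pair(2, ValueTy::I32));
  return TLI;
}
```

With this example target, promoting an `add` to `i64` is accepted, promoting a `mul` to `i64` is rejected, and an opcode without an ISD mapping is accepted by default.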
|
|
|
|
|
2014-02-07 05:44:56 +08:00
|
|
|
/// \brief Helper class to perform type promotion.
|
|
|
|
class TypePromotionHelper {
|
2014-11-13 09:44:51 +08:00
|
|
|
/// \brief Utility function to check whether or not a sign or zero extension
|
|
|
|
/// of \p Inst with \p ConsideredExtType can be moved through \p Inst by
|
|
|
|
/// either using the operands of \p Inst or promoting \p Inst.
|
|
|
|
/// The type of the extension is defined by \p IsSExt.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// In other words, check if:
|
2014-11-13 09:44:51 +08:00
|
|
|
/// ext (Ty Inst opnd1 opnd2 ... opndN) to ConsideredExtType.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// #1 Promotion applies:
|
2014-11-13 09:44:51 +08:00
|
|
|
/// ConsideredExtType Inst (ext opnd1 to ConsideredExtType, ...).
|
2014-02-07 05:44:56 +08:00
|
|
|
/// #2 Operand reuses:
|
2014-11-13 09:44:51 +08:00
|
|
|
/// ext opnd1 to ConsideredExtType.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// \p PromotedInsts maps the instructions to their type before promotion.
|
2014-11-13 09:44:51 +08:00
|
|
|
static bool canGetThrough(const Instruction *Inst, Type *ConsideredExtType,
|
|
|
|
const InstrToOrigTy &PromotedInsts, bool IsSExt);
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
/// \brief Utility function to determine if \p OpIdx should be promoted when
|
|
|
|
/// promoting \p Inst.
|
2014-11-13 09:44:51 +08:00
|
|
|
static bool shouldExtOperand(const Instruction *Inst, int OpIdx) {
|
2015-10-25 07:11:13 +08:00
|
|
|
return !(isa<SelectInst>(Inst) && OpIdx == 0);
|
2014-02-07 05:44:56 +08:00
|
|
|
}
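`shouldExtOperand` skips operand 0 of a `select` because that operand is the condition, not a value of the promoted type. A minimal sketch of the rule follows, using a hypothetical `MockInst` record in place of `llvm::Instruction`; `shouldExtOperandSketch` and `operandsToExtend` are likewise names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction: a kind flag plus operand names.
struct MockInst {
  bool IsSelect;
  std::vector<std::string> Operands;
};

// Same rule as shouldExtOperand: every operand is extended except the
// condition (operand 0) of a select.
bool shouldExtOperandSketch(const MockInst &Inst, std::size_t OpIdx) {
  assert(OpIdx < Inst.Operands.size() && "operand index out of range");
  return !(Inst.IsSelect && OpIdx == 0);
}

// Collect the operands that would receive an extension during promotion.
std::vector<std::string> operandsToExtend(const MockInst &Inst) {
  std::vector<std::string> Result;
  for (std::size_t I = 0; I < Inst.Operands.size(); ++I)
    if (shouldExtOperandSketch(Inst, I))
      Result.push_back(Inst.Operands[I]);
  return Result;
}
```

For a select with operands `%cond, %a, %b` only `%a` and `%b` are returned, while a binary operation keeps both of its operands.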
|
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
/// \brief Utility function to promote the operand of \p Ext when this
|
2014-09-12 05:22:14 +08:00
|
|
|
/// operand is a promotable trunc or sext or zext.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// \p PromotedInsts maps the instructions to their type before promotion.
|
2015-03-11 05:48:15 +08:00
|
|
|
/// \p CreatedInstsCost[out] contains the cost of all instructions
|
2014-11-13 09:44:51 +08:00
|
|
|
/// created to promote the operand of Ext.
|
2014-12-17 09:36:17 +08:00
|
|
|
/// Newly added extensions are inserted in \p Exts.
|
|
|
|
/// Newly added truncates are inserted in \p Truncs.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Should never be called directly.
|
2014-11-13 09:44:51 +08:00
|
|
|
/// \return The promoted value which is used instead of Ext.
|
2014-12-17 09:36:17 +08:00
|
|
|
static Value *promoteOperandForTruncAndAnyExt(
|
|
|
|
Instruction *Ext, TypePromotionTransaction &TPT,
|
2015-03-11 05:48:15 +08:00
|
|
|
InstrToOrigTy &PromotedInsts, unsigned &CreatedInstsCost,
|
2014-12-17 09:36:17 +08:00
|
|
|
SmallVectorImpl<Instruction *> *Exts,
|
2015-03-11 05:48:15 +08:00
|
|
|
SmallVectorImpl<Instruction *> *Truncs, const TargetLowering &TLI);
|
2014-02-07 05:44:56 +08:00
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
/// \brief Utility function to promote the operand of \p Ext when this
|
2014-02-07 05:44:56 +08:00
|
|
|
/// operand is promotable and is not a supported trunc or sext.
|
|
|
|
/// \p PromotedInsts maps the instructions to their type before promotion.
|
2015-03-11 05:48:15 +08:00
|
|
|
/// \p CreatedInstsCost[out] contains the cost of all the instructions
|
2014-11-13 09:44:51 +08:00
|
|
|
/// created to promote the operand of Ext.
|
2014-12-17 09:36:17 +08:00
|
|
|
/// Newly added extensions are inserted in \p Exts.
|
|
|
|
/// Newly added truncates are inserted in \p Truncs.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// Should never be called directly.
|
2014-11-13 09:44:51 +08:00
|
|
|
/// \return The promoted value which is used instead of Ext.
|
2015-03-11 05:48:15 +08:00
|
|
|
static Value *promoteOperandForOther(Instruction *Ext,
|
|
|
|
TypePromotionTransaction &TPT,
|
|
|
|
InstrToOrigTy &PromotedInsts,
|
|
|
|
unsigned &CreatedInstsCost,
|
|
|
|
SmallVectorImpl<Instruction *> *Exts,
|
|
|
|
SmallVectorImpl<Instruction *> *Truncs,
|
|
|
|
const TargetLowering &TLI, bool IsSExt);
|
2014-11-13 09:44:51 +08:00
|
|
|
|
|
|
|
/// \see promoteOperandForOther.
|
2015-03-11 05:48:15 +08:00
|
|
|
static Value *signExtendOperandForOther(
|
|
|
|
Instruction *Ext, TypePromotionTransaction &TPT,
|
|
|
|
InstrToOrigTy &PromotedInsts, unsigned &CreatedInstsCost,
|
|
|
|
SmallVectorImpl<Instruction *> *Exts,
|
|
|
|
SmallVectorImpl<Instruction *> *Truncs, const TargetLowering &TLI) {
|
|
|
|
return promoteOperandForOther(Ext, TPT, PromotedInsts, CreatedInstsCost,
|
|
|
|
Exts, Truncs, TLI, true);
|
2014-11-13 09:44:51 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/// \see promoteOperandForOther.
|
2015-03-11 05:48:15 +08:00
|
|
|
static Value *zeroExtendOperandForOther(
|
|
|
|
Instruction *Ext, TypePromotionTransaction &TPT,
|
|
|
|
InstrToOrigTy &PromotedInsts, unsigned &CreatedInstsCost,
|
|
|
|
SmallVectorImpl<Instruction *> *Exts,
|
|
|
|
SmallVectorImpl<Instruction *> *Truncs, const TargetLowering &TLI) {
|
|
|
|
return promoteOperandForOther(Ext, TPT, PromotedInsts, CreatedInstsCost,
|
|
|
|
Exts, Truncs, TLI, false);
|
2014-11-13 09:44:51 +08:00
|
|
|
}
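The two wrappers above exist so that a worker taking an extra `bool IsSExt` can be handed out through a plain function pointer with a fixed signature. The sketch below shows the same pattern in isolation; `extendByte`, `signExtendByte`, `zeroExtendByte`, `ExtAction`, `pickAction` and `applyAction` are hypothetical names for this example and are not part of the pass.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical worker with an extra flag, analogous to the IsSExt
// parameter of promoteOperandForOther.
std::int64_t extendByte(std::uint8_t Bits, bool IsSExt) {
  if (IsSExt)
    return static_cast<std::int64_t>(static_cast<std::int8_t>(Bits));
  return static_cast<std::int64_t>(Bits);
}

// Thin wrappers that fix the flag so both share one signature.
std::int64_t signExtendByte(std::uint8_t Bits) { return extendByte(Bits, true); }
std::int64_t zeroExtendByte(std::uint8_t Bits) { return extendByte(Bits, false); }

// Function-pointer type with the flag-free signature.
typedef std::int64_t (*ExtAction)(std::uint8_t);

// Select the wrapper once; callers then use it without knowing the flag.
ExtAction pickAction(bool IsSExt) {
  return IsSExt ? signExtendByte : zeroExtendByte;
}

std::int64_t applyAction(ExtAction Action, std::uint8_t Bits) {
  assert(Action && "no action selected");
  return Action(Bits);
}
```

A caller can pick the action once based on the kind of extension and then invoke it uniformly, which keeps the extension kind out of every call site.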
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
public:
|
2014-11-13 09:44:51 +08:00
|
|
|
/// Type for the utility function that promotes the operand of Ext.
|
|
|
|
typedef Value *(*Action)(Instruction *Ext, TypePromotionTransaction &TPT,
|
2015-03-11 05:48:15 +08:00
|
|
|
InstrToOrigTy &PromotedInsts,
|
|
|
|
unsigned &CreatedInstsCost,
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector type, so when make
such it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
                           SmallVectorImpl<Instruction *> *Exts,
                           SmallVectorImpl<Instruction *> *Truncs,
                           const TargetLowering &TLI);
  /// \brief Given a sign/zero extend instruction \p Ext, return the appropriate
  /// action to promote the operand of \p Ext instead of using Ext.
  /// \return NULL if no promotable action is possible with the current
  /// sign extension.
  /// \p InsertedInsts keeps track of all the instructions inserted by the
  /// other CodeGenPrepare optimizations. This information is important
  /// because we do not want to promote these instructions as CodeGenPrepare
  /// will reinsert them later, thus creating an infinite loop: create/remove.
  /// \p PromotedInsts maps the instructions to their type before promotion.
  static Action getAction(Instruction *Ext, const SetOfInstrs &InsertedInsts,
                          const TargetLowering &TLI,
                          const InstrToOrigTy &PromotedInsts);
};

bool TypePromotionHelper::canGetThrough(const Instruction *Inst,
                                        Type *ConsideredExtType,
                                        const InstrToOrigTy &PromotedInsts,
                                        bool IsSExt) {
  // The promotion helper does not know how to deal with vector types yet.
  // To be able to fix that, we would need to fix the places where we
  // statically extend, e.g., constants and such.
  if (Inst->getType()->isVectorTy())
    return false;

  // We can always get through zext.
  if (isa<ZExtInst>(Inst))
    return true;

  // sext(sext) is ok too.
  if (IsSExt && isa<SExtInst>(Inst))
    return true;

  // We can get through binary operator, if it is legal. In other words, the
  // binary operator must have a nuw or nsw flag.
  const BinaryOperator *BinOp = dyn_cast<BinaryOperator>(Inst);
  if (BinOp && isa<OverflowingBinaryOperator>(BinOp) &&
      ((!IsSExt && BinOp->hasNoUnsignedWrap()) ||
       (IsSExt && BinOp->hasNoSignedWrap())))
    return true;

  // Check if we can do the following simplification.
  // ext(trunc(opnd)) --> ext(opnd)
  if (!isa<TruncInst>(Inst))
    return false;

  Value *OpndVal = Inst->getOperand(0);
  // Check if we can use this operand in the extension.
  // If the type is larger than the result type of the extension, we cannot.
  if (!OpndVal->getType()->isIntegerTy() ||
      OpndVal->getType()->getIntegerBitWidth() >
          ConsideredExtType->getIntegerBitWidth())
    return false;

  // If the operand of the truncate is not an instruction, we will not have
  // any information on the dropped bits.
  // (Actually we could for constant but it is not worth the extra logic).
  Instruction *Opnd = dyn_cast<Instruction>(OpndVal);
  if (!Opnd)
    return false;

  // Check if the source of the type is narrow enough.
  // I.e., check that trunc just drops extended bits of the same kind of
  // the extension.
  // #1 get the type of the operand and check the kind of the extended bits.
  const Type *OpndType;
  InstrToOrigTy::const_iterator It = PromotedInsts.find(Opnd);
  if (It != PromotedInsts.end() && It->second.getInt() == IsSExt)
    OpndType = It->second.getPointer();
  else if ((IsSExt && isa<SExtInst>(Opnd)) || (!IsSExt && isa<ZExtInst>(Opnd)))
    OpndType = Opnd->getOperand(0)->getType();
  else
    return false;

  // #2 check that the truncate just drops extended bits.
  return Inst->getType()->getIntegerBitWidth() >=
         OpndType->getIntegerBitWidth();
}
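
// Illustrative sketch only, not part of the pass. It models the two less
// obvious legality rules of canGetThrough on plain integers: the nsw rule for
// binary operators and the "trunc only drops extended bits" rule. All helper
// names below (maskTo, truncTo, sextFrom, sextCommutesWithAdd,
// truncOnlyDropsExtBits) are hypothetical and do not exist in LLVM; an iN
// value is assumed to live in the low N bits of a uint64_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not LLVM APIs: an iN value is kept in the low N bits
// of a uint64_t.
uint64_t maskTo(unsigned Bits) {
  return Bits >= 64 ? ~0ULL : ((1ULL << Bits) - 1);
}

uint64_t truncTo(uint64_t V, unsigned Bits) { return V & maskTo(Bits); }

uint64_t sextFrom(uint64_t V, unsigned FromBits, unsigned ToBits) {
  assert(FromBits >= 1 && FromBits <= ToBits && ToBits <= 64);
  V &= maskTo(FromBits);
  uint64_t Sign = 1ULL << (FromBits - 1);
  // (V ^ Sign) - Sign copies the sign bit into every higher bit.
  return ((V ^ Sign) - Sign) & maskTo(ToBits);
}

// Binary operator rule: compares sext(add i8 A, B) to i16 with
// add i16 (sext A), (sext B) for one pair of operands. The two agree when
// the i8 add does not wrap in the signed sense, which is what nsw promises.
bool sextCommutesWithAdd(int8_t A, int8_t B) {
  uint64_t UA = static_cast<uint64_t>(static_cast<int64_t>(A));
  uint64_t UB = static_cast<uint64_t>(static_cast<int64_t>(B));
  uint64_t ExtOfAdd = sextFrom(truncTo(UA + UB, 8), 8, 16);
  uint64_t AddOfExt = truncTo(sextFrom(UA, 8, 16) + sextFrom(UB, 8, 16), 16);
  return ExtOfAdd == AddOfExt;
}

// Trunc rule: for every OrigBits-wide X, compares
// sext(trunc(sext(X) to WideBits) to TruncBits) to ExtBits
// with sext(X) to ExtBits.
bool truncOnlyDropsExtBits(unsigned OrigBits, unsigned WideBits,
                           unsigned TruncBits, unsigned ExtBits) {
  for (uint64_t X = 0; X <= maskTo(OrigBits); ++X) {
    uint64_t Wide = sextFrom(X, OrigBits, WideBits);
    uint64_t Narrow = truncTo(Wide, TruncBits);
    if (sextFrom(Narrow, TruncBits, ExtBits) != sextFrom(X, OrigBits, ExtBits))
      return false;
  }
  return true;
}
```

// In this model, truncOnlyDropsExtBits(8, 32, 16, 64) holds because the
// truncated width (16) is at least the original width (8), mirroring check #2
// above, whereas a truncation to 4 bits loses real data and the identity
// fails.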

TypePromotionHelper::Action TypePromotionHelper::getAction(
    Instruction *Ext, const SetOfInstrs &InsertedInsts,
    const TargetLowering &TLI, const InstrToOrigTy &PromotedInsts) {
  assert((isa<SExtInst>(Ext) || isa<ZExtInst>(Ext)) &&
         "Unexpected instruction type");
  Instruction *ExtOpnd = dyn_cast<Instruction>(Ext->getOperand(0));
  Type *ExtTy = Ext->getType();
  bool IsSExt = isa<SExtInst>(Ext);
  // If the operand of the extension is not an instruction, we cannot
  // get through.
  // If it is, check we can get through.
  if (!ExtOpnd || !canGetThrough(ExtOpnd, ExtTy, PromotedInsts, IsSExt))
    return nullptr;

  // Do not promote if the operand has been added by codegenprepare.
  // Otherwise, it means we are undoing an optimization that is likely to be
  // redone, thus causing a potential infinite loop.
  if (isa<TruncInst>(ExtOpnd) && InsertedInsts.count(ExtOpnd))
    return nullptr;

  // SExt, ZExt or Trunc instructions.
  // Return the related handler.
  if (isa<SExtInst>(ExtOpnd) || isa<TruncInst>(ExtOpnd) ||
      isa<ZExtInst>(ExtOpnd))
    return promoteOperandForTruncAndAnyExt;

  // Regular instruction.
  // Abort early if we will have to insert non-free instructions.
  if (!ExtOpnd->hasOneUse() && !TLI.isTruncateFree(ExtTy, ExtOpnd->getType()))
    return nullptr;
  return IsSExt ? signExtendOperandForOther : zeroExtendOperandForOther;
}

Value *TypePromotionHelper::promoteOperandForTruncAndAnyExt(
    llvm::Instruction *SExt, TypePromotionTransaction &TPT,
    InstrToOrigTy &PromotedInsts, unsigned &CreatedInstsCost,
    SmallVectorImpl<Instruction *> *Exts,
    SmallVectorImpl<Instruction *> *Truncs, const TargetLowering &TLI) {
  // By construction, the operand of SExt is an instruction. Otherwise we cannot
  // get through it and this method should not be called.
  Instruction *SExtOpnd = cast<Instruction>(SExt->getOperand(0));
  Value *ExtVal = SExt;
  bool HasMergedNonFreeExt = false;
  if (isa<ZExtInst>(SExtOpnd)) {
    // Replace s|zext(zext(opnd))
    // => zext(opnd).
    HasMergedNonFreeExt = !TLI.isExtFree(SExtOpnd);
    Value *ZExt =
        TPT.createZExt(SExt, SExtOpnd->getOperand(0), SExt->getType());
    TPT.replaceAllUsesWith(SExt, ZExt);
    TPT.eraseInstruction(SExt);
    ExtVal = ZExt;
  } else {
    // Replace z|sext(trunc(opnd)) or sext(sext(opnd))
    // => z|sext(opnd).
    TPT.setOperand(SExt, 0, SExtOpnd->getOperand(0));
  }
  CreatedInstsCost = 0;

  // Remove dead code.
  if (SExtOpnd->use_empty())
    TPT.eraseInstruction(SExtOpnd);

  // Check if the extension is still needed.
  Instruction *ExtInst = dyn_cast<Instruction>(ExtVal);
  if (!ExtInst || ExtInst->getType() != ExtInst->getOperand(0)->getType()) {
    if (ExtInst) {
      if (Exts)
        Exts->push_back(ExtInst);
      CreatedInstsCost = !TLI.isExtFree(ExtInst) && !HasMergedNonFreeExt;
    }
    return ExtVal;
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector type, so when make
such it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %es # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequences happen a lot on code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
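The transformation above relies on the extension commuting with the promoted instructions. A minimal sketch of that property for a single add, assuming plain C++ integers stand in for the i32/i64 IR values; the function names are hypothetical and are not part of CodeGenPrepare:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Original form: 32-bit add, then sign extension of the result.
// Models: %add = add i32 %x, %y ; %r = sext i32 %add to i64
// The add is done on unsigned values so the C++ itself has no signed overflow.
inline int64_t extAfterAdd(int32_t x, int32_t y) {
  uint32_t wrapped = static_cast<uint32_t>(x) + static_cast<uint32_t>(y);
  return static_cast<int64_t>(static_cast<int32_t>(wrapped));
}

// Promoted form: sign-extend the operands first, then add in 64 bits.
// Models: %r = add i64 (sext %x), (sext %y)
inline int64_t addAfterExt(int32_t x, int32_t y) {
  return static_cast<int64_t>(x) + static_cast<int64_t>(y);
}

// True when the 32-bit add does not wrap in the signed sense, which is the
// situation the nsw flag in the example IR describes.
inline bool addHasNoSignedWrap(int32_t x, int32_t y) {
  int64_t wide = addAfterExt(x, y);
  return wide >= std::numeric_limits<int32_t>::min() &&
         wide <= std::numeric_limits<int32_t>::max();
}
```

Both forms agree whenever addHasNoSignedWrap holds; they differ once the narrow add wraps, which is why the adds in the example carry nsw.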
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions beyond noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
}
|
2014-02-07 05:44:56 +08:00
|
|
|
|
2014-09-16 02:26:58 +08:00
|
|
|
// At this point we have: ext ty opnd to ty.
|
|
|
|
// Reassign the uses of ExtInst to the opnd and remove ExtInst.
|
|
|
|
Value *NextVal = ExtInst->getOperand(0);
|
|
|
|
TPT.eraseInstruction(ExtInst, NextVal);
|
2014-02-07 05:44:56 +08:00
|
|
|
return NextVal;
|
|
|
|
}
|
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
Value *TypePromotionHelper::promoteOperandForOther(
|
|
|
|
Instruction *Ext, TypePromotionTransaction &TPT,
|
2015-03-11 05:48:15 +08:00
|
|
|
InstrToOrigTy &PromotedInsts, unsigned &CreatedInstsCost,
|
2014-12-17 09:36:17 +08:00
|
|
|
SmallVectorImpl<Instruction *> *Exts,
|
2015-03-11 05:48:15 +08:00
|
|
|
SmallVectorImpl<Instruction *> *Truncs, const TargetLowering &TLI,
|
|
|
|
bool IsSExt) {
|
2014-11-13 09:44:51 +08:00
|
|
|
// By construction, the operand of Ext is an instruction. Otherwise we cannot
|
2014-02-07 05:44:56 +08:00
|
|
|
// get through it and this method should not be called.
|
2014-11-13 09:44:51 +08:00
|
|
|
Instruction *ExtOpnd = cast<Instruction>(Ext->getOperand(0));
|
2015-03-11 05:48:15 +08:00
|
|
|
CreatedInstsCost = 0;
|
2014-11-13 09:44:51 +08:00
|
|
|
if (!ExtOpnd->hasOneUse()) {
|
|
|
|
// ExtOpnd will be promoted.
|
|
|
|
// All its uses, but Ext, will need to use a truncated value of the
|
2014-02-07 05:44:56 +08:00
|
|
|
// promoted version.
|
|
|
|
// Create the truncate now.
|
2014-11-13 09:44:51 +08:00
|
|
|
Value *Trunc = TPT.createTrunc(Ext, ExtOpnd->getType());
|
2014-09-17 06:36:07 +08:00
|
|
|
if (Instruction *ITrunc = dyn_cast<Instruction>(Trunc)) {
|
|
|
|
ITrunc->removeFromParent();
|
|
|
|
// Insert it just after the definition.
|
2014-11-13 09:44:51 +08:00
|
|
|
ITrunc->insertAfter(ExtOpnd);
|
2014-12-17 09:36:17 +08:00
|
|
|
if (Truncs)
|
|
|
|
Truncs->push_back(ITrunc);
|
2014-09-17 06:36:07 +08:00
|
|
|
}
|
2014-02-07 05:44:56 +08:00
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.replaceAllUsesWith(ExtOpnd, Trunc);
|
2015-10-10 02:01:03 +08:00
|
|
|
// Restore the operand of Ext (which has been replaced by the previous call
|
2014-02-07 05:44:56 +08:00
|
|
|
// to replaceAllUsesWith) to avoid creating a cycle trunc <-> sext.
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.setOperand(Ext, 0, ExtOpnd);
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Get through the Instruction:
|
|
|
|
// 1. Update its type.
|
2014-11-13 09:44:51 +08:00
|
|
|
// 2. Replace the uses of Ext by Inst.
|
|
|
|
// 3. Extend each operand that needs to be extended.
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
// Remember the original type of the instruction before promotion.
|
|
|
|
// This is useful to know that the high bits are sign extended bits.
|
2014-11-13 09:44:51 +08:00
|
|
|
PromotedInsts.insert(std::pair<Instruction *, TypeIsSExt>(
|
|
|
|
ExtOpnd, TypeIsSExt(ExtOpnd->getType(), IsSExt)));
|
2014-02-07 05:44:56 +08:00
|
|
|
// Step #1.
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.mutateType(ExtOpnd, Ext->getType());
|
2014-02-07 05:44:56 +08:00
|
|
|
// Step #2.
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.replaceAllUsesWith(Ext, ExtOpnd);
|
2014-02-07 05:44:56 +08:00
|
|
|
// Step #3.
|
2014-11-13 09:44:51 +08:00
|
|
|
Instruction *ExtForOpnd = Ext;
|
2014-02-07 05:44:56 +08:00
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
DEBUG(dbgs() << "Propagate Ext to operands\n");
|
|
|
|
for (int OpIdx = 0, EndOpIdx = ExtOpnd->getNumOperands(); OpIdx != EndOpIdx;
|
2014-02-07 05:44:56 +08:00
|
|
|
++OpIdx) {
|
2014-11-13 09:44:51 +08:00
|
|
|
DEBUG(dbgs() << "Operand:\n" << *(ExtOpnd->getOperand(OpIdx)) << '\n');
|
|
|
|
if (ExtOpnd->getOperand(OpIdx)->getType() == Ext->getType() ||
|
|
|
|
!shouldExtOperand(ExtOpnd, OpIdx)) {
|
2014-02-07 05:44:56 +08:00
|
|
|
DEBUG(dbgs() << "No need to propagate\n");
|
|
|
|
continue;
|
|
|
|
}
|
2014-11-13 09:44:51 +08:00
|
|
|
// Check if we can statically extend the operand.
|
|
|
|
Value *Opnd = ExtOpnd->getOperand(OpIdx);
|
2014-02-07 05:44:56 +08:00
|
|
|
if (const ConstantInt *Cst = dyn_cast<ConstantInt>(Opnd)) {
|
2014-11-13 09:44:51 +08:00
|
|
|
DEBUG(dbgs() << "Statically extend\n");
|
|
|
|
unsigned BitWidth = Ext->getType()->getIntegerBitWidth();
|
|
|
|
APInt CstVal = IsSExt ? Cst->getValue().sext(BitWidth)
|
|
|
|
: Cst->getValue().zext(BitWidth);
|
|
|
|
TPT.setOperand(ExtOpnd, OpIdx, ConstantInt::get(Ext->getType(), CstVal));
|
2014-02-07 05:44:56 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
// UndefValues are typed, so we have to statically sign extend them.
|
|
|
|
if (isa<UndefValue>(Opnd)) {
|
2014-11-13 09:44:51 +08:00
|
|
|
DEBUG(dbgs() << "Statically extend\n");
|
|
|
|
TPT.setOperand(ExtOpnd, OpIdx, UndefValue::get(Ext->getType()));
|
2014-02-07 05:44:56 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Otherwise we have to explicitly sign extend the operand.
|
2014-11-13 09:44:51 +08:00
|
|
|
// Check if Ext was reused to extend an operand.
|
|
|
|
if (!ExtForOpnd) {
|
2014-02-07 05:44:56 +08:00
|
|
|
// If yes, create a new one.
|
2014-11-13 09:44:51 +08:00
|
|
|
DEBUG(dbgs() << "More operands to ext\n");
|
2014-12-23 02:11:52 +08:00
|
|
|
Value *ValForExtOpnd = IsSExt ? TPT.createSExt(Ext, Opnd, Ext->getType())
|
|
|
|
: TPT.createZExt(Ext, Opnd, Ext->getType());
|
|
|
|
if (!isa<Instruction>(ValForExtOpnd)) {
|
|
|
|
TPT.setOperand(ExtOpnd, OpIdx, ValForExtOpnd);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ExtForOpnd = cast<Instruction>(ValForExtOpnd);
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
2014-12-17 09:36:17 +08:00
|
|
|
if (Exts)
|
|
|
|
Exts->push_back(ExtForOpnd);
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.setOperand(ExtForOpnd, 0, Opnd);
|
2014-02-07 05:44:56 +08:00
|
|
|
|
|
|
|
// Move the sign extension before the insertion point.
|
2014-11-13 09:44:51 +08:00
|
|
|
TPT.moveBefore(ExtForOpnd, ExtOpnd);
|
|
|
|
TPT.setOperand(ExtOpnd, OpIdx, ExtForOpnd);
|
2015-03-11 05:48:15 +08:00
|
|
|
CreatedInstsCost += !TLI.isExtFree(ExtForOpnd);
|
2014-02-07 05:44:56 +08:00
|
|
|
// If more sext are required, new instructions will have to be created.
|
2014-11-13 09:44:51 +08:00
|
|
|
ExtForOpnd = nullptr;
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
2014-11-13 09:44:51 +08:00
|
|
|
if (ExtForOpnd == Ext) {
|
|
|
|
DEBUG(dbgs() << "Extension is useless now\n");
|
|
|
|
TPT.eraseInstruction(Ext);
|
2014-02-07 05:44:56 +08:00
|
|
|
}
|
2014-11-13 09:44:51 +08:00
|
|
|
return ExtOpnd;
|
2014-02-07 05:44:56 +08:00
|
|
|
}
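The "statically extend" branch of promoteOperandForOther widens a constant operand with either a sign or a zero extension, depending on IsSExt. A minimal sketch of that choice, assuming raw 64-bit integers stand in for APInt; zextBits, sextBits and extendConstant are hypothetical helpers, not LLVM API:

```cpp
#include <cassert>
#include <cstdint>

// Zero-extend: keep the low `width` bits, clear everything above them.
inline uint64_t zextBits(uint64_t bits, unsigned width) {
  assert(width >= 1 && width <= 64);
  return width == 64 ? bits : bits & ((uint64_t{1} << width) - 1);
}

// Sign-extend: replicate bit `width - 1` into all higher bits.
inline uint64_t sextBits(uint64_t bits, unsigned width) {
  uint64_t low = zextBits(bits, width);
  if (width == 64)
    return low;
  uint64_t signBit = uint64_t{1} << (width - 1);
  // Flipping the sign bit and subtracting it back propagates it upward.
  return (low ^ signBit) - signBit;
}

// Pick the extension kind the same way the IsSExt flag does in the pass.
inline uint64_t extendConstant(uint64_t bits, unsigned width, bool isSExt) {
  return isSExt ? sextBits(bits, width) : zextBits(bits, width);
}
```

The two extensions only differ when the top bit of the narrow constant is set, so choosing the wrong one would silently change negative constants.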
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Check whether or not promoting an instruction to a wider type is profitable.
|
2015-03-11 05:48:15 +08:00
|
|
|
/// \p NewCost gives the cost of extension instructions created by the
|
|
|
|
/// promotion.
|
|
|
|
/// \p OldCost gives the cost of extension instructions before the promotion
|
|
|
|
/// plus the number of instructions that have been
|
|
|
|
/// matched in the addressing mode of the promotion.
|
2014-02-15 06:23:22 +08:00
|
|
|
/// \p PromotedOperand is the value that has been promoted.
|
|
|
|
/// \return True if the promotion is profitable, false otherwise.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool AddressingModeMatcher::isPromotionProfitable(
|
2015-03-11 05:48:15 +08:00
|
|
|
unsigned NewCost, unsigned OldCost, Value *PromotedOperand) const {
|
|
|
|
DEBUG(dbgs() << "OldCost: " << OldCost << "\tNewCost: " << NewCost << '\n');
|
|
|
|
// The cost of the new extensions is greater than the cost of the
|
|
|
|
// old extension plus what we folded.
|
2014-02-15 06:23:22 +08:00
|
|
|
// This is not profitable.
|
2015-03-11 05:48:15 +08:00
|
|
|
if (NewCost > OldCost)
|
2014-02-15 06:23:22 +08:00
|
|
|
return false;
|
2015-03-11 05:48:15 +08:00
|
|
|
if (NewCost < OldCost)
|
2014-02-15 06:23:22 +08:00
|
|
|
return true;
|
|
|
|
// The promotion is neutral but it may help folding the sign extension in
|
|
|
|
// loads for instance.
|
|
|
|
// Check that we did not create an illegal instruction.
|
2015-07-09 10:09:04 +08:00
|
|
|
return isPromotedInstructionLegal(TLI, DL, PromotedOperand);
|
2014-02-15 06:23:22 +08:00
|
|
|
}
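The decision in isPromotionProfitable is a three-way comparison with a legality check used only as a tie-breaker. A minimal standalone sketch, assuming a callback stands in for isPromotedInstructionLegal; the function name and the callback are hypothetical:

```cpp
#include <cassert>
#include <functional>

// Returns true when the promotion should be kept.
// newCost: cost of the extensions created by the promotion.
// oldCost: cost of the original extension plus what was folded.
// isLegal: only consulted when the two costs are equal.
inline bool isPromotionProfitableSketch(unsigned newCost, unsigned oldCost,
                                        const std::function<bool()> &isLegal) {
  if (newCost > oldCost)
    return false; // Strictly worse: reject.
  if (newCost < oldCost)
    return true;  // Strictly better: accept.
  // Neutral: accept only if the promoted instruction is legal.
  return isLegal();
}
```

Passing the legality test as a callback keeps it lazy, matching the original function where the check runs only in the neutral case.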
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Given an instruction or constant expr, see if we can fold the operation
|
2015-10-10 02:01:03 +08:00
|
|
|
/// into the addressing mode. If so, update the addressing mode and return
|
2015-09-22 06:47:23 +08:00
|
|
|
/// true, otherwise return false without modifying AddrMode.
|
2014-02-07 05:44:56 +08:00
|
|
|
/// If \p MovedAway is not NULL, it contains the information of whether or
|
|
|
|
/// not AddrInst has to be folded into the addressing mode on success.
|
|
|
|
/// If \p MovedAway == true, \p AddrInst will not be part of the addressing
|
|
|
|
/// mode because it has been moved away.
|
|
|
|
/// Thus AddrInst must not be added in the matched instructions.
|
|
|
|
/// This state can happen when AddrInst is a sext, since it may be moved away.
|
|
|
|
/// Therefore, AddrInst may not be valid when MovedAway is true and it must
|
|
|
|
/// not be referenced anymore.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool AddressingModeMatcher::matchOperationAddr(User *AddrInst, unsigned Opcode,
|
2014-02-07 05:44:56 +08:00
|
|
|
unsigned Depth,
|
|
|
|
bool *MovedAway) {
|
2013-01-05 10:09:22 +08:00
|
|
|
// Avoid exponential behavior on extremely deep expression trees.
|
|
|
|
if (Depth >= 5) return false;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2014-02-07 05:44:56 +08:00
|
|
|
// By default, all matched instructions stay in place.
|
|
|
|
if (MovedAway)
|
|
|
|
*MovedAway = false;
|
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
switch (Opcode) {
|
|
|
|
case Instruction::PtrToInt:
|
|
|
|
// PtrToInt is always a noop, as we know that the int type is pointer sized.
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchAddr(AddrInst->getOperand(0), Depth);
|
2015-07-09 10:09:04 +08:00
|
|
|
case Instruction::IntToPtr: {
|
|
|
|
auto AS = AddrInst->getType()->getPointerAddressSpace();
|
|
|
|
auto PtrTy = MVT::getIntegerVT(DL.getPointerSizeInBits(AS));
|
2013-01-05 10:09:22 +08:00
|
|
|
// This inttoptr is a no-op if the integer type is pointer sized.
|
2015-07-09 10:09:04 +08:00
|
|
|
if (TLI.getValueType(DL, AddrInst->getOperand(0)->getType()) == PtrTy)
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchAddr(AddrInst->getOperand(0), Depth);
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
2015-07-09 10:09:04 +08:00
|
|
|
}
|
2013-01-05 10:09:22 +08:00
|
|
|
case Instruction::BitCast:
|
|
|
|
// BitCast is always a noop, and we can handle it as long as it is
|
|
|
|
// int->int or pointer->pointer (we don't want int<->fp or something).
|
|
|
|
if ((AddrInst->getOperand(0)->getType()->isPointerTy() ||
|
|
|
|
AddrInst->getOperand(0)->getType()->isIntegerTy()) &&
|
|
|
|
// Don't touch identity bitcasts. These were probably put here by LSR,
|
|
|
|
// and we don't want to mess around with them. Assume it knows what it
|
|
|
|
// is doing.
|
|
|
|
AddrInst->getOperand(0)->getType() != AddrInst->getType())
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchAddr(AddrInst->getOperand(0), Depth);
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
2015-05-27 00:59:43 +08:00
|
|
|
case Instruction::AddrSpaceCast: {
|
|
|
|
unsigned SrcAS
|
|
|
|
= AddrInst->getOperand(0)->getType()->getPointerAddressSpace();
|
|
|
|
unsigned DestAS = AddrInst->getType()->getPointerAddressSpace();
|
|
|
|
if (TLI.isNoopAddrSpaceCast(SrcAS, DestAS))
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchAddr(AddrInst->getOperand(0), Depth);
|
2015-05-27 00:59:43 +08:00
|
|
|
return false;
|
|
|
|
}
|
2013-01-05 10:09:22 +08:00
|
|
|
case Instruction::Add: {
|
|
|
|
// Check to see if we can merge in the RHS then the LHS. If so, we win.
|
|
|
|
ExtAddrMode BackupAddrMode = AddrMode;
|
|
|
|
unsigned OldSize = AddrModeInsts.size();
|
2014-02-07 05:44:56 +08:00
|
|
|
// Start a transaction at this point.
|
|
|
|
// The LHS may match but not the RHS.
|
|
|
|
// Therefore, we need a higher level restoration point to undo partially
|
|
|
|
// matched operation.
|
|
|
|
TypePromotionTransaction::ConstRestorationPt LastKnownGood =
|
|
|
|
TPT.getRestorationPoint();
|
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
if (matchAddr(AddrInst->getOperand(1), Depth+1) &&
|
|
|
|
matchAddr(AddrInst->getOperand(0), Depth+1))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Restore the old addr mode info.
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
2014-02-07 05:44:56 +08:00
|
|
|
TPT.rollback(LastKnownGood);
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Otherwise this was over-aggressive. Try merging in the LHS then the RHS.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (matchAddr(AddrInst->getOperand(0), Depth+1) &&
|
|
|
|
matchAddr(AddrInst->getOperand(1), Depth+1))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Otherwise we definitely can't merge the ADD in.
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
2014-02-07 05:44:56 +08:00
|
|
|
TPT.rollback(LastKnownGood);
|
2013-01-05 10:09:22 +08:00
|
|
|
break;
|
|
|
|
}
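The Add case above follows a snapshot, try, restore pattern: it saves the addressing mode and the size of the matched-instruction list, tries one operand order, and restores both before trying the other. A minimal sketch of that pattern, assuming a toy addressing mode with one base register, one index register and an offset; all types and names here are hypothetical, and the TypePromotionTransaction rollback is not modelled:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Operand {
  bool isConstant;
  int64_t value; // Constant value, or a register id when !isConstant.
};

struct AddrModeSketch {
  bool hasBaseReg = false;
  int64_t baseReg = 0;
  bool hasIndexReg = false;
  int64_t indexReg = 0;
  int64_t baseOffs = 0;
};

struct MatcherSketch {
  AddrModeSketch mode;
  std::vector<int64_t> matchedRegs;

  // Fold one operand: constants go to the offset, registers fill the base
  // slot first and the index slot second. Fails when both slots are taken.
  bool matchOperand(const Operand &op) {
    if (op.isConstant) {
      mode.baseOffs += op.value;
      return true;
    }
    if (!mode.hasBaseReg) {
      mode.hasBaseReg = true;
      mode.baseReg = op.value;
    } else if (!mode.hasIndexReg) {
      mode.hasIndexReg = true;
      mode.indexReg = op.value;
    } else {
      return false;
    }
    matchedRegs.push_back(op.value);
    return true;
  }

  // Try RHS then LHS, then LHS then RHS, restoring state after each failure.
  bool matchAdd(const Operand &lhs, const Operand &rhs) {
    const AddrModeSketch backup = mode;
    const std::size_t oldSize = matchedRegs.size();
    if (matchOperand(rhs) && matchOperand(lhs))
      return true;
    mode = backup;
    matchedRegs.resize(oldSize);
    if (matchOperand(lhs) && matchOperand(rhs))
      return true;
    mode = backup;
    matchedRegs.resize(oldSize);
    return false;
  }
};
```

The restore matters because the first operand of an attempt may already have been folded when the second one fails; without it a rejected add would leave a partially updated offset behind.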
|
|
|
|
//case Instruction::Or:
|
|
|
|
// TODO: We can handle "Or Val, Imm" iff this OR is equivalent to an ADD.
|
|
|
|
//break;
|
|
|
|
case Instruction::Mul:
|
|
|
|
case Instruction::Shl: {
|
|
|
|
// Can only handle X*C and X << C.
|
|
|
|
ConstantInt *RHS = dyn_cast<ConstantInt>(AddrInst->getOperand(1));
|
2014-07-17 06:40:28 +08:00
|
|
|
if (!RHS)
|
|
|
|
return false;
|
2013-01-05 10:09:22 +08:00
|
|
|
int64_t Scale = RHS->getSExtValue();
|
|
|
|
if (Opcode == Instruction::Shl)
|
|
|
|
Scale = 1LL << Scale;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
return matchScaledValue(AddrInst->getOperand(0), Scale, Depth);
|
2013-01-05 10:09:22 +08:00
|
|
|
}
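The Mul/Shl case reduces both opcodes to a single multiplicative scale: X*C uses C directly and X << C uses 2^C. A minimal sketch of that conversion, assuming a hypothetical helper scaleFromRhs; the range check on the shift amount is an addition of this sketch and does not appear in the code above:

```cpp
#include <cassert>
#include <cstdint>

// Convert the constant right-hand side of a mul or shl into a scale factor.
// Returns false when a shift amount cannot be represented in int64_t.
inline bool scaleFromRhs(bool isShl, int64_t rhs, int64_t &scale) {
  if (!isShl) {
    scale = rhs; // X * C scales by C.
    return true;
  }
  if (rhs < 0 || rhs > 62)
    return false; // Shifting by this amount would overflow or be undefined.
  scale = int64_t{1} << rhs; // X << C scales by 2^C.
  return true;
}
```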
|
|
|
|
case Instruction::GetElementPtr: {
|
|
|
|
// Scan the GEP. We check whether it contains constant offsets and at most
|
|
|
|
// one variable offset.
|
|
|
|
int VariableOperand = -1;
|
|
|
|
unsigned VariableScale = 0;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
int64_t ConstantOffset = 0;
|
|
|
|
gep_type_iterator GTI = gep_type_begin(AddrInst);
|
|
|
|
for (unsigned i = 1, e = AddrInst->getNumOperands(); i != e; ++i, ++GTI) {
|
|
|
|
if (StructType *STy = dyn_cast<StructType>(*GTI)) {
|
2015-07-08 02:45:17 +08:00
|
|
|
const StructLayout *SL = DL.getStructLayout(STy);
|
2013-01-05 10:09:22 +08:00
|
|
|
unsigned Idx =
|
|
|
|
cast<ConstantInt>(AddrInst->getOperand(i))->getZExtValue();
|
|
|
|
ConstantOffset += SL->getElementOffset(Idx);
|
|
|
|
} else {
|
2015-07-08 02:45:17 +08:00
|
|
|
uint64_t TypeSize = DL.getTypeAllocSize(GTI.getIndexedType());
|
2013-01-05 10:09:22 +08:00
|
|
|
if (ConstantInt *CI = dyn_cast<ConstantInt>(AddrInst->getOperand(i))) {
|
|
|
|
ConstantOffset += CI->getSExtValue()*TypeSize;
|
|
|
|
} else if (TypeSize) { // Scales of zero don't do anything.
|
|
|
|
// We only allow one variable index at the moment.
|
|
|
|
if (VariableOperand != -1)
|
|
|
|
return false;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Remember the variable index.
|
|
|
|
VariableOperand = i;
|
|
|
|
VariableScale = TypeSize;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// A common case is for the GEP to only do a constant offset. In this case,
|
|
|
|
// just add it to the disp field and check validity.
|
|
|
|
if (VariableOperand == -1) {
|
|
|
|
AddrMode.BaseOffs += ConstantOffset;
|
2015-06-05 00:17:38 +08:00
|
|
|
if (ConstantOffset == 0 ||
|
2015-07-09 10:09:40 +08:00
|
|
|
TLI.isLegalAddressingMode(DL, AddrMode, AccessTy, AddrSpace)) {
|
2013-01-05 10:09:22 +08:00
|
|
|
// Check to see if we can fold the base pointer in too.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (matchAddr(AddrInst->getOperand(0), Depth+1))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
AddrMode.BaseOffs -= ConstantOffset;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Save the valid addressing mode in case we can't match.
|
|
|
|
ExtAddrMode BackupAddrMode = AddrMode;
|
|
|
|
unsigned OldSize = AddrModeInsts.size();
|
|
|
|
|
|
|
|
// See if the scale and offset amount is valid for this target.
|
|
|
|
AddrMode.BaseOffs += ConstantOffset;
|
|
|
|
|
|
|
|
// Match the base operand of the GEP.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (!matchAddr(AddrInst->getOperand(0), Depth+1)) {
|
2013-01-05 10:09:22 +08:00
|
|
|
// If it couldn't be matched, just stuff the value in a register.
|
|
|
|
if (AddrMode.HasBaseReg) {
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
AddrMode.HasBaseReg = true;
|
|
|
|
AddrMode.BaseReg = AddrInst->getOperand(0);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Match the remaining variable portion of the GEP.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (!matchScaledValue(AddrInst->getOperand(VariableOperand), VariableScale,
|
2013-01-05 10:09:22 +08:00
|
|
|
Depth)) {
|
|
|
|
// If it couldn't be matched, try stuffing the base into a register
|
|
|
|
// instead of matching it, and retrying the match of the scale.
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
|
|
|
if (AddrMode.HasBaseReg)
|
|
|
|
return false;
|
|
|
|
AddrMode.HasBaseReg = true;
|
|
|
|
AddrMode.BaseReg = AddrInst->getOperand(0);
|
|
|
|
AddrMode.BaseOffs += ConstantOffset;
|
2015-09-22 07:03:16 +08:00
|
|
|
if (!matchScaledValue(AddrInst->getOperand(VariableOperand),
|
2013-01-05 10:09:22 +08:00
|
|
|
VariableScale, Depth)) {
|
|
|
|
// If even that didn't work, bail.
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
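The first loop of the GetElementPtr case splits the GEP into a constant byte offset plus at most one variable index with its scale. A minimal sketch of that scan, assuming each index is already described by a small record carrying the field offset or element size; GepIndexSketch, GepSummary and summarizeGep are hypothetical and do not use DataLayout:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GepIndexSketch {
  bool isStructField;    // True when the index selects a struct field.
  bool isConstant;       // Struct field indices are always constant.
  int64_t constant;      // Constant index value (unused when !isConstant).
  uint64_t sizeOrOffset; // Field byte offset, or element allocation size.
};

struct GepSummary {
  int64_t constantOffset = 0;
  int variableOperand = -1; // Operand number; operand 0 is the base pointer.
  uint64_t variableScale = 0;
};

// Returns false when more than one variable index contributes to the address.
inline bool summarizeGep(const std::vector<GepIndexSketch> &indices,
                         GepSummary &out) {
  GepSummary s;
  for (std::size_t i = 0; i < indices.size(); ++i) {
    const GepIndexSketch &idx = indices[i];
    if (idx.isStructField) {
      s.constantOffset += static_cast<int64_t>(idx.sizeOrOffset);
    } else if (idx.isConstant) {
      s.constantOffset += idx.constant * static_cast<int64_t>(idx.sizeOrOffset);
    } else if (idx.sizeOrOffset != 0) { // Zero-sized elements add nothing.
      if (s.variableOperand != -1)
        return false; // Only one variable index is allowed.
      s.variableOperand = static_cast<int>(i) + 1;
      s.variableScale = idx.sizeOrOffset;
    }
  }
  out = s;
  return true;
}
```

With this summary in hand, the constant part maps to the displacement field and the variable part to a scaled register, which is how the rest of the case uses ConstantOffset, VariableOperand and VariableScale.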
|
2014-11-13 09:44:51 +08:00
|
|
|
case Instruction::SExt:
|
|
|
|
case Instruction::ZExt: {
|
|
|
|
Instruction *Ext = dyn_cast<Instruction>(AddrInst);
|
|
|
|
if (!Ext)
|
2014-07-17 06:40:28 +08:00
|
|
|
return false;
|
2014-07-17 05:08:10 +08:00
|
|
|
|
2014-11-13 09:44:51 +08:00
|
|
|
// Try to move this ext out of the way of the addressing mode.
|
2014-02-07 05:44:56 +08:00
|
|
|
// Ask for a method for doing so.
|
2014-11-13 09:44:51 +08:00
|
|
|
TypePromotionHelper::Action TPH =
|
2015-06-18 04:44:32 +08:00
|
|
|
TypePromotionHelper::getAction(Ext, InsertedInsts, TLI, PromotedInsts);
|
2014-02-07 05:44:56 +08:00
|
|
|
if (!TPH)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
TypePromotionTransaction::ConstRestorationPt LastKnownGood =
|
|
|
|
TPT.getRestorationPoint();
|
2015-03-11 05:48:15 +08:00
|
|
|
unsigned CreatedInstsCost = 0;
|
|
|
|
unsigned ExtCost = !TLI.isExtFree(Ext);
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it
does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them into the
loads at the beginning of the chain of computation by promoting the whole chain
of computation to the extended type. The promotion is done if and only if we do
not introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
Value *PromotedOperand =
|
2015-03-11 05:48:15 +08:00
|
|
|
TPH(Ext, TPT, PromotedInsts, CreatedInstsCost, nullptr, nullptr, TLI);
|
2014-02-07 05:44:56 +08:00
|
|
|
// The ext (sext or zext) has been moved away.
|
|
|
|
// Thus either it will be rematched later in the recursive calls or it is
|
|
|
|
// gone. Anyway, we must not fold it into the addressing mode at this point.
|
|
|
|
// E.g.,
|
|
|
|
// op = add opnd, 1
|
2014-11-13 09:44:51 +08:00
|
|
|
// idx = ext op
|
2014-02-07 05:44:56 +08:00
|
|
|
// addr = gep base, idx
|
|
|
|
// is now:
|
2014-11-13 09:44:51 +08:00
|
|
|
// promotedOpnd = ext opnd <- no match here
|
2014-02-07 05:44:56 +08:00
|
|
|
// op = promoted_add promotedOpnd, 1 <- match (later in recursive calls)
|
|
|
|
// addr = gep base, op <- match
|
|
|
|
if (MovedAway)
|
|
|
|
*MovedAway = true;
|
|
|
|
|
|
|
|
assert(PromotedOperand &&
|
|
|
|
"TypePromotionHelper should have filtered out those cases");
|
|
|
|
|
|
|
|
ExtAddrMode BackupAddrMode = AddrMode;
|
|
|
|
unsigned OldSize = AddrModeInsts.size();
|
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
if (!matchAddr(PromotedOperand, Depth) ||
|
2015-10-10 02:01:03 +08:00
|
|
|
// The total of the new cost is equal to the cost of the created
|
2015-03-11 05:48:15 +08:00
|
|
|
// instructions.
|
2015-10-10 02:01:03 +08:00
|
|
|
// The total of the old cost is equal to the cost of the extension plus
|
2015-03-11 05:48:15 +08:00
|
|
|
// what we have saved in the addressing mode.
|
2015-09-22 07:03:16 +08:00
|
|
|
!isPromotionProfitable(CreatedInstsCost,
|
2015-03-11 05:48:15 +08:00
|
|
|
ExtCost + (AddrModeInsts.size() - OldSize),
|
2014-02-15 06:23:22 +08:00
|
|
|
PromotedOperand)) {
|
2014-02-07 05:44:56 +08:00
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
|
|
|
DEBUG(dbgs() << "Sign extension does not pay off: rollback\n");
|
|
|
|
TPT.rollback(LastKnownGood);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
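The restoration-point pattern used in the SExt/ZExt case above (take a point, make speculative changes, roll back if they do not pay off) can be sketched in isolation. This is a minimal illustration under stated assumptions, not LLVM's TypePromotionTransaction: `ToyTransaction`, `record`, and `tryBump` are hypothetical names, and the "restoration point" is simply an index into an undo log.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a transaction log: every recorded action
// carries the closure that undoes it.
class ToyTransaction {
  std::vector<std::function<void()>> Undo;

public:
  using RestorationPt = std::size_t;

  // A restoration point is the current length of the undo log.
  RestorationPt getRestorationPoint() const { return Undo.size(); }

  // Apply a change now and remember how to revert it.
  void record(const std::function<void()> &Do, std::function<void()> Revert) {
    Do();
    Undo.push_back(std::move(Revert));
  }

  // Undo, newest first, everything recorded after Pt.
  void rollback(RestorationPt Pt) {
    while (Undo.size() > Pt) {
      Undo.back()();
      Undo.pop_back();
    }
  }

  // Keep all changes; they can no longer be undone.
  void commit() { Undo.clear(); }
};

// Speculatively add Delta to Value twice. Keep the result only if it stays
// within Limit; otherwise restore the state seen on entry.
inline bool tryBump(ToyTransaction &T, int &Value, int Delta, int Limit) {
  ToyTransaction::RestorationPt LastKnownGood = T.getRestorationPoint();
  for (int I = 0; I < 2; ++I)
    T.record([&Value, Delta] { Value += Delta; },
             [&Value, Delta] { Value -= Delta; });
  if (Value > Limit) {
    T.rollback(LastKnownGood);
    return false;
  }
  return true;
}
```

Because a restoration point is only an index, nested attempts compose: an inner failure rolls back to its own point and leaves earlier, still-speculative changes in place for the outer caller to keep or discard.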
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// If we can, try to add the value of 'Addr' into the current addressing mode.
|
|
|
|
/// If Addr can't be added to AddrMode this returns false and leaves AddrMode
|
|
|
|
/// unmodified. This assumes that Addr is either a pointer type or intptr_t
|
|
|
|
/// for the target.
|
2013-01-05 10:09:22 +08:00
|
|
|
///
|
2015-09-22 07:03:16 +08:00
|
|
|
bool AddressingModeMatcher::matchAddr(Value *Addr, unsigned Depth) {
|
2014-02-07 05:44:56 +08:00
|
|
|
// Start a transaction at this point that we will roll back if the matching
|
|
|
|
// fails.
|
|
|
|
TypePromotionTransaction::ConstRestorationPt LastKnownGood =
|
|
|
|
TPT.getRestorationPoint();
|
2013-01-05 10:09:22 +08:00
|
|
|
if (ConstantInt *CI = dyn_cast<ConstantInt>(Addr)) {
|
|
|
|
// Fold in immediates if legal for the target.
|
|
|
|
AddrMode.BaseOffs += CI->getSExtValue();
|
2015-07-09 10:09:40 +08:00
|
|
|
if (TLI.isLegalAddressingMode(DL, AddrMode, AccessTy, AddrSpace))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
AddrMode.BaseOffs -= CI->getSExtValue();
|
|
|
|
} else if (GlobalValue *GV = dyn_cast<GlobalValue>(Addr)) {
|
|
|
|
// If this is a global variable, try to fold it into the addressing mode.
|
2014-04-14 08:51:57 +08:00
|
|
|
if (!AddrMode.BaseGV) {
|
2013-01-05 10:09:22 +08:00
|
|
|
AddrMode.BaseGV = GV;
|
2015-07-09 10:09:40 +08:00
|
|
|
if (TLI.isLegalAddressingMode(DL, AddrMode, AccessTy, AddrSpace))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
2014-04-14 08:51:57 +08:00
|
|
|
AddrMode.BaseGV = nullptr;
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
} else if (Instruction *I = dyn_cast<Instruction>(Addr)) {
|
|
|
|
ExtAddrMode BackupAddrMode = AddrMode;
|
|
|
|
unsigned OldSize = AddrModeInsts.size();
|
|
|
|
|
|
|
|
// Check to see if it is possible to fold this operation.
|
2014-02-07 05:44:56 +08:00
|
|
|
bool MovedAway = false;
|
2015-09-22 07:03:16 +08:00
|
|
|
if (matchOperationAddr(I, I->getOpcode(), Depth, &MovedAway)) {
|
2015-10-10 02:01:03 +08:00
|
|
|
// This instruction may have been moved away. If so, there is nothing
|
2014-02-07 05:44:56 +08:00
|
|
|
// to check here.
|
|
|
|
if (MovedAway)
|
|
|
|
return true;
|
2013-01-05 10:09:22 +08:00
|
|
|
// Okay, it's possible to fold this. Check to see if it is actually
|
|
|
|
// *profitable* to do so. We use a simple cost model to avoid increasing
|
|
|
|
// register pressure too much.
|
|
|
|
if (I->hasOneUse() ||
|
2015-09-22 07:03:16 +08:00
|
|
|
isProfitableToFoldIntoAddressingMode(I, BackupAddrMode, AddrMode)) {
|
2013-01-05 10:09:22 +08:00
|
|
|
AddrModeInsts.push_back(I);
|
|
|
|
return true;
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// It isn't profitable to do this, roll back.
|
|
|
|
//cerr << "NOT FOLDING: " << *I;
|
|
|
|
AddrMode = BackupAddrMode;
|
|
|
|
AddrModeInsts.resize(OldSize);
|
2014-02-07 05:44:56 +08:00
|
|
|
TPT.rollback(LastKnownGood);
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
} else if (ConstantExpr *CE = dyn_cast<ConstantExpr>(Addr)) {
|
2015-09-22 07:03:16 +08:00
|
|
|
if (matchOperationAddr(CE, CE->getOpcode(), Depth))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
2014-02-07 05:44:56 +08:00
|
|
|
TPT.rollback(LastKnownGood);
|
2013-01-05 10:09:22 +08:00
|
|
|
} else if (isa<ConstantPointerNull>(Addr)) {
|
|
|
|
// Null pointer gets folded without affecting the addressing mode.
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Worst case, the target should support [reg] addressing modes. :)
|
|
|
|
if (!AddrMode.HasBaseReg) {
|
|
|
|
AddrMode.HasBaseReg = true;
|
|
|
|
AddrMode.BaseReg = Addr;
|
|
|
|
// Still check for legality in case the target supports [imm] but not [i+r].
|
2015-07-09 10:09:40 +08:00
|
|
|
if (TLI.isLegalAddressingMode(DL, AddrMode, AccessTy, AddrSpace))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
AddrMode.HasBaseReg = false;
|
2014-04-14 08:51:57 +08:00
|
|
|
AddrMode.BaseReg = nullptr;
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// If the base register is already taken, see if we can do [r+r].
|
|
|
|
if (AddrMode.Scale == 0) {
|
|
|
|
AddrMode.Scale = 1;
|
|
|
|
AddrMode.ScaledReg = Addr;
|
2015-07-09 10:09:40 +08:00
|
|
|
if (TLI.isLegalAddressingMode(DL, AddrMode, AccessTy, AddrSpace))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
AddrMode.Scale = 0;
|
2014-04-14 08:51:57 +08:00
|
|
|
AddrMode.ScaledReg = nullptr;
|
2013-01-05 10:09:22 +08:00
|
|
|
}
|
|
|
|
// Couldn't match.
|
2014-02-07 05:44:56 +08:00
|
|
|
TPT.rollback(LastKnownGood);
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
|
|
|
}
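matchAddr repeatedly applies one idiom: tentatively update the addressing mode, ask the target whether the result is legal, and undo the update if it is not. The sketch below shows that idiom on its own. Everything in it is an assumption made for illustration: `ToyAddrMode` is a cut-down stand-in for ExtAddrMode, and `isLegalToyAddrMode` is an invented rule (8-bit signed offset, scale of 0, 1, 2, 4 or 8), not any real target's `isLegalAddressingMode`.

```cpp
#include <cstdint>

// Hypothetical, reduced addressing mode: [BaseReg + Scale*ScaledReg + BaseOffs].
struct ToyAddrMode {
  std::int64_t BaseOffs = 0;
  bool HasBaseReg = false;
  std::int64_t Scale = 0; // 0 means "no scaled register"
};

// Invented legality rule for the sketch, not a real target hook.
inline bool isLegalToyAddrMode(const ToyAddrMode &AM) {
  bool OffsetOk = AM.BaseOffs >= -128 && AM.BaseOffs <= 127;
  bool ScaleOk = AM.Scale == 0 || AM.Scale == 1 || AM.Scale == 2 ||
                 AM.Scale == 4 || AM.Scale == 8;
  return OffsetOk && ScaleOk;
}

// Fold an immediate if the result is legal; otherwise leave AM unmodified.
inline bool tryFoldImmediate(ToyAddrMode &AM, std::int64_t Imm) {
  AM.BaseOffs += Imm;
  if (isLegalToyAddrMode(AM))
    return true;
  AM.BaseOffs -= Imm; // undo the tentative update
  return false;
}

// Fallback order for an arbitrary value: first try it as the base register,
// then as a scaled register with scale 1 (the [r+r] form).
inline bool tryFoldRegister(ToyAddrMode &AM) {
  if (!AM.HasBaseReg) {
    AM.HasBaseReg = true;
    if (isLegalToyAddrMode(AM))
      return true;
    AM.HasBaseReg = false;
  }
  if (AM.Scale == 0) {
    AM.Scale = 1;
    if (isLegalToyAddrMode(AM))
      return true;
    AM.Scale = 0;
  }
  return false; // couldn't match
}
```

Each helper either returns true with the mode extended or returns false with the mode exactly as it was on entry, which is the contract the doc comment on matchAddr states.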
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Check to see if all uses of OpVal by the specified inline asm call are due
|
|
|
|
/// to memory operands. If so, return true, otherwise return false.
|
2013-01-05 10:09:22 +08:00
|
|
|
static bool IsOperandAMemoryOperand(CallInst *CI, InlineAsm *IA, Value *OpVal,
|
2015-02-27 06:38:43 +08:00
|
|
|
const TargetMachine &TM) {
|
|
|
|
const Function *F = CI->getParent()->getParent();
|
|
|
|
const TargetLowering *TLI = TM.getSubtargetImpl(*F)->getTargetLowering();
|
|
|
|
const TargetRegisterInfo *TRI = TM.getSubtargetImpl(*F)->getRegisterInfo();
|
2015-02-27 06:38:34 +08:00
|
|
|
TargetLowering::AsmOperandInfoVector TargetConstraints =
|
2015-07-08 03:07:19 +08:00
|
|
|
TLI->ParseConstraints(F->getParent()->getDataLayout(), TRI,
|
|
|
|
ImmutableCallSite(CI));
|
2013-01-05 10:09:22 +08:00
|
|
|
for (unsigned i = 0, e = TargetConstraints.size(); i != e; ++i) {
|
|
|
|
TargetLowering::AsmOperandInfo &OpInfo = TargetConstraints[i];
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Compute the constraint code and ConstraintType to use.
|
2015-02-27 06:38:43 +08:00
|
|
|
TLI->ComputeConstraintToUse(OpInfo, SDValue());
|
2013-01-05 10:09:22 +08:00
|
|
|
|
|
|
|
// If this asm operand is our Value*, and if it isn't an indirect memory
|
|
|
|
// operand, we can't fold it!
|
|
|
|
if (OpInfo.CallOperandVal == OpVal &&
|
|
|
|
(OpInfo.ConstraintType != TargetLowering::C_Memory ||
|
|
|
|
!OpInfo.isIndirect))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Recursively walk all the uses of I until we find a memory use.
|
|
|
|
/// If we find an obviously non-foldable instruction, return true.
|
2013-01-05 10:09:22 +08:00
|
|
|
/// Add the ultimately found memory instructions to MemoryUses.
|
2015-02-27 06:38:43 +08:00
|
|
|
static bool FindAllMemoryUses(
|
|
|
|
Instruction *I,
|
|
|
|
SmallVectorImpl<std::pair<Instruction *, unsigned>> &MemoryUses,
|
|
|
|
SmallPtrSetImpl<Instruction *> &ConsideredInsts, const TargetMachine &TM) {
|
2013-01-05 10:09:22 +08:00
|
|
|
// If we already considered this instruction, we're done.
|
2014-11-19 15:49:26 +08:00
|
|
|
if (!ConsideredInsts.insert(I).second)
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If this is an obviously unfoldable instruction, bail out.
|
|
|
|
if (!MightBeFoldableInst(I))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
// Loop over all the uses, recursively processing them.
|
2014-03-09 11:16:01 +08:00
|
|
|
for (Use &U : I->uses()) {
|
|
|
|
Instruction *UserI = cast<Instruction>(U.getUser());
|
2013-01-05 10:09:22 +08:00
|
|
|
|
2014-03-09 11:16:01 +08:00
|
|
|
if (LoadInst *LI = dyn_cast<LoadInst>(UserI)) {
|
|
|
|
MemoryUses.push_back(std::make_pair(LI, U.getOperandNo()));
|
2013-01-05 10:09:22 +08:00
|
|
|
continue;
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2014-03-09 11:16:01 +08:00
|
|
|
if (StoreInst *SI = dyn_cast<StoreInst>(UserI)) {
|
|
|
|
unsigned opNo = U.getOperandNo();
|
2013-01-05 10:09:22 +08:00
|
|
|
if (opNo == 0) return true; // Storing addr, not into addr.
|
|
|
|
MemoryUses.push_back(std::make_pair(SI, opNo));
|
|
|
|
continue;
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2014-03-09 11:16:01 +08:00
|
|
|
if (CallInst *CI = dyn_cast<CallInst>(UserI)) {
|
2013-01-05 10:09:22 +08:00
|
|
|
InlineAsm *IA = dyn_cast<InlineAsm>(CI->getCalledValue());
|
|
|
|
if (!IA) return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If this is a memory operand, we're cool, otherwise bail out.
|
2015-02-27 06:38:43 +08:00
|
|
|
if (!IsOperandAMemoryOperand(CI, IA, I, TM))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
continue;
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2015-02-27 06:38:43 +08:00
|
|
|
if (FindAllMemoryUses(UserI, MemoryUses, ConsideredInsts, TM))
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
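The shape of the walk in FindAllMemoryUses (a visited set to cut cycles, early exit on an unfoldable instruction, memory users collected as leaves) can be shown on a toy use graph. This is only a sketch: `ToyNode`, `ToyKind`, and the integer node ids are hypothetical, and the inline-asm and store-operand special cases of the real function are left out.

```cpp
#include <set>
#include <vector>

// Hypothetical node kinds: Arithmetic may fold into an address, Memory is a
// load/store that consumes the address, Opaque is anything unfoldable.
enum class ToyKind { Arithmetic, Memory, Opaque };

struct ToyNode {
  ToyKind Kind;
  std::vector<int> Users; // indices of the nodes that use this one
};

// Returns true if some transitive user makes folding impossible; otherwise
// returns false and appends every memory user reached to MemoryUses.
inline bool findAllMemoryUses(const std::vector<ToyNode> &G, int N,
                              std::vector<int> &MemoryUses,
                              std::set<int> &Considered) {
  // Already considered: nothing new to learn from this node.
  if (!Considered.insert(N).second)
    return false;

  // An obviously unfoldable node poisons the whole chain.
  if (G[N].Kind == ToyKind::Opaque)
    return true;

  for (int U : G[N].Users) {
    if (G[U].Kind == ToyKind::Memory) {
      MemoryUses.push_back(U); // leaf: record it and do not recurse
      continue;
    }
    if (findAllMemoryUses(G, U, MemoryUses, Considered))
      return true;
  }
  return false;
}
```

Note the inverted sense of the return value, which mirrors the original: true means "give up", false means "all uses seen so far are fine".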
|
|
|
|
|
2015-10-10 02:01:03 +08:00
|
|
|
/// Return true if Val is already known to be live at the use site that we're
|
|
|
|
/// folding it into. If so, there is no cost to include it in the addressing
|
|
|
|
/// mode. KnownLive1 and KnownLive2 are two values that we know are live at the
|
|
|
|
/// instruction already.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool AddressingModeMatcher::valueAlreadyLiveAtInst(Value *Val,Value *KnownLive1,
|
2013-01-05 10:09:22 +08:00
|
|
|
Value *KnownLive2) {
|
|
|
|
// If Val is either of the known-live values, we know it is live!
|
2014-04-14 08:51:57 +08:00
|
|
|
if (Val == nullptr || Val == KnownLive1 || Val == KnownLive2)
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// All values other than instructions and arguments (e.g. constants) are live.
|
|
|
|
if (!isa<Instruction>(Val) && !isa<Argument>(Val)) return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If Val is a constant-sized alloca in the entry block, it is live. This is
|
|
|
|
// true because it is just a reference to the stack/frame pointer, which is
|
|
|
|
// live for the whole function.
|
|
|
|
if (AllocaInst *AI = dyn_cast<AllocaInst>(Val))
|
|
|
|
if (AI->isStaticAlloca())
|
|
|
|
return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Check to see if this value is already used in the memory instruction's
|
|
|
|
// block. If so, it's already live into the block at the very least, so we
|
|
|
|
// can reasonably fold it.
|
|
|
|
return Val->isUsedInBasicBlock(MemoryInst->getParent());
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// It is possible for the addressing mode of the machine to fold the specified
|
|
|
|
/// instruction into a load or store that ultimately uses it.
|
|
|
|
/// However, the specified instruction has multiple uses.
|
|
|
|
/// Given this, it may actually increase register pressure to fold it
|
|
|
|
/// into the load. For example, consider this code:
|
2013-01-05 10:09:22 +08:00
|
|
|
///
|
|
|
|
/// X = ...
|
|
|
|
/// Y = X+1
|
|
|
|
/// use(Y) -> nonload/store
|
|
|
|
/// Z = Y+1
|
|
|
|
/// load Z
|
|
|
|
///
|
|
|
|
/// In this case, Y has multiple uses, and can be folded into the load of Z
|
|
|
|
/// (yielding load [X+2]). However, doing this will cause both "X" and "X+1" to
|
|
|
|
/// be live at the use(Y) line. If we don't fold Y into load Z, we use one
|
|
|
|
/// fewer register. Since Y can't be folded into "use(Y)" we don't increase the
|
|
|
|
/// number of computations either.
|
|
|
|
///
|
|
|
|
/// Note that this (like most of CodeGenPrepare) is just a rough heuristic. If
|
|
|
|
/// X was live across 'load Z' for other reasons, we actually *would* want to
|
|
|
|
/// fold the addressing mode in the Z case. This would make Y die earlier.
|
|
|
|
bool AddressingModeMatcher::
|
2015-09-22 07:03:16 +08:00
|
|
|
isProfitableToFoldIntoAddressingMode(Instruction *I, ExtAddrMode &AMBefore,
|
2013-01-05 10:09:22 +08:00
|
|
|
ExtAddrMode &AMAfter) {
|
|
|
|
if (IgnoreProfitability) return true;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// AMBefore is the addressing mode before this instruction was folded into it,
|
|
|
|
// and AMAfter is the addressing mode after the instruction was folded. Get
|
|
|
|
// the set of registers referenced by AMAfter and subtract out those
|
|
|
|
// referenced by AMBefore: this is the set of values which folding in this
|
|
|
|
// address extends the lifetime of.
|
|
|
|
//
|
|
|
|
// Note that there are only two potential values being referenced here,
|
|
|
|
// BaseReg and ScaleReg (global addresses are always available, as are any
|
|
|
|
// folded immediates).
|
|
|
|
Value *BaseReg = AMAfter.BaseReg, *ScaledReg = AMAfter.ScaledReg;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If the BaseReg or ScaledReg was referenced by the previous addrmode, their
|
|
|
|
// lifetime wasn't extended by adding this instruction.
|
2015-09-22 07:03:16 +08:00
|
|
|
if (valueAlreadyLiveAtInst(BaseReg, AMBefore.BaseReg, AMBefore.ScaledReg))
|
2014-04-14 08:51:57 +08:00
|
|
|
BaseReg = nullptr;
|
2015-09-22 07:03:16 +08:00
|
|
|
if (valueAlreadyLiveAtInst(ScaledReg, AMBefore.BaseReg, AMBefore.ScaledReg))
|
2014-04-14 08:51:57 +08:00
|
|
|
ScaledReg = nullptr;
|
2013-01-05 10:09:22 +08:00
|
|
|
|
|
|
|
// If folding this instruction (and its subexprs) didn't extend any live
|
|
|
|
// ranges, we're ok with it.
|
2014-04-14 08:51:57 +08:00
|
|
|
if (!BaseReg && !ScaledReg)
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
|
|
|
|
// If all uses of this instruction are ultimately load/store/inlineasm's,
|
|
|
|
// check to see if their addressing modes will include this instruction. If
|
|
|
|
// so, we can fold it into all uses, so it doesn't matter if it has multiple
|
|
|
|
// uses.
|
|
|
|
SmallVector<std::pair<Instruction*,unsigned>, 16> MemoryUses;
|
|
|
|
SmallPtrSet<Instruction*, 16> ConsideredInsts;
|
2015-02-27 06:38:43 +08:00
|
|
|
if (FindAllMemoryUses(I, MemoryUses, ConsideredInsts, TM))
|
2013-01-05 10:09:22 +08:00
|
|
|
return false; // Has a non-memory, non-foldable use!
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Now that we know that all uses of this instruction are part of a chain of
|
|
|
|
// computation involving only operations that could theoretically be folded
|
|
|
|
// into a memory use, loop over each of these uses and see if they could
|
|
|
|
// *actually* fold the instruction.
|
|
|
|
SmallVector<Instruction*, 32> MatchedAddrModeInsts;
|
|
|
|
for (unsigned i = 0, e = MemoryUses.size(); i != e; ++i) {
|
|
|
|
Instruction *User = MemoryUses[i].first;
|
|
|
|
unsigned OpNo = MemoryUses[i].second;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Get the access type of this use. If the use isn't a pointer, we don't
|
|
|
|
// know what it accesses.
|
|
|
|
Value *Address = User->getOperand(OpNo);
|
2015-06-05 00:17:38 +08:00
|
|
|
PointerType *AddrTy = dyn_cast<PointerType>(Address->getType());
|
|
|
|
if (!AddrTy)
|
2013-01-05 10:09:22 +08:00
|
|
|
return false;
|
2015-06-05 00:17:38 +08:00
|
|
|
Type *AddressAccessTy = AddrTy->getElementType();
|
|
|
|
unsigned AS = AddrTy->getAddressSpace();
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// Do a match against the root of this address, ignoring profitability. This
|
|
|
|
// will tell us if the addressing mode for the memory operation will
|
|
|
|
// *actually* cover the shared instruction.
|
|
|
|
ExtAddrMode Result;
|
2014-02-11 09:59:02 +08:00
|
|
|
TypePromotionTransaction::ConstRestorationPt LastKnownGood =
|
|
|
|
TPT.getRestorationPoint();
|
2015-06-05 00:17:38 +08:00
|
|
|
AddressingModeMatcher Matcher(MatchedAddrModeInsts, TM, AddressAccessTy, AS,
|
2015-06-18 04:44:32 +08:00
|
|
|
MemoryInst, Result, InsertedInsts,
|
2014-02-07 05:44:56 +08:00
|
|
|
PromotedInsts, TPT);
|
2013-01-05 10:09:22 +08:00
|
|
|
Matcher.IgnoreProfitability = true;
|
2015-09-22 07:03:16 +08:00
|
|
|
bool Success = Matcher.matchAddr(Address, 0);
|
2013-01-05 10:09:22 +08:00
|
|
|
(void)Success; assert(Success && "Couldn't select *anything*?");
|
|
|
|
|
2014-02-11 09:59:02 +08:00
|
|
|
// The match was only made to check profitability; the changes it made are
|
|
|
|
// not part of the original matcher. Therefore, they should be dropped;
|
|
|
|
// otherwise the original matcher will not present the right state.
|
|
|
|
TPT.rollback(LastKnownGood);
|
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
// If the match didn't cover I, then it won't be shared by it.
|
|
|
|
if (std::find(MatchedAddrModeInsts.begin(), MatchedAddrModeInsts.end(),
|
|
|
|
I) == MatchedAddrModeInsts.end())
|
|
|
|
return false;
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
MatchedAddrModeInsts.clear();
|
|
|
|
}
|
2013-07-16 01:55:02 +08:00
|
|
|
|
2013-01-05 10:09:22 +08:00
|
|
|
return true;
|
|
|
|
}
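The first step of isProfitableToFoldIntoAddressingMode is a small set difference: the registers referenced after folding, minus those already referenced before folding, are the values whose lifetime the fold may extend. The sketch below covers only that step, under stated assumptions: `ToyMode` and `newlyReferenced` are hypothetical, registers are named by strings with an empty string meaning "unused", and the additional liveness checks done by valueAlreadyLiveAtInst are not modelled.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical addressing mode holding only the two register slots.
struct ToyMode {
  std::string BaseReg;   // empty means unused
  std::string ScaledReg; // empty means unused
};

// Registers used by After that Before did not already use.
inline std::vector<std::string> newlyReferenced(const ToyMode &Before,
                                                const ToyMode &After) {
  std::vector<std::string> Out;
  for (const std::string *R : {&After.BaseReg, &After.ScaledReg}) {
    if (R->empty() || *R == Before.BaseReg || *R == Before.ScaledReg)
      continue; // unused slot, or already live because Before referenced it
    Out.push_back(*R);
  }
  return Out;
}
```

An empty result corresponds to the early `return true` in the original: if folding extends no live range, the fold costs nothing in register pressure.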
|
|
|
|
|
|
|
|
} // end anonymous namespace
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Return true if the specified values are defined in a
|
Completely rewrite addressing-mode related sinking of code. In particular,
this fixes problems where codegenprepare would sink expressions into load/stores
that are not valid, and fixes cases where it would miss important valid ones.
This fixes several serious codesize and perf issues, particularly on targets
with complex addressing modes like arm and x86. For example, now we compile
CodeGen/X86/isel-sink.ll to:
_test:
movl 8(%esp), %eax
movl 4(%esp), %ecx
cmpl $1233, %eax
ja LBB1_2 #F
LBB1_1: #T
movl $4, (%ecx,%eax,4)
movl $141, %eax
ret
LBB1_2: #F
movl (%ecx,%eax,4), %eax
ret
instead of:
_test:
movl 8(%esp), %eax
leal (,%eax,4), %ecx
addl 4(%esp), %ecx
cmpl $1233, %eax
ja LBB1_2 #F
LBB1_1: #T
movl $4, (%ecx)
movl $141, %eax
ret
LBB1_2: #F
movl (%ecx), %eax
ret
llvm-svn: 35970
2007-04-14 04:30:56 +08:00
|
|
|
/// different basic block than BB.
|
|
|
|
static bool IsNonLocalValue(Value *V, BasicBlock *BB) {
|
|
|
|
if (Instruction *I = dyn_cast<Instruction>(V))
|
|
|
|
return I->getParent() != BB;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Load and Store Instructions often have addressing modes that can do
|
|
|
|
/// significant amounts of computation. As such, instruction selection will try
|
|
|
|
/// to get the load or store to do as much computation as possible for the
|
|
|
|
/// program. The problem is that isel can only see within a single block. As
|
|
|
|
/// such, we sink as much legal addressing mode work into the block as possible.
|
2008-11-25 15:09:13 +08:00
|
|
|
///
|
|
|
|
/// This method is used to optimize both load/store and inline asms with memory
|
|
|
|
/// operands.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeMemoryInst(Instruction *MemoryInst, Value *Addr,
|
2015-06-05 00:17:38 +08:00
|
|
|
Type *AccessTy, unsigned AddrSpace) {
|
2010-11-27 16:15:55 +08:00
|
|
|
Value *Repl = Addr;
|
2012-07-24 18:51:42 +08:00
|
|
|
|
|
|
|
// Try to collapse single-value PHI nodes. This is necessary to undo
|
2010-11-20 06:15:03 +08:00
|
|
|
// unprofitable PRE transformations.
|
2011-01-03 14:33:01 +08:00
|
|
|
SmallVector<Value*, 8> worklist;
|
|
|
|
SmallPtrSet<Value*, 16> Visited;
|
2010-11-27 16:15:55 +08:00
|
|
|
worklist.push_back(Addr);
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-11-27 16:15:55 +08:00
|
|
|
// Use a worklist to iteratively look through PHI nodes, and ensure that
|
|
|
|
// the addressing mode obtained from the non-PHI roots of the graph
|
|
|
|
// are equivalent.
|
2014-04-14 08:51:57 +08:00
|
|
|
Value *Consensus = nullptr;
|
2011-03-02 05:13:53 +08:00
|
|
|
unsigned NumUsesConsensus = 0;
|
2011-03-05 16:12:26 +08:00
|
|
|
bool IsNumUsesConsensusValid = false;
|
2010-11-27 16:15:55 +08:00
|
|
|
SmallVector<Instruction*, 16> AddrModeInsts;
|
|
|
|
ExtAddrMode AddrMode;
|
2014-02-07 05:44:56 +08:00
|
|
|
TypePromotionTransaction TPT;
|
|
|
|
TypePromotionTransaction::ConstRestorationPt LastKnownGood =
|
|
|
|
TPT.getRestorationPoint();
|
2010-11-27 16:15:55 +08:00
|
|
|
while (!worklist.empty()) {
|
|
|
|
Value *V = worklist.back();
|
|
|
|
worklist.pop_back();
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-11-27 16:15:55 +08:00
|
|
|
// Break use-def graph loops.
|
2014-11-19 15:49:26 +08:00
|
|
|
if (!Visited.insert(V).second) {
|
2014-04-14 08:51:57 +08:00
|
|
|
Consensus = nullptr;
|
2010-11-27 16:15:55 +08:00
|
|
|
break;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-11-27 16:15:55 +08:00
|
|
|
// For a PHI node, push all of its incoming values.
|
|
|
|
if (PHINode *P = dyn_cast<PHINode>(V)) {
|
2015-05-13 04:05:31 +08:00
|
|
|
for (Value *IncValue : P->incoming_values())
|
|
|
|
worklist.push_back(IncValue);
|
2010-11-27 16:15:55 +08:00
|
|
|
continue;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-11-27 16:15:55 +08:00
|
|
|
// For non-PHIs, determine the addressing mode being computed.
|
|
|
|
SmallVector<Instruction*, 16> NewAddrModeInsts;
|
2014-02-07 05:44:56 +08:00
|
|
|
ExtAddrMode NewAddrMode = AddressingModeMatcher::Match(
|
2015-06-05 00:17:38 +08:00
|
|
|
V, AccessTy, AddrSpace, MemoryInst, NewAddrModeInsts, *TM,
|
2015-06-18 04:44:32 +08:00
|
|
|
InsertedInsts, PromotedInsts, TPT);
|
2011-03-05 16:12:26 +08:00
|
|
|
|
|
|
|
// This check is broken into two cases with very similar code to avoid using
|
|
|
|
// getNumUses() as much as possible. Some values have a lot of uses, so
|
|
|
|
// calling getNumUses() unconditionally caused a significant compile-time
|
|
|
|
// regression.
|
|
|
|
if (!Consensus) {
|
|
|
|
Consensus = V;
|
|
|
|
AddrMode = NewAddrMode;
|
|
|
|
AddrModeInsts = NewAddrModeInsts;
|
|
|
|
continue;
|
|
|
|
} else if (NewAddrMode == AddrMode) {
|
|
|
|
if (!IsNumUsesConsensusValid) {
|
|
|
|
NumUsesConsensus = Consensus->getNumUses();
|
|
|
|
IsNumUsesConsensusValid = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Ensure that the obtained addressing mode is equivalent to that obtained
|
|
|
|
// for all other roots of the PHI traversal. Also, when choosing one
|
|
|
|
// such root as representative, select the one with the most uses in order
|
|
|
|
// to keep the cost modeling heuristics in AddressingModeMatcher
|
|
|
|
// applicable.
|
2011-03-02 05:13:53 +08:00
|
|
|
unsigned NumUses = V->getNumUses();
|
|
|
|
if (NumUses > NumUsesConsensus) {
|
2010-11-27 16:15:55 +08:00
|
|
|
Consensus = V;
|
2011-03-02 05:13:53 +08:00
|
|
|
NumUsesConsensus = NumUses;
|
2010-11-27 16:15:55 +08:00
|
|
|
AddrModeInsts = NewAddrModeInsts;
|
2010-11-20 06:15:03 +08:00
|
|
|
}
|
2010-11-27 16:15:55 +08:00
|
|
|
continue;
|
2010-11-20 06:15:03 +08:00
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2014-04-14 08:51:57 +08:00
|
|
|
Consensus = nullptr;
|
2010-11-27 16:15:55 +08:00
|
|
|
break;
|
2010-11-20 06:15:03 +08:00
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2010-11-27 16:15:55 +08:00
|
|
|
// If the addressing mode couldn't be determined, or if multiple different
|
|
|
|
// ones were determined, bail out now.
|
2014-02-07 05:44:56 +08:00
|
|
|
if (!Consensus) {
|
|
|
|
TPT.rollback(LastKnownGood);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
TPT.commit();
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// Check to see if any of the instructions subsumed by this addr mode are
|
|
|
|
// non-local to I's BB.
|
|
|
|
bool AnyNonLocal = false;
|
|
|
|
for (unsigned i = 0, e = AddrModeInsts.size(); i != e; ++i) {
|
2008-11-26 11:20:37 +08:00
|
|
|
if (IsNonLocalValue(AddrModeInsts[i], MemoryInst->getParent())) {
|
2007-04-14 04:30:56 +08:00
|
|
|
AnyNonLocal = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// If all the instructions matched are already in this BB, don't do anything.
|
|
|
|
if (!AnyNonLocal) {
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "CGP: Found local addrmode: " << AddrMode << "\n");
|
2007-04-14 04:30:56 +08:00
|
|
|
return false;
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// Insert this computation right after this user. Since our caller is
|
|
|
|
// scanning from the top of the BB to the bottom, reuses of the expr are
|
|
|
|
// guaranteed to happen later.
|
2011-09-07 02:49:53 +08:00
|
|
|
IRBuilder<> Builder(MemoryInst);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// Now that we determined the addressing expression we want to use and know
|
|
|
|
// that we have to sink it into this block, check to see if we have already
|
|
|
|
// done this for some other load/store instr in this block. If so, reuse the
|
|
|
|
// computation.
|
|
|
|
Value *&SunkAddr = SunkAddrs[Addr];
|
|
|
|
if (SunkAddr) {
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "CGP: Reusing nonlocal addrmode: " << AddrMode << " for "
|
2014-05-14 05:54:22 +08:00
|
|
|
<< *MemoryInst << "\n");
|
2007-04-14 04:30:56 +08:00
|
|
|
if (SunkAddr->getType() != Addr->getType())
|
2011-09-28 04:39:19 +08:00
|
|
|
SunkAddr = Builder.CreateBitCast(SunkAddr, Addr->getType());
|
2015-01-27 09:01:38 +08:00
|
|
|
} else if (AddrSinkUsingGEPs ||
|
|
|
|
(!AddrSinkUsingGEPs.getNumOccurrences() && TM &&
|
2015-01-27 15:54:39 +08:00
|
|
|
TM->getSubtargetImpl(*MemoryInst->getParent()->getParent())
|
|
|
|
->useAA())) {
|
2014-04-12 08:59:48 +08:00
|
|
|
// By default, we use the GEP-based method when AA is used later. This
|
|
|
|
// prevents new inttoptr/ptrtoint pairs from degrading AA capabilities.
|
|
|
|
DEBUG(dbgs() << "CGP: SINKING nonlocal addrmode: " << AddrMode << " for "
|
2014-05-14 05:54:22 +08:00
|
|
|
<< *MemoryInst << "\n");
|
2015-07-08 02:45:17 +08:00
|
|
|
Type *IntPtrTy = DL->getIntPtrType(Addr->getType());
|
2014-04-14 08:51:57 +08:00
|
|
|
Value *ResultPtr = nullptr, *ResultIndex = nullptr;
|
2014-04-12 08:59:48 +08:00
|
|
|
|
|
|
|
// First, find the pointer.
|
|
|
|
if (AddrMode.BaseReg && AddrMode.BaseReg->getType()->isPointerTy()) {
|
|
|
|
ResultPtr = AddrMode.BaseReg;
|
2014-04-14 08:51:57 +08:00
|
|
|
AddrMode.BaseReg = nullptr;
|
2014-04-12 08:59:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (AddrMode.Scale && AddrMode.ScaledReg->getType()->isPointerTy()) {
|
|
|
|
// We can't add more than one pointer together, nor can we scale a
|
|
|
|
// pointer (both of which seem meaningless).
|
|
|
|
if (ResultPtr || AddrMode.Scale != 1)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
ResultPtr = AddrMode.ScaledReg;
|
|
|
|
AddrMode.Scale = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (AddrMode.BaseGV) {
|
|
|
|
if (ResultPtr)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
ResultPtr = AddrMode.BaseGV;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If the real base value actually came from an inttoptr, then the matcher
|
|
|
|
// will look through it and provide only the integer value. In that case,
|
|
|
|
// use it here.
|
|
|
|
if (!ResultPtr && AddrMode.BaseReg) {
|
|
|
|
ResultPtr =
|
|
|
|
Builder.CreateIntToPtr(AddrMode.BaseReg, Addr->getType(), "sunkaddr");
|
2014-04-14 08:51:57 +08:00
|
|
|
AddrMode.BaseReg = nullptr;
|
2014-04-12 08:59:48 +08:00
|
|
|
} else if (!ResultPtr && AddrMode.Scale == 1) {
|
|
|
|
ResultPtr =
|
|
|
|
Builder.CreateIntToPtr(AddrMode.ScaledReg, Addr->getType(), "sunkaddr");
|
|
|
|
AddrMode.Scale = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ResultPtr &&
|
|
|
|
!AddrMode.BaseReg && !AddrMode.Scale && !AddrMode.BaseOffs) {
|
|
|
|
SunkAddr = Constant::getNullValue(Addr->getType());
|
|
|
|
} else if (!ResultPtr) {
|
|
|
|
return false;
|
|
|
|
} else {
|
|
|
|
Type *I8PtrTy =
|
2015-03-31 04:42:56 +08:00
|
|
|
Builder.getInt8PtrTy(Addr->getType()->getPointerAddressSpace());
|
|
|
|
Type *I8Ty = Builder.getInt8Ty();
|
2014-04-12 08:59:48 +08:00
|
|
|
|
|
|
|
// Start with the base register. Do this first so that subsequent address
|
|
|
|
// matching finds it last, which will prevent it from trying to match it
|
|
|
|
// as the scaled value in case it happens to be a mul. That would be
|
|
|
|
// problematic if we've sunk a different mul for the scale, because then
|
|
|
|
// we'd end up sinking both muls.
|
|
|
|
if (AddrMode.BaseReg) {
|
|
|
|
Value *V = AddrMode.BaseReg;
|
|
|
|
if (V->getType() != IntPtrTy)
|
|
|
|
V = Builder.CreateIntCast(V, IntPtrTy, /*isSigned=*/true, "sunkaddr");
|
|
|
|
|
|
|
|
ResultIndex = V;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Add the scale value.
|
|
|
|
if (AddrMode.Scale) {
|
|
|
|
Value *V = AddrMode.ScaledReg;
|
|
|
|
if (V->getType() == IntPtrTy) {
|
|
|
|
// done.
|
|
|
|
} else if (cast<IntegerType>(IntPtrTy)->getBitWidth() <
|
|
|
|
cast<IntegerType>(V->getType())->getBitWidth()) {
|
|
|
|
V = Builder.CreateTrunc(V, IntPtrTy, "sunkaddr");
|
|
|
|
} else {
|
|
|
|
// It is only safe to sign extend the BaseReg if we know that the math
|
|
|
|
// required to create it did not overflow before we extend it. Since
|
|
|
|
// the original IR value was tossed in favor of a constant back when
|
|
|
|
// the AddrMode was created we need to bail out gracefully if widths
|
|
|
|
// do not match instead of extending it.
|
|
|
|
Instruction *I = dyn_cast_or_null<Instruction>(ResultIndex);
|
|
|
|
if (I && (ResultIndex != AddrMode.BaseReg))
|
|
|
|
I->eraseFromParent();
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (AddrMode.Scale != 1)
|
|
|
|
V = Builder.CreateMul(V, ConstantInt::get(IntPtrTy, AddrMode.Scale),
|
|
|
|
"sunkaddr");
|
|
|
|
if (ResultIndex)
|
|
|
|
ResultIndex = Builder.CreateAdd(ResultIndex, V, "sunkaddr");
|
|
|
|
else
|
|
|
|
ResultIndex = V;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Add in the Base Offset if present.
|
|
|
|
if (AddrMode.BaseOffs) {
|
|
|
|
Value *V = ConstantInt::get(IntPtrTy, AddrMode.BaseOffs);
|
|
|
|
if (ResultIndex) {
|
2014-10-29 23:23:11 +08:00
|
|
|
// We need to add this separately from the scale above to help with
|
|
|
|
// SDAG consecutive load/store merging.
|
2014-04-12 08:59:48 +08:00
|
|
|
if (ResultPtr->getType() != I8PtrTy)
|
|
|
|
ResultPtr = Builder.CreateBitCast(ResultPtr, I8PtrTy);
|
2015-03-31 04:42:56 +08:00
|
|
|
ResultPtr = Builder.CreateGEP(I8Ty, ResultPtr, ResultIndex, "sunkaddr");
|
2014-04-12 08:59:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ResultIndex = V;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ResultIndex) {
|
|
|
|
SunkAddr = ResultPtr;
|
|
|
|
} else {
|
|
|
|
if (ResultPtr->getType() != I8PtrTy)
|
|
|
|
ResultPtr = Builder.CreateBitCast(ResultPtr, I8PtrTy);
|
2015-03-31 04:42:56 +08:00
|
|
|
SunkAddr = Builder.CreateGEP(I8Ty, ResultPtr, ResultIndex, "sunkaddr");
|
2014-04-12 08:59:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (SunkAddr->getType() != Addr->getType())
|
|
|
|
SunkAddr = Builder.CreateBitCast(SunkAddr, Addr->getType());
|
|
|
|
}
|
2007-04-14 04:30:56 +08:00
|
|
|
} else {
|
2010-01-05 09:27:11 +08:00
|
|
|
DEBUG(dbgs() << "CGP: SINKING nonlocal addrmode: " << AddrMode << " for "
|
2014-05-14 05:54:22 +08:00
|
|
|
<< *MemoryInst << "\n");
|
2015-07-08 02:45:17 +08:00
|
|
|
Type *IntPtrTy = DL->getIntPtrType(Addr->getType());
|
2014-04-14 08:51:57 +08:00
|
|
|
Value *Result = nullptr;
|
2010-01-20 06:45:06 +08:00
|
|
|
|
|
|
|
// Start with the base register. Do this first so that subsequent address
|
|
|
|
// matching finds it last, which will prevent it from trying to match it
|
|
|
|
// as the scaled value in case it happens to be a mul. That would be
|
|
|
|
// problematic if we've sunk a different mul for the scale, because then
|
|
|
|
// we'd end up sinking both muls.
|
|
|
|
if (AddrMode.BaseReg) {
|
|
|
|
Value *V = AddrMode.BaseReg;
|
2010-02-16 19:11:14 +08:00
|
|
|
if (V->getType()->isPointerTy())
|
2011-09-07 02:49:53 +08:00
|
|
|
V = Builder.CreatePtrToInt(V, IntPtrTy, "sunkaddr");
|
2010-01-20 06:45:06 +08:00
|
|
|
if (V->getType() != IntPtrTy)
|
2011-09-07 02:49:53 +08:00
|
|
|
V = Builder.CreateIntCast(V, IntPtrTy, /*isSigned=*/true, "sunkaddr");
|
2010-01-20 06:45:06 +08:00
|
|
|
Result = V;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Add the scale value.
|
2007-04-14 04:30:56 +08:00
|
|
|
if (AddrMode.Scale) {
|
|
|
|
Value *V = AddrMode.ScaledReg;
|
|
|
|
if (V->getType() == IntPtrTy) {
|
|
|
|
// done.
|
2010-02-16 19:11:14 +08:00
|
|
|
} else if (V->getType()->isPointerTy()) {
|
2011-09-07 02:49:53 +08:00
|
|
|
V = Builder.CreatePtrToInt(V, IntPtrTy, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
} else if (cast<IntegerType>(IntPtrTy)->getBitWidth() <
|
|
|
|
cast<IntegerType>(V->getType())->getBitWidth()) {
|
2011-09-07 02:49:53 +08:00
|
|
|
V = Builder.CreateTrunc(V, IntPtrTy, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
} else {
|
2014-03-27 01:27:01 +08:00
|
|
|
// It is only safe to sign extend the BaseReg if we know that the math
|
|
|
|
// required to create it did not overflow before we extend it. Since
|
|
|
|
// the original IR value was tossed in favor of a constant back when
|
|
|
|
// the AddrMode was created we need to bail out gracefully if widths
|
|
|
|
// do not match instead of extending it.
|
2014-05-13 23:42:45 +08:00
|
|
|
Instruction *I = dyn_cast_or_null<Instruction>(Result);
|
2014-04-10 08:27:45 +08:00
|
|
|
if (I && (Result != AddrMode.BaseReg))
|
|
|
|
I->eraseFromParent();
|
2014-03-27 01:27:01 +08:00
|
|
|
return false;
|
2007-04-14 04:30:56 +08:00
|
|
|
}
|
|
|
|
if (AddrMode.Scale != 1)
|
2011-09-07 02:49:53 +08:00
|
|
|
V = Builder.CreateMul(V, ConstantInt::get(IntPtrTy, AddrMode.Scale),
|
|
|
|
"sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
if (Result)
|
2011-09-07 02:49:53 +08:00
|
|
|
Result = Builder.CreateAdd(Result, V, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
else
|
|
|
|
Result = V;
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// Add in the BaseGV if present.
|
|
|
|
if (AddrMode.BaseGV) {
|
2011-09-07 02:49:53 +08:00
|
|
|
Value *V = Builder.CreatePtrToInt(AddrMode.BaseGV, IntPtrTy, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
if (Result)
|
2011-09-07 02:49:53 +08:00
|
|
|
Result = Builder.CreateAdd(Result, V, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
else
|
|
|
|
Result = V;
|
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2007-04-14 04:30:56 +08:00
|
|
|
// Add in the Base Offset if present.
|
|
|
|
if (AddrMode.BaseOffs) {
|
2009-07-25 07:12:02 +08:00
|
|
|
Value *V = ConstantInt::get(IntPtrTy, AddrMode.BaseOffs);
|
2007-04-14 04:30:56 +08:00
|
|
|
if (Result)
|
2011-09-07 02:49:53 +08:00
|
|
|
Result = Builder.CreateAdd(Result, V, "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
else
|
|
|
|
Result = V;
|
|
|
|
}
|
2007-03-31 12:06:36 +08:00
|
|
|
|
2014-04-14 08:51:57 +08:00
|
|
|
if (!Result)
|
2009-08-01 04:28:14 +08:00
|
|
|
SunkAddr = Constant::getNullValue(Addr->getType());
|
2007-04-14 04:30:56 +08:00
|
|
|
else
|
2011-09-07 02:49:53 +08:00
|
|
|
SunkAddr = Builder.CreateIntToPtr(Result, Addr->getType(), "sunkaddr");
|
2007-04-14 04:30:56 +08:00
|
|
|
}
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2010-11-20 06:15:03 +08:00
|
|
|
MemoryInst->replaceUsesOfWith(Repl, SunkAddr);
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2011-04-09 15:05:44 +08:00
|
|
|
// If we have no uses, recursively delete the value and all dead instructions
|
|
|
|
// using it.
|
2010-11-20 06:15:03 +08:00
|
|
|
if (Repl->use_empty()) {
|
2011-04-09 15:05:44 +08:00
|
|
|
// This can cause recursive deletion, which can invalidate our iterator.
|
|
|
|
// Use a WeakVH to hold onto it in case this happens.
|
2015-10-10 02:44:40 +08:00
|
|
|
WeakVH IterHandle(&*CurInstIterator);
|
2011-04-09 15:05:44 +08:00
|
|
|
BasicBlock *BB = CurInstIterator->getParent();
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2012-08-29 23:32:21 +08:00
|
|
|
RecursivelyDeleteTriviallyDeadInstructions(Repl, TLInfo);
|
2011-04-09 15:05:44 +08:00
|
|
|
|
2015-10-10 02:44:40 +08:00
|
|
|
if (IterHandle != CurInstIterator.getNodePtrUnchecked()) {
|
2011-04-09 15:05:44 +08:00
|
|
|
// If the iterator instruction was recursively deleted, start over at the
|
|
|
|
// start of the block.
|
|
|
|
CurInstIterator = BB->begin();
|
|
|
|
SunkAddrs.clear();
|
2012-07-24 18:51:42 +08:00
|
|
|
}
|
2010-04-01 04:37:15 +08:00
|
|
|
}
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumMemoryInsts;
|
2007-04-14 04:30:56 +08:00
|
|
|
return true;
|
|
|
|
}
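The reuse logic in the function above (look up `SunkAddrs[Addr]`, reuse an existing sunk address, and clear the map when scanning restarts at the top of the block) can be sketched in isolation. This is a toy model under stated assumptions: `ToySinker`, its string keys and its integer ids are hypothetical stand-ins for the real `Value *` map, and no IR is built.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-block cache of sunk address computations.
struct ToySinker {
  // Maps an original address (named by a string here) to the id of the
  // computation that was already materialized for it in the current block.
  std::unordered_map<std::string, int> SunkAddrs;
  int NextId = 0;
  int Materialized = 0; // how many computations were actually emitted

  // Returns the id of the sunk computation for Addr, creating it only once.
  int sink(const std::string &Addr) {
    assert(!Addr.empty() && "address key must be non-empty");
    auto It = SunkAddrs.find(Addr);
    if (It != SunkAddrs.end())
      return It->second; // reuse the earlier computation
    ++Materialized;
    int Id = NextId++;
    SunkAddrs.emplace(Addr, Id);
    return Id;
  }

  // Models restarting the scan at the start of the block: cached entries may
  // refer to deleted instructions, so they are all dropped.
  void restartBlock() { SunkAddrs.clear(); }
};
```

Calling `sink` twice with the same key emits one computation; after `restartBlock` the next call emits a fresh one, which is the conservative choice once cached entries can no longer be trusted.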
|
2007-03-31 12:06:36 +08:00
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// If there are any memory operands, use optimizeMemoryInst to sink their
|
|
|
|
/// address computing into the block when possible / profitable.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeInlineAsmInst(CallInst *CS) {
|
2008-02-26 10:42:37 +08:00
|
|
|
bool MadeChange = false;
|
|
|
|
|
2015-02-27 06:38:43 +08:00
|
|
|
const TargetRegisterInfo *TRI =
|
|
|
|
TM->getSubtargetImpl(*CS->getParent()->getParent())->getRegisterInfo();
|
2015-07-08 03:07:19 +08:00
|
|
|
TargetLowering::AsmOperandInfoVector TargetConstraints =
|
|
|
|
TLI->ParseConstraints(*DL, TRI, CS);
|
2010-09-17 02:30:55 +08:00
|
|
|
unsigned ArgNo = 0;
|
2010-09-14 02:15:37 +08:00
|
|
|
for (unsigned i = 0, e = TargetConstraints.size(); i != e; ++i) {
|
|
|
|
TargetLowering::AsmOperandInfo &OpInfo = TargetConstraints[i];
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2008-02-26 10:42:37 +08:00
|
|
|
// Compute the constraint code and ConstraintType to use.
|
2010-06-26 05:55:36 +08:00
|
|
|
TLI->ComputeConstraintToUse(OpInfo, SDValue());
|
2008-02-26 10:42:37 +08:00
|
|
|
|
2008-02-27 02:37:49 +08:00
|
|
|
if (OpInfo.ConstraintType == TargetLowering::C_Memory &&
|
|
|
|
OpInfo.isIndirect) {
|
2011-01-15 15:14:54 +08:00
|
|
|
Value *OpVal = CS->getArgOperand(ArgNo++);
|
2015-09-22 07:03:16 +08:00
|
|
|
MadeChange |= optimizeMemoryInst(CS, OpVal, OpVal->getType(), ~0u);
|
2010-09-17 02:30:55 +08:00
|
|
|
} else if (OpInfo.Type == InlineAsm::isInput)
|
|
|
|
ArgNo++;
|
2008-02-26 10:42:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return MadeChange;
|
|
|
|
}
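The argument bookkeeping in the loop above can be sketched without any LLVM types. The names below (`ToyOperandInfo`, `indirectMemoryArgs` and the two enums) are hypothetical. The sketch only mirrors the rule that an indirect memory operand consumes one call argument and is handed to address sinking, any other input consumes one call argument, and everything else consumes none.

```cpp
#include <cassert>
#include <vector>

enum class ToyConstraintKind { Memory, Register, Other };
enum class ToyOperandType { Input, Output, Clobber };

// Hypothetical, reduced version of the per-operand constraint information.
struct ToyOperandInfo {
  ToyConstraintKind Kind;
  ToyOperandType Type;
  bool IsIndirect;
};

// Returns the call-argument indices whose addresses would be optimized.
std::vector<unsigned>
indirectMemoryArgs(const std::vector<ToyOperandInfo> &Ops) {
  std::vector<unsigned> Result;
  unsigned ArgNo = 0;
  for (const ToyOperandInfo &Op : Ops) {
    if (Op.Kind == ToyConstraintKind::Memory && Op.IsIndirect)
      Result.push_back(ArgNo++); // indirect memory operand: uses an argument
    else if (Op.Type == ToyOperandType::Input)
      ++ArgNo; // plain input: skip over its argument
  }
  // Each operand consumes at most one call argument.
  assert(ArgNo <= Ops.size());
  return Result;
}
```

For an operand list of a register output, a register input, an indirect memory output and a clobber, only call argument 1 is selected: the register output and the clobber do not advance the counter.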
|
|
|
|
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it
does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them into the
loads at the beginning of the chain of computation by promoting the whole chain
to the extended type. The promotion is done if and only if we do not introduce
new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
/// \brief Check if all the uses of \p Inst are equivalent (or free) zero or
/// sign extensions.
static bool hasSameExtUse(Instruction *Inst, const TargetLowering &TLI) {
  assert(!Inst->use_empty() && "Input must have at least one use");
  const Instruction *FirstUser = cast<Instruction>(*Inst->user_begin());
  bool IsSExt = isa<SExtInst>(FirstUser);
  Type *ExtTy = FirstUser->getType();
  for (const User *U : Inst->users()) {
    const Instruction *UI = cast<Instruction>(U);
    if ((IsSExt && !isa<SExtInst>(UI)) || (!IsSExt && !isa<ZExtInst>(UI)))
      return false;
    Type *CurTy = UI->getType();
    // Same input and output types: Same instruction after CSE.
    if (CurTy == ExtTy)
      continue;

    // If IsSExt is true, we are in this situation:
    // a = Inst
    // b = sext ty1 a to ty2
    // c = sext ty1 a to ty3
    // Assuming ty2 is shorter than ty3, this could be turned into:
    // a = Inst
    // b = sext ty1 a to ty2
    // c = sext ty2 b to ty3
    // However, the last sext is not free.
    if (IsSExt)
      return false;

    // This is a ZExt, maybe this is free to extend from one type to another.
    // In that case, we would not account for a different use.
    Type *NarrowTy;
    Type *LargeTy;
    if (ExtTy->getScalarType()->getIntegerBitWidth() >
        CurTy->getScalarType()->getIntegerBitWidth()) {
      NarrowTy = CurTy;
      LargeTy = ExtTy;
    } else {
      NarrowTy = ExtTy;
      LargeTy = CurTy;
    }

    if (!TLI.isZExtFree(NarrowTy, LargeTy))
      return false;
  }
  // All uses are the same or can be derived from one another for free.
  return true;
}

/// \brief Try to form ExtLd by promoting \p Exts until they reach a
/// load instruction.
/// If an ext(load) can be formed, it is returned via \p LI for the load
/// and \p Inst for the extension.
/// Otherwise LI == nullptr and Inst == nullptr.
/// When some promotion happened, \p TPT contains the proper state to
/// revert them.
///
/// \return true when promoting was necessary to expose the ext(load)
/// opportunity, false otherwise.
///
/// Example:
/// \code
/// %ld = load i32* %addr
/// %add = add nuw i32 %ld, 4
/// %zext = zext i32 %add to i64
/// \endcode
/// =>
/// \code
/// %ld = load i32* %addr
/// %zext = zext i32 %ld to i64
/// %add = add nuw i64 %zext, 4
/// \endcode
/// Thanks to the promotion, we can match zext(load i32*) to i64.
2015-09-22 07:03:16 +08:00
bool CodeGenPrepare::extLdPromotion(TypePromotionTransaction &TPT,
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it
does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such a pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
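As an illustration of where such sequences come from, here is a minimal C++ sketch of source code that mixes 8-bit, 32-bit and 64-bit integers in the same way as the first add in @foo. The function name widenSum is hypothetical, and the exact IR a compiler emits for it depends on the front end and flags, so treat the comments as the expected shape rather than guaranteed output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical source-level counterpart of %ld, %zextld, %ld2, %add and
// %sextadd in the example: an 8-bit load is zero-extended to 32 bits, added
// to a 32-bit load, and the sum is sign-extended to 64 bits for the caller.
static int64_t widenSum(const uint8_t *addr1, const int32_t *addr2) {
  int32_t zextld = *addr1;       // zext i8 -> i32
  int32_t add = *addr2 + zextld; // 32-bit signed add; overflow is undefined,
                                 // which is what allows the nsw flag
  return add;                    // sext i32 -> i64
}
```

Array indexing with an int index on a 64-bit target has the same structure: the 32-bit index arithmetic is followed by a sign extension before it is used in address computation.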
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
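The "do not degrade the code quality" rule corresponds to a cost cut-off in extLdPromotion in this file: the cost of newly created instructions is accumulated, the cost of the extension being removed is subtracted, and the search path is abandoned when the total exceeds 1. The sketch below models only that arithmetic as a standalone function. The name shouldKeepPromoting and its parameters are hypothetical, and the separate legality check (isPromotedInstructionLegal) is deliberately left out.

```cpp
#include <cassert>

// Hypothetical standalone model of the profitability cut-off.
//   createdSoFar  - cost accumulated by earlier promotion steps
//   newlyCreated  - cost of instructions created by this promotion step
//   extIsFree     - whether the target considers the removed extension free
//   stress        - models the stress option, which bypasses the cut-off
static bool shouldKeepPromoting(unsigned createdSoFar, unsigned newlyCreated,
                                bool extIsFree, bool stress = false) {
  long long total = static_cast<long long>(createdSoFar) + newlyCreated;
  // Removing a non-free extension pays for one created instruction.
  total -= extIsFree ? 0 : 1;
  // A total of exactly 1 is tolerated: the search continues optimistically
  // because a later step may remove the remaining extension too.
  return stress || total <= 1;
}
```

For instance, creating two instructions while removing a non-free extension gives a total of 1 and the search continues, whereas the same step with a free extension gives 2 and is rolled back.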
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regressions besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
                                    LoadInst *&LI, Instruction *&Inst,
                                    const SmallVectorImpl<Instruction *> &Exts,
2015-03-11 05:48:15 +08:00
                                    unsigned CreatedInstsCost = 0) {
2014-12-17 09:36:17 +08:00
  // Iterate over all the extensions to see if one forms an ext(load).
  for (auto I : Exts) {
    // Check if we directly have ext(load).
    if ((LI = dyn_cast<LoadInst>(I->getOperand(0)))) {
      Inst = I;
      // No promotion happened here.
      return false;
    }
    // Check whether or not we want to do any promotion.
    if (!TLI || !TLI->enableExtLdPromotion() || DisableExtLdPromotion)
      continue;
    // Get the action to perform the promotion.
    TypePromotionHelper::Action TPH = TypePromotionHelper::getAction(
2015-06-18 04:44:32 +08:00
        I, InsertedInsts, *TLI, PromotedInsts);
2014-12-17 09:36:17 +08:00
    // Check if we can promote.
    if (!TPH)
      continue;
    // Save the current state.
    TypePromotionTransaction::ConstRestorationPt LastKnownGood =
        TPT.getRestorationPoint();
    SmallVector<Instruction *, 4> NewExts;
2015-03-11 05:48:15 +08:00
    unsigned NewCreatedInstsCost = 0;
    unsigned ExtCost = !TLI->isExtFree(I);
2014-12-17 09:36:17 +08:00
    // Promote.
2015-03-11 05:48:15 +08:00
    Value *PromotedVal = TPH(I, TPT, PromotedInsts, NewCreatedInstsCost,
                             &NewExts, nullptr, *TLI);
2014-12-17 09:36:17 +08:00
    assert(PromotedVal &&
           "TypePromotionHelper should have filtered out those cases");

    // We would be able to merge only one extension in a load.
    // Therefore, if we have more than 1 new extension we heuristically
    // cut this search path, because it means we degrade the code quality.
    // With exactly 2, the transformation is neutral, because we will merge
    // one extension but leave one. However, we optimistically keep going,
    // because the new extension may be removed too.
2015-03-11 05:48:15 +08:00
    long long TotalCreatedInstsCost = CreatedInstsCost + NewCreatedInstsCost;
    TotalCreatedInstsCost -= ExtCost;
2014-12-17 09:36:17 +08:00
    if (!StressExtLdPromotion &&
2015-03-11 05:48:15 +08:00
        (TotalCreatedInstsCost > 1 ||
2015-07-09 10:09:04 +08:00
         !isPromotedInstructionLegal(*TLI, *DL, PromotedVal))) {
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector type, so when make
such it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %es # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
      // The promotion is not profitable, rollback to the previous state.
      TPT.rollback(LastKnownGood);
      continue;
    }
    // The promotion is profitable.
    // Check if it exposes an ext(load).
    (void)extLdPromotion(TPT, LI, Inst, NewExts, TotalCreatedInstsCost);
    if (LI && (StressExtLdPromotion || NewCreatedInstsCost <= ExtCost ||
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it
does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
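As an illustration of why the transformation above preserves values, here is a
minimal C++ sketch, not code from the patch. It models the motivating example's
"add nsw i32" followed by "sext ... to i64" and compares it with the promoted
form, where both operands are extended first and the add is done on 64 bits.
The function names (extAfterAdd, addAfterExt, promotionPreservesValue) are
hypothetical, and the sketch assumes the 32-bit add never overflows, which is
what the nsw flag promises in the IR.

```cpp
#include <cstdint>

// Original shape: 32-bit add, then sign extension of the result.
// Mirrors: %add = add nsw i32 %ld2, %zextld ; %sextadd = sext i32 %add to i64
// Assumption: the caller only passes values for which the add does not
// overflow (the "nsw" guarantee).
int64_t extAfterAdd(int32_t ld2, uint8_t ld) {
  int32_t zextld = static_cast<int32_t>(ld); // zext i8 -> i32
  int32_t add = ld2 + zextld;                // 32-bit add
  return static_cast<int64_t>(add);          // sext i32 -> i64
}

// Promoted shape: extend each operand first, then add on 64 bits.
// Each extension now sits directly on a loaded value, so instruction
// selection could fold it into an extending load.
int64_t addAfterExt(int32_t ld2, uint8_t ld) {
  int64_t sextld2 = static_cast<int64_t>(ld2); // sext i32 -> i64
  int64_t zextld = static_cast<int64_t>(ld);   // zext i8 -> i64
  return sextld2 + zextld;                     // 64-bit add
}

// Checks both shapes on a few sample inputs that respect the no-overflow
// assumption.
bool promotionPreservesValue() {
  const int32_t words[] = {0, 1, -1, 12345, -2000000000, 2000000000};
  const uint8_t bytes[] = {0, 1, 127, 255};
  for (int32_t w : words)
    for (uint8_t b : bytes)
      if (extAfterAdd(w, b) != addAfterExt(w, b))
        return false;
  return true;
}
```

Without the no-overflow guarantee the two shapes can differ, which is one
reason the promotion only applies to instructions that are known to be
promotable.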
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression beside noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
               // If we have created a new extension, i.e., now we have two
               // extensions. We must make sure one of them is merged with
               // the load, otherwise we may degrade the code quality.
               (LI->hasOneUse() || hasSameExtUse(LI, *TLI))))
      // Promotion happened.
      return true;
    // If this does not help to expose an ext(load) then, rollback.
    TPT.rollback(LastKnownGood);
  }
  // None of the extension can form an ext(load).
  LI = nullptr;
  Inst = nullptr;
  return false;
}

/// Move a zext or sext fed by a load into the same basic block as the load,
/// unless conditions are unfavorable. This allows SelectionDAG to fold the
/// extend into the load.
/// \p I[in/out] the extension may be modified during the process if some
/// promotions apply.
///
bool CodeGenPrepare::moveExtToFormExtLoad(Instruction *&I) {
  // Try to promote a chain of computation if it allows to form
  // an extended load.
  TypePromotionTransaction TPT;
  TypePromotionTransaction::ConstRestorationPt LastKnownGood =
      TPT.getRestorationPoint();
  SmallVector<Instruction *, 1> Exts;
  Exts.push_back(I);
  // Look for a load being extended.
  LoadInst *LI = nullptr;
  Instruction *OldExt = I;
  bool HasPromoted = extLdPromotion(TPT, LI, I, Exts);
  if (!LI || !I) {
    assert(!HasPromoted && !LI && "If we did not match any load instruction "
                                  "the code must remain the same");
    I = OldExt;
    return false;
  }

  // If they're already in the same block, there's nothing to do.
  // Make the cheap checks first if we did not promote.
  // If we promoted, we need to check if it is indeed profitable.
  if (!HasPromoted && LI->getParent() == I->getParent())
    return false;

  EVT VT = TLI->getValueType(*DL, I->getType());
  EVT LoadVT = TLI->getValueType(*DL, LI->getType());

  // If the load has other users and the truncate is not free, this probably
  // isn't worthwhile.
  if (!LI->hasOneUse() && TLI &&
      (TLI->isTypeLegal(LoadVT) || !TLI->isTypeLegal(VT)) &&
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector type, so when make
such it does not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such pattern in
CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %es # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequences happen a lot on code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them in the
loads at the beginning of the chain of computation by promoting all the chain of
computation on the extended type. The promotion is done if and only if we do not
introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
!TLI->isTruncateFree(I->getType(), LI->getType())) {
|
|
|
|
I = OldExt;
|
|
|
|
TPT.rollback(LastKnownGood);
|
2009-10-17 04:59:35 +08:00
|
|
|
return false;
|
[CodeGenPrepare] Reapply r224351 with a fix for the assertion failure:
The type promotion helper does not support vector types, so make sure it does
not kick in in such cases.
Original commit message:
[CodeGenPrepare] Move sign/zero extensions near loads using type promotion.
This patch extends the optimization in CodeGenPrepare that moves a sign/zero
extension near a load when the target can combine them. The optimization may
promote any operations between the extension and the load to make that possible.
Although this optimization may be beneficial for all targets, in particular
AArch64, this is enabled for X86 only as I have not benchmarked it for other
targets yet.
** Context **
Most targets feature extended loads, i.e., loads that perform a zero or sign
extension for free. In that context it is interesting to expose such a pattern
in CodeGenPrepare so that the instruction selection pass can form such loads.
Sometimes, this pattern is blocked because of instructions between the load and
the extension. When those instructions are promotable to the extended type, we
can expose this pattern.
** Motivating Example **
Let us consider an example:
define void @foo(i8* %addr1, i32* %addr2, i8 %a, i32 %b) {
%ld = load i8* %addr1
%zextld = zext i8 %ld to i32
%ld2 = load i32* %addr2
%add = add nsw i32 %ld2, %zextld
%sextadd = sext i32 %add to i64
%zexta = zext i8 %a to i32
%addza = add nsw i32 %zexta, %zextld
%sextaddza = sext i32 %addza to i64
%addb = add nsw i32 %b, %zextld
%sextaddb = sext i32 %addb to i64
call void @dummy(i64 %sextadd, i64 %sextaddza, i64 %sextaddb)
ret void
}
As it is, this IR generates the following assembly on x86_64:
[...]
movzbl (%rdi), %eax # zero-extended load
movl (%rsi), %esi # plain load
addl %eax, %esi # 32-bit add
movslq %esi, %rdi # sign extend the result of add
movzbl %dl, %edx # zero extend the first argument
addl %eax, %edx # 32-bit add
movslq %edx, %rsi # sign extend the result of add
addl %eax, %ecx # 32-bit add
movslq %ecx, %rdx # sign extend the result of add
[...]
The throughput of this sequence is 7.45 cycles on Ivy Bridge according to IACA.
Now, by promoting the additions to form more extended loads we would generate:
[...]
movzbl (%rdi), %eax # zero-extended load
movslq (%rsi), %rdi # sign-extended load
addq %rax, %rdi # 64-bit add
movzbl %dl, %esi # zero extend the first argument
addq %rax, %rsi # 64-bit add
movslq %ecx, %rdx # sign extend the second argument
addq %rax, %rdx # 64-bit add
[...]
The throughput of this sequence is 6.15 cycles on Ivy Bridge according to IACA.
This kind of sequence happens a lot in code using 32-bit indexes on 64-bit
architectures.
Note: The throughput numbers are similar on Sandy Bridge and Haswell.
** Proposed Solution **
To avoid the penalty of all these sign/zero extensions, we merge them into the
loads at the beginning of the chain of computation by promoting the whole chain
of computation to the extended type. The promotion is done if and only if we do
not introduce new extensions, i.e., if we do not degrade the code quality.
To achieve this, we extend the existing “move ext to load” optimization with the
promotion mechanism introduced to match larger patterns for addressing mode
(r200947).
The idea of this extension is to perform the following transformation:
ext(promotableInst1(...(promotableInstN(load))))
=>
promotedInst1(...(promotedInstN(ext(load))))
The promotion mechanism in that optimization is enabled by a new TargetLowering
switch, which is off by default. In other words, by default, the optimization
performs the “move ext to load” optimization as it was before this patch.
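As a rough illustration of why the promotion is legal, the sketch below models the first chain of the motivating example in plain C++ rather than LLVM IR. The function names are hypothetical and not part of LLVM; the assert is only a stand-in for the `nsw` flag on the 32-bit add.

```cpp
#include <cassert>
#include <cstdint>

// Original shape: sext(add nsw i32 (ld2, zext i8 ld)).
// The add is modelled in 64 bits so the sketch itself never overflows; the
// assert stands in for the `nsw` guarantee on the 32-bit add.
inline int64_t extendAfterAdd(uint8_t ld, int32_t ld2) {
  int64_t wide = static_cast<int64_t>(ld2) + static_cast<int64_t>(ld);
  assert(wide >= INT32_MIN && wide <= INT32_MAX && "nsw would be violated");
  int32_t narrow = static_cast<int32_t>(wide); // the 32-bit add result
  return static_cast<int64_t>(narrow);         // the trailing sext
}

// Promoted shape: add i64 (sext ld2, zext ld). Both extensions now sit
// directly on the loaded values, where they can fold into extended loads.
inline int64_t addAfterExtend(uint8_t ld, int32_t ld2) {
  return static_cast<int64_t>(ld2) + static_cast<int64_t>(ld);
}
```

Under the no-overflow assumption both shapes return the same value, which is what lets the extension move past the add and next to the load.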
** Performance **
Configuration: x86_64: Ivy Bridge fixed at 2900MHz running OS X 10.10.
Tested Optimization Levels: O3/Os
Tests: llvm-testsuite + externals.
Results:
- No regression besides noise.
- Improvements:
CINT2006/473.astar: ~2%
Benchmarks/PAQ8p: ~2%
Misc/perlin: ~3%
The results are consistent for both O3 and Os.
<rdar://problem/18310086>
llvm-svn: 224402
2014-12-17 09:36:17 +08:00
|
|
|
}
|
2009-10-17 04:59:35 +08:00
|
|
|
|
|
|
|
// Check whether the target supports casts folded into loads.
|
|
|
|
unsigned LType;
|
|
|
|
if (isa<ZExtInst>(I))
|
|
|
|
LType = ISD::ZEXTLOAD;
|
|
|
|
else {
|
|
|
|
assert(isa<SExtInst>(I) && "Unexpected ext type!");
|
|
|
|
LType = ISD::SEXTLOAD;
|
|
|
|
}
|
[SelectionDAG] Allow targets to specify legality of extloads' result
type (in addition to the memory type).
The *LoadExt* legalization handling used to only have one type, the
memory type. This forced users to assume that as long as the extload
for the memory type was declared legal, and the result type was legal,
the whole extload was legal.
However, this isn't always the case. For instance, on X86, with AVX,
this is legal:
v4i32 load, zext from v4i8
but this isn't:
v4i64 load, zext from v4i8
Whereas v4i64 is (arguably) legal, even without AVX2.
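The idea of keying legality on both the result type and the memory type can be sketched with a small standalone table. This is not the TargetLowering implementation: the class, the enum, and the use of strings for type names are all assumptions made for illustration.

```cpp
#include <map>
#include <string>
#include <tuple>

enum class ExtKind { ZExt, SExt };

// Key: (extension kind, result type, memory type). Type names are plain
// strings here purely for illustration.
using ExtLoadKey = std::tuple<ExtKind, std::string, std::string>;

class ExtLoadTable {
  std::map<ExtLoadKey, bool> Legal;

public:
  void setLegal(ExtKind K, const std::string &ValVT, const std::string &MemVT,
                bool IsLegal) {
    Legal[ExtLoadKey(K, ValVT, MemVT)] = IsLegal;
  }

  // Combinations that were never registered are treated as not legal in this
  // sketch.
  bool isLegal(ExtKind K, const std::string &ValVT,
               const std::string &MemVT) const {
    auto It = Legal.find(ExtLoadKey(K, ValVT, MemVT));
    return It != Legal.end() && It->second;
  }
};
```

With a memory-type-only key, the two v4i8 cases above could not be told apart; adding the result type to the key makes them separate entries.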
Note that the same thing was done a while ago for truncstores (r46140),
but I assume no one needed it yet for extloads, so here we go.
Calls to getLoadExtAction were changed to add the value type, found
manually in the surrounding code.
Calls to setLoadExtAction were mechanically changed, by wrapping the
call in a loop, to match previous behavior. The loop iterates over
the MVT subrange corresponding to the memory type (FP vectors, etc...).
I also pulled neighboring setTruncStoreActions into some of the loops;
those shouldn't make a difference, as the additional types are illegal.
(e.g., i128->i1 truncstores on PPC.)
No functional change intended.
Differential Revision: http://reviews.llvm.org/D6532
llvm-svn: 225421
2015-01-08 08:51:32 +08:00
|
|
|
if (TLI && !TLI->isLoadExtLegal(LType, VT, LoadVT)) {
|
2014-12-17 09:36:17 +08:00
|
|
|
I = OldExt;
|
|
|
|
TPT.rollback(LastKnownGood);
|
2009-10-17 04:59:35 +08:00
|
|
|
return false;
|
2014-12-17 09:36:17 +08:00
|
|
|
}
|
2009-10-17 04:59:35 +08:00
|
|
|
|
|
|
|
// Move the extend into the same block as the load, so that SelectionDAG
|
|
|
|
// can fold it.
|
2014-12-17 09:36:17 +08:00
|
|
|
TPT.commit();
|
2009-10-17 04:59:35 +08:00
|
|
|
I->removeFromParent();
|
|
|
|
I->insertAfter(LI);
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumExtsMoved;
|
2009-10-17 04:59:35 +08:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeExtUses(Instruction *I) {
|
2007-12-06 07:58:20 +08:00
|
|
|
BasicBlock *DefBB = I->getParent();
|
|
|
|
|
2010-09-22 05:44:14 +08:00
|
|
|
// If the result of a {s|z}ext and its source are both live out, rewrite all
|
2007-12-06 07:58:20 +08:00
|
|
|
// other uses of the source with result of extension.
|
|
|
|
Value *Src = I->getOperand(0);
|
|
|
|
if (Src->hasOneUse())
|
|
|
|
return false;
|
|
|
|
|
2007-12-13 15:50:36 +08:00
|
|
|
// Only do this xform if truncating is free.
|
2008-02-27 03:13:21 +08:00
|
|
|
if (TLI && !TLI->isTruncateFree(I->getType(), Src->getType()))
|
2007-12-13 11:32:53 +08:00
|
|
|
return false;
|
|
|
|
|
2007-12-12 08:51:06 +08:00
|
|
|
// Only safe to perform the optimization if the source is also defined in
|
2007-12-12 10:53:41 +08:00
|
|
|
// this block.
|
|
|
|
if (!isa<Instruction>(Src) || DefBB != cast<Instruction>(Src)->getParent())
|
2007-12-12 08:51:06 +08:00
|
|
|
return false;
|
|
|
|
|
2007-12-06 07:58:20 +08:00
|
|
|
bool DefIsLiveOut = false;
|
2014-03-09 11:16:01 +08:00
|
|
|
for (User *U : I->users()) {
|
|
|
|
Instruction *UI = cast<Instruction>(U);
|
2007-12-06 07:58:20 +08:00
|
|
|
|
|
|
|
// Figure out which BB this ext is used in.
|
2014-03-09 11:16:01 +08:00
|
|
|
BasicBlock *UserBB = UI->getParent();
|
2007-12-06 07:58:20 +08:00
|
|
|
if (UserBB == DefBB) continue;
|
|
|
|
DefIsLiveOut = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (!DefIsLiveOut)
|
|
|
|
return false;
|
|
|
|
|
2013-04-16 01:40:48 +08:00
|
|
|
// Make sure none of the uses are PHI nodes.
|
2014-03-09 11:16:01 +08:00
|
|
|
for (User *U : Src->users()) {
|
|
|
|
Instruction *UI = cast<Instruction>(U);
|
|
|
|
BasicBlock *UserBB = UI->getParent();
|
2007-12-13 11:32:53 +08:00
|
|
|
if (UserBB == DefBB) continue;
|
|
|
|
// Be conservative. We don't want this xform to end up introducing
|
|
|
|
// reloads just before load / store instructions.
|
2014-03-09 11:16:01 +08:00
|
|
|
if (isa<PHINode>(UI) || isa<LoadInst>(UI) || isa<StoreInst>(UI))
|
2007-12-12 10:53:41 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2007-12-06 07:58:20 +08:00
|
|
|
// InsertedTruncs - Only insert one trunc in each block once.
|
|
|
|
DenseMap<BasicBlock*, Instruction*> InsertedTruncs;
|
|
|
|
|
|
|
|
bool MadeChange = false;
|
2014-03-09 11:16:01 +08:00
|
|
|
for (Use &U : Src->uses()) {
|
|
|
|
Instruction *User = cast<Instruction>(U.getUser());
|
2007-12-06 07:58:20 +08:00
|
|
|
|
|
|
|
// Figure out which BB this ext is used in.
|
|
|
|
BasicBlock *UserBB = User->getParent();
|
|
|
|
if (UserBB == DefBB) continue;
|
|
|
|
|
|
|
|
// Both src and def are live in this block. Rewrite the use.
|
|
|
|
Instruction *&InsertedTrunc = InsertedTruncs[UserBB];
|
|
|
|
|
|
|
|
if (!InsertedTrunc) {
|
2011-08-17 04:45:24 +08:00
|
|
|
BasicBlock::iterator InsertPt = UserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != UserBB->end());
|
|
|
|
InsertedTrunc = new TruncInst(I, Src->getType(), "", &*InsertPt);
|
2015-06-18 04:44:32 +08:00
|
|
|
InsertedInsts.insert(InsertedTrunc);
|
2007-12-06 07:58:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Replace a use of the {s|z}ext source with a use of the result.
|
2014-03-09 11:16:01 +08:00
|
|
|
U = InsertedTrunc;
|
2011-01-06 01:27:27 +08:00
|
|
|
++NumExtUses;
|
2007-12-06 07:58:20 +08:00
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return MadeChange;
|
|
|
|
}
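The "one trunc per block" bookkeeping in the function above can be sketched without any LLVM types. The struct, the function name, and the string-based block and value names below are hypothetical; only the caching pattern mirrors `InsertedTruncs`.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// A use of the narrow source value: the block it lives in and the name of the
// value it currently reads.
struct SrcUse {
  std::string Block;
  std::string Operand;
};

// Rewrites every use outside DefBlock to read a per-block truncation of the
// extended value. Returns the number of truncations created.
inline int rewriteUsesOutsideDef(std::vector<SrcUse> &Uses,
                                 const std::string &DefBlock) {
  std::unordered_map<std::string, std::string> InsertedTruncs; // block -> trunc
  for (SrcUse &U : Uses) {
    if (U.Block == DefBlock)
      continue; // uses in the defining block keep reading the source
    std::string &Trunc = InsertedTruncs[U.Block];
    if (Trunc.empty())
      Trunc = "trunc." + U.Block; // created once per block, then reused
    U.Operand = Trunc;
  }
  return static_cast<int>(InsertedTruncs.size());
}
```

Taking a reference into the map, as the original does with `Instruction *&InsertedTrunc`, lets a single lookup both test for and record the per-block truncation.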
|
|
|
|
|
2015-10-20 05:59:12 +08:00
|
|
|
/// Check if V (an operand of a select instruction) is an expensive instruction
|
|
|
|
/// that is only used once.
|
|
|
|
static bool sinkSelectOperand(const TargetTransformInfo *TTI, Value *V) {
|
|
|
|
auto *I = dyn_cast<Instruction>(V);
|
|
|
|
// If it's safe to speculatively execute, then it should not have side
|
|
|
|
// effects; therefore, it's safe to sink and possibly *not* execute.
|
2015-10-25 07:11:13 +08:00
|
|
|
return I && I->hasOneUse() && isSafeToSpeculativelyExecute(I) &&
|
|
|
|
TTI->getUserCost(I) >= TargetTransformInfo::TCC_Expensive;
|
2015-10-20 05:59:12 +08:00
|
|
|
}
|
|
|
|
|
2015-09-22 06:47:23 +08:00
|
|
|
/// Returns true if a SelectInst should be turned into an explicit branch.
|
2015-10-20 05:59:12 +08:00
|
|
|
static bool isFormingBranchFromSelectProfitable(const TargetTransformInfo *TTI,
|
|
|
|
SelectInst *SI) {
|
CodeGenPrepare: Add a transform to turn selects into branches in some cases.
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic: it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-order architectures are
unlikely to benefit either; those are covered by the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which performs the opposite of this
transform.
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case, and to him and Evan Cheng for
providing me with comments and test-suite numbers that were more stable than
mine :)
llvm-svn: 156234
2012-05-05 20:49:22 +08:00
|
|
|
// FIXME: This should use the same heuristics as IfConversion to determine
|
|
|
|
// whether a select is better represented as a branch. This requires that
|
|
|
|
// branch probability metadata is preserved for the select, which is not the
|
|
|
|
// case currently.
|
|
|
|
|
|
|
|
CmpInst *Cmp = dyn_cast<CmpInst>(SI->getCondition());
|
|
|
|
|
2015-09-29 06:14:51 +08:00
|
|
|
// If a branch is predictable, an out-of-order CPU can avoid blocking on its
|
|
|
|
// comparison condition. If the compare has more than one use, there's
|
|
|
|
// probably another cmov or setcc around, so it's not worth emitting a branch.
|
2015-09-29 05:44:46 +08:00
|
|
|
if (!Cmp || !Cmp->hasOneUse())
|
2012-05-05 20:49:22 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
Value *CmpOp0 = Cmp->getOperand(0);
|
|
|
|
Value *CmpOp1 = Cmp->getOperand(1);
|
|
|
|
|
2015-09-29 06:14:51 +08:00
|
|
|
// Emit "cmov on compare with a memory operand" as a branch to avoid stalls
|
|
|
|
// on a load from memory. But if the load is used more than once, do not
|
|
|
|
// change the select to a branch because the load is probably needed
|
|
|
|
// regardless of whether the branch is taken or not.
|
2015-10-20 05:59:12 +08:00
|
|
|
if ((isa<LoadInst>(CmpOp0) && CmpOp0->hasOneUse()) ||
|
|
|
|
(isa<LoadInst>(CmpOp1) && CmpOp1->hasOneUse()))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
// If either operand of the select is expensive and only needed on one side
|
|
|
|
// of the select, we should form a branch.
|
|
|
|
if (sinkSelectOperand(TTI, SI->getTrueValue()) ||
|
|
|
|
sinkSelectOperand(TTI, SI->getFalseValue()))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
2012-05-05 20:49:22 +08:00
|
|
|
}
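The decision order of the profitability check above can be restated as a standalone predicate. This is a simplified sketch, not the LLVM code: the structs and field names are hypothetical, and the boolean `IsExpensive` merely stands in for the TargetTransformInfo cost query.

```cpp
// Facts about one value feeding the select or its compare.
struct OperandInfo {
  bool IsLoad;
  bool HasOneUse;
  bool IsExpensive;     // stands in for the cost-model query
  bool SafeToSpeculate; // no side effects, so sinking it is harmless
};

struct SelectInfo {
  bool CondIsCompare;
  bool CompareHasOneUse;
  OperandInfo CmpOp0, CmpOp1;
  OperandInfo TrueVal, FalseVal;
};

// An operand is worth sinking into one arm of a branch if nothing else needs
// it, it can safely be skipped, and it is costly to compute.
inline bool worthSinking(const OperandInfo &V) {
  return V.HasOneUse && V.SafeToSpeculate && V.IsExpensive;
}

inline bool preferBranch(const SelectInfo &S) {
  // A shared compare probably feeds another cmov or setcc already.
  if (!S.CondIsCompare || !S.CompareHasOneUse)
    return false;
  // Compare against a load that nothing else uses: branch to avoid the stall.
  if ((S.CmpOp0.IsLoad && S.CmpOp0.HasOneUse) ||
      (S.CmpOp1.IsLoad && S.CmpOp1.HasOneUse))
    return true;
  // Otherwise branch only if one arm can skip an expensive computation.
  return worthSinking(S.TrueVal) || worthSinking(S.FalseVal);
}
```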
|
|
|
|
|
|
|
|
|
2012-09-02 20:10:19 +08:00
|
|
|
/// If we have a SelectInst that will likely profit from branch prediction,
|
|
|
|
/// turn it into a branch.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeSelectInst(SelectInst *SI) {
|
2012-09-02 20:10:19 +08:00
|
|
|
bool VectorCond = !SI->getCondition()->getType()->isIntegerTy(1);
|
2012-05-05 20:49:22 +08:00
|
|
|
|
2012-09-02 20:10:19 +08:00
|
|
|
// Can we convert the 'select' to control flow?
|
|
|
|
if (DisableSelectToBranch || OptSize || !TLI || VectorCond)
|
2012-05-05 20:49:22 +08:00
|
|
|
return false;
|
|
|
|
|
2012-09-02 20:10:19 +08:00
|
|
|
TargetLowering::SelectSupportKind SelectKind;
|
|
|
|
if (VectorCond)
|
|
|
|
SelectKind = TargetLowering::VectorMaskSelect;
|
|
|
|
else if (SI->getType()->isVectorTy())
|
|
|
|
SelectKind = TargetLowering::ScalarCondVectorVal;
|
|
|
|
else
|
|
|
|
SelectKind = TargetLowering::ScalarValSelect;
|
|
|
|
|
|
|
|
// Do we have efficient codegen support for this kind of 'select'?
|
|
|
|
if (TLI->isSelectSupported(SelectKind)) {
|
|
|
|
// We have efficient codegen support for the select instruction.
|
|
|
|
// Check if it is profitable to keep this 'select'.
|
|
|
|
if (!TLI->isPredictableSelectExpensive() ||
|
2015-10-20 05:59:12 +08:00
|
|
|
!isFormingBranchFromSelectProfitable(TTI, SI))
|
2012-09-02 20:10:19 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2012-05-05 20:49:22 +08:00
|
|
|
ModifiedDT = true;
|
|
|
|
|
2015-10-20 05:59:12 +08:00
|
|
|
// Transform a sequence like this:
|
|
|
|
// start:
|
|
|
|
// %cmp = icmp uge i32 %a, %b
|
|
|
|
// %sel = select i1 %cmp, i32 %c, i32 %d
|
|
|
|
//
|
|
|
|
// Into:
|
|
|
|
// start:
|
|
|
|
// %cmp = icmp uge i32 %a, %b
|
|
|
|
// br i1 %cmp, label %select.true, label %select.false
|
|
|
|
// select.true:
|
|
|
|
// br label %select.end
|
|
|
|
// select.false:
|
|
|
|
// br label %select.end
|
|
|
|
// select.end:
|
|
|
|
// %sel = phi i32 [ %c, %select.true ], [ %d, %select.false ]
|
|
|
|
//
|
|
|
|
// In addition, we may sink instructions that produce %c or %d from
|
|
|
|
// the entry block into the destination(s) of the new branch.
|
|
|
|
// If the true or false blocks do not contain a sunken instruction, that
|
|
|
|
// block and its branch may be optimized away. In that case, one side of the
|
|
|
|
// first branch will point directly to select.end, and the corresponding PHI
|
|
|
|
// predecessor block will be the start block.
|
|
|
|
|
2012-05-05 20:49:22 +08:00
|
|
|
// First, we split the block containing the select into 2 blocks.
|
|
|
|
BasicBlock *StartBlock = SI->getParent();
|
|
|
|
BasicBlock::iterator SplitPt = ++(BasicBlock::iterator(SI));
|
2015-10-20 05:59:12 +08:00
|
|
|
BasicBlock *EndBlock = StartBlock->splitBasicBlock(SplitPt, "select.end");
|
2015-10-17 00:54:30 +08:00
|
|
|
|
2015-10-20 05:59:12 +08:00
|
|
|
// Delete the unconditional branch that was just created by the split.
|
2015-10-17 07:00:29 +08:00
|
|
|
StartBlock->getTerminator()->eraseFromParent();
|
2015-10-20 05:59:12 +08:00
|
|
|
|
|
|
|
// These are the new basic blocks for the conditional branch.
|
|
|
|
// At least one will become an actual new basic block.
|
|
|
|
BasicBlock *TrueBlock = nullptr;
|
|
|
|
BasicBlock *FalseBlock = nullptr;
|
|
|
|
|
|
|
|
// Sink expensive instructions into the conditional blocks to avoid executing
|
|
|
|
// them speculatively.
|
|
|
|
if (sinkSelectOperand(TTI, SI->getTrueValue())) {
|
|
|
|
TrueBlock = BasicBlock::Create(SI->getContext(), "select.true.sink",
|
|
|
|
EndBlock->getParent(), EndBlock);
|
|
|
|
auto *TrueBranch = BranchInst::Create(EndBlock, TrueBlock);
|
|
|
|
auto *TrueInst = cast<Instruction>(SI->getTrueValue());
|
|
|
|
TrueInst->moveBefore(TrueBranch);
|
|
|
|
}
|
|
|
|
if (sinkSelectOperand(TTI, SI->getFalseValue())) {
|
|
|
|
FalseBlock = BasicBlock::Create(SI->getContext(), "select.false.sink",
|
|
|
|
EndBlock->getParent(), EndBlock);
|
|
|
|
auto *FalseBranch = BranchInst::Create(EndBlock, FalseBlock);
|
|
|
|
auto *FalseInst = cast<Instruction>(SI->getFalseValue());
|
|
|
|
FalseInst->moveBefore(FalseBranch);
|
|
|
|
}
|
|
|
|
|
|
|
|
// If there was nothing to sink, then arbitrarily choose the 'false' side
|
|
|
|
// for a new input value to the PHI.
|
|
|
|
if (TrueBlock == FalseBlock) {
|
|
|
|
assert(TrueBlock == nullptr &&
|
|
|
|
"Unexpected basic block transform while optimizing select");
|
|
|
|
|
|
|
|
FalseBlock = BasicBlock::Create(SI->getContext(), "select.false",
|
|
|
|
EndBlock->getParent(), EndBlock);
|
|
|
|
BranchInst::Create(EndBlock, FalseBlock);
|
|
|
|
}
|
2012-05-05 20:49:22 +08:00
|
|
|
|
|
|
|
// Insert the real conditional branch based on the original condition.
|
2015-10-20 05:59:12 +08:00
|
|
|
// If we did not create a new block for one of the 'true' or 'false' paths
|
|
|
|
// of the condition, it means that side of the branch goes to the end block
|
|
|
|
// directly and the path originates from the start block from the point of
|
|
|
|
// view of the new PHI.
|
|
|
|
if (TrueBlock == nullptr) {
|
|
|
|
BranchInst::Create(EndBlock, FalseBlock, SI->getCondition(), SI);
|
|
|
|
TrueBlock = StartBlock;
|
|
|
|
} else if (FalseBlock == nullptr) {
|
|
|
|
BranchInst::Create(TrueBlock, EndBlock, SI->getCondition(), SI);
|
|
|
|
FalseBlock = StartBlock;
|
|
|
|
} else {
|
|
|
|
BranchInst::Create(TrueBlock, FalseBlock, SI->getCondition(), SI);
|
|
|
|
}
|
2012-05-05 20:49:22 +08:00
|
|
|
|
|
|
|
// The select itself is replaced with a PHI Node.
|
2015-10-20 05:59:12 +08:00
|
|
|
PHINode *PN = PHINode::Create(SI->getType(), 2, "", &EndBlock->front());
|
2012-05-05 20:49:22 +08:00
|
|
|
PN->takeName(SI);
|
2015-10-20 05:59:12 +08:00
|
|
|
PN->addIncoming(SI->getTrueValue(), TrueBlock);
|
|
|
|
PN->addIncoming(SI->getFalseValue(), FalseBlock);
|
|
|
|
|
2012-05-05 20:49:22 +08:00
|
|
|
SI->replaceAllUsesWith(PN);
|
|
|
|
SI->eraseFromParent();
|
|
|
|
|
|
|
|
// Instruct OptimizeBlock to skip to the next block.
|
|
|
|
CurInstIterator = StartBlock->end();
|
|
|
|
++NumSelectsExpanded;
|
|
|
|
return true;
|
|
|
|
}
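The choice of branch targets and PHI predecessors in the function above can be sketched in isolation. This is a minimal model, not LLVM code: blocks are plain strings, an empty string stands in for `nullptr`, and `BranchPlan` and `planSelectExpansion` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical plain-data model of the outcome; these are not LLVM types.
struct BranchPlan {
  std::string TrueTarget;   // successor taken when the condition is true
  std::string FalseTarget;  // successor taken when the condition is false
  std::string PhiTruePred;  // PHI predecessor that supplies the true value
  std::string PhiFalsePred; // PHI predecessor that supplies the false value
};

// SinkTrue / SinkFalse model the results of the two sinkSelectOperand calls.
BranchPlan planSelectExpansion(bool SinkTrue, bool SinkFalse) {
  std::string TrueBlock = SinkTrue ? "select.true.sink" : "";
  std::string FalseBlock = SinkFalse ? "select.false.sink" : "";

  // Nothing was sunk: create a block for the 'false' side only, so that the
  // PHI still has two distinct predecessors.
  if (TrueBlock.empty() && FalseBlock.empty())
    FalseBlock = "select.false";

  // A side without its own block branches straight to select.end, and the
  // PHI sees that value as coming from the start block.
  if (TrueBlock.empty())
    return {"select.end", FalseBlock, "start", FalseBlock};
  if (FalseBlock.empty())
    return {TrueBlock, "select.end", TrueBlock, "start"};
  return {TrueBlock, FalseBlock, TrueBlock, FalseBlock};
}
```

In this model, at least one new block always exists, which matches the comment "At least one will become an actual new basic block."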
|
|
|
|
|
2014-03-02 01:24:40 +08:00
|
|
|
static bool isBroadcastShuffle(ShuffleVectorInst *SVI) {
|
2014-02-19 18:02:43 +08:00
|
|
|
SmallVector<int, 16> Mask(SVI->getShuffleMask());
|
|
|
|
int SplatElem = -1;
|
|
|
|
for (unsigned i = 0; i < Mask.size(); ++i) {
|
|
|
|
if (SplatElem != -1 && Mask[i] != -1 && Mask[i] != SplatElem)
|
|
|
|
return false;
|
|
|
|
SplatElem = Mask[i];
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
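As a rough standalone illustration of a splat-mask check, the sketch below works on a plain `std::vector<int>` instead of a `ShuffleVectorInst`. The name `isSplatMask` is hypothetical. Unlike the loop above, it skips undefined lanes (-1) without updating the remembered lane, which is an assumption about the intended behavior rather than a description of the function above.

```cpp
#include <cassert>
#include <vector>

// Returns true if every defined mask entry selects the same source lane.
// An entry of -1 denotes an undefined lane and is compatible with any splat.
bool isSplatMask(const std::vector<int> &Mask) {
  int SplatElem = -1; // lane chosen by the first defined entry, if any
  for (int Elt : Mask) {
    if (Elt == -1)
      continue; // undefined lane: keep the remembered lane unchanged
    if (SplatElem != -1 && Elt != SplatElem)
      return false; // two defined entries disagree
    SplatElem = Elt;
  }
  return true;
}
```

Under this reading, a mask such as `<0, -1, 1>` is rejected because its two defined entries pick different lanes.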
|
|
|
|
|
|
|
|
/// Some targets have expensive vector shifts if the lanes aren't all the same
|
|
|
|
/// (e.g. x86 only introduced "vpsllvd" and friends with AVX2). In these cases
|
|
|
|
/// it's often worth sinking a shufflevector splat down to its use so that
|
|
|
|
/// codegen can spot all lanes are identical.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeShuffleVectorInst(ShuffleVectorInst *SVI) {
|
2014-02-19 18:02:43 +08:00
|
|
|
BasicBlock *DefBB = SVI->getParent();
|
|
|
|
|
|
|
|
// Only do this xform if variable vector shifts are particularly expensive.
|
|
|
|
if (!TLI || !TLI->isVectorShiftByScalarCheap(SVI->getType()))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// We only expect better codegen by sinking a shuffle if we can recognise a
|
|
|
|
// constant splat.
|
|
|
|
if (!isBroadcastShuffle(SVI))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// InsertedShuffles - Only insert a shuffle in each block once.
|
|
|
|
DenseMap<BasicBlock*, Instruction*> InsertedShuffles;
|
|
|
|
|
|
|
|
bool MadeChange = false;
|
2014-03-09 11:16:01 +08:00
|
|
|
for (User *U : SVI->users()) {
|
|
|
|
Instruction *UI = cast<Instruction>(U);
|
2014-02-19 18:02:43 +08:00
|
|
|
|
|
|
|
// Figure out which BB this shuffle is used in.
|
2014-03-09 11:16:01 +08:00
|
|
|
BasicBlock *UserBB = UI->getParent();
|
2014-02-19 18:02:43 +08:00
|
|
|
if (UserBB == DefBB) continue;
|
|
|
|
|
|
|
|
// For now only apply this when the splat is used by a shift instruction.
|
2014-03-09 11:16:01 +08:00
|
|
|
if (!UI->isShift()) continue;
|
2014-02-19 18:02:43 +08:00
|
|
|
|
|
|
|
// Everything checks out, sink the shuffle if the user's block doesn't
|
|
|
|
// already have a copy.
|
|
|
|
Instruction *&InsertedShuffle = InsertedShuffles[UserBB];
|
|
|
|
|
|
|
|
if (!InsertedShuffle) {
|
|
|
|
BasicBlock::iterator InsertPt = UserBB->getFirstInsertionPt();
|
2015-10-10 02:44:40 +08:00
|
|
|
assert(InsertPt != UserBB->end());
|
|
|
|
InsertedShuffle =
|
|
|
|
new ShuffleVectorInst(SVI->getOperand(0), SVI->getOperand(1),
|
|
|
|
SVI->getOperand(2), "", &*InsertPt);
|
2014-02-19 18:02:43 +08:00
|
|
|
}
|
|
|
|
|
2014-03-09 11:16:01 +08:00
|
|
|
UI->replaceUsesOfWith(SVI, InsertedShuffle);
|
2014-02-19 18:02:43 +08:00
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If we removed all uses, nuke the shuffle.
|
|
|
|
if (SVI->use_empty()) {
|
|
|
|
SVI->eraseFromParent();
|
|
|
|
MadeChange = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return MadeChange;
|
|
|
|
}
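The "one copy per user block" bookkeeping above can be modeled without LLVM. The sketch below is illustrative only: `ShuffleUser`, `SinkResult`, and `sinkSplatToShiftUsers` are hypothetical names, and blocks are identified by strings. It counts how many users would be rewritten in each block (one clone per map key) and how many uses would be left on the original shuffle.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one user of the shuffle.
struct ShuffleUser {
  std::string Block; // block that contains the user
  bool IsShift;      // whether the user is a shift instruction
};

struct SinkResult {
  std::map<std::string, int> RewrittenPerBlock; // one clone per key
  int RemainingUses = 0;                        // uses left on the original
};

SinkResult sinkSplatToShiftUsers(const std::string &DefBlock,
                                 const std::vector<ShuffleUser> &Users) {
  SinkResult R;
  for (const ShuffleUser &U : Users) {
    // Users in the defining block and non-shift users are left alone.
    if (U.Block == DefBlock || !U.IsShift) {
      ++R.RemainingUses;
      continue;
    }
    // The first user seen in a block creates the entry (the clone); later
    // users in the same block reuse it.
    ++R.RewrittenPerBlock[U.Block];
  }
  return R;
}
```

In this model the original shuffle becomes dead, and could be erased, exactly when `RemainingUses` is zero.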
|
|
|
|
|
2014-11-01 01:52:53 +08:00
|
|
|
namespace {
|
|
|
|
/// \brief Helper class to promote a scalar operation to a vector one.
|
|
|
|
/// This class is used to move an extractelement transition downward.
|
|
|
|
/// E.g.,
|
|
|
|
/// a = vector_op <2 x i32>
|
|
|
|
/// b = extractelement <2 x i32> a, i32 0
|
|
|
|
/// c = scalar_op b
|
|
|
|
/// store c
|
|
|
|
///
|
|
|
|
/// =>
|
|
|
|
/// a = vector_op <2 x i32>
|
|
|
|
/// c = vector_op a (equivalent to scalar_op on the related lane)
|
|
|
|
/// * d = extractelement <2 x i32> c, i32 0
|
|
|
|
/// * store d
|
|
|
|
/// Assuming both extractelement and store can be combined, we get rid of the
|
|
|
|
/// transition.
|
|
|
|
class VectorPromoteHelper {
|
2015-07-09 10:09:04 +08:00
|
|
|
/// DataLayout associated with the current module.
|
|
|
|
const DataLayout &DL;
|
|
|
|
|
2014-11-01 01:52:53 +08:00
|
|
|
/// Used to perform some checks on the legality of vector operations.
|
|
|
|
const TargetLowering &TLI;
|
|
|
|
|
|
|
|
/// Used to estimate the cost of the promoted chain.
|
|
|
|
const TargetTransformInfo &TTI;
|
|
|
|
|
|
|
|
/// The transition being moved downwards.
|
|
|
|
Instruction *Transition;
|
|
|
|
/// The sequence of instructions to be promoted.
|
|
|
|
SmallVector<Instruction *, 4> InstsToBePromoted;
|
|
|
|
/// Cost of combining a store and an extract.
|
|
|
|
unsigned StoreExtractCombineCost;
|
|
|
|
/// Instruction that will be combined with the transition.
|
|
|
|
Instruction *CombineInst;
|
|
|
|
|
|
|
|
/// \brief The instruction that represents the current end of the transition.
|
|
|
|
/// Since we are faking the promotion until we reach the end of the chain
|
|
|
|
/// of computation, we need a way to get the current end of the transition.
|
|
|
|
Instruction *getEndOfTransition() const {
|
|
|
|
if (InstsToBePromoted.empty())
|
|
|
|
return Transition;
|
|
|
|
return InstsToBePromoted.back();
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Return the index of the original value in the transition.
|
|
|
|
/// E.g., for "extractelement <2 x i32> c, i32 1" the original value,
|
|
|
|
/// c, is at index 0.
|
|
|
|
unsigned getTransitionOriginalValueIdx() const {
|
|
|
|
assert(isa<ExtractElementInst>(Transition) &&
|
|
|
|
"Other kind of transitions are not supported yet");
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Return the operand position of the index in the transition.
|
|
|
|
/// E.g., for "extractelement <2 x i32> c, i32 0" the index
|
|
|
|
/// is at index 1.
|
|
|
|
unsigned getTransitionIdx() const {
|
|
|
|
assert(isa<ExtractElementInst>(Transition) &&
|
|
|
|
"Other kind of transitions are not supported yet");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Get the type of the transition.
|
|
|
|
/// This is the type of the original value.
|
|
|
|
/// E.g., for "extractelement <2 x i32> c, i32 1" the type of the
|
|
|
|
/// transition is <2 x i32>.
|
|
|
|
Type *getTransitionType() const {
|
|
|
|
return Transition->getOperand(getTransitionOriginalValueIdx())->getType();
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Promote \p ToBePromoted by moving \p Def downward through.
|
|
|
|
/// I.e., we have the following sequence:
|
|
|
|
/// Def = Transition <ty1> a to <ty2>
|
|
|
|
/// b = ToBePromoted <ty2> Def, ...
|
|
|
|
/// =>
|
|
|
|
/// b = ToBePromoted <ty1> a, ...
|
|
|
|
/// Def = Transition <ty1> ToBePromoted to <ty2>
|
|
|
|
void promoteImpl(Instruction *ToBePromoted);
|
|
|
|
|
|
|
|
/// \brief Check whether or not it is profitable to promote all the
|
|
|
|
/// instructions enqueued to be promoted.
|
|
|
|
bool isProfitableToPromote() {
|
|
|
|
Value *ValIdx = Transition->getOperand(getTransitionIdx());
|
|
|
|
unsigned Index = isa<ConstantInt>(ValIdx)
|
|
|
|
? cast<ConstantInt>(ValIdx)->getZExtValue()
|
|
|
|
: -1;
|
|
|
|
Type *PromotedType = getTransitionType();
|
|
|
|
|
|
|
|
StoreInst *ST = cast<StoreInst>(CombineInst);
|
|
|
|
unsigned AS = ST->getPointerAddressSpace();
|
|
|
|
unsigned Align = ST->getAlignment();
|
|
|
|
// Check if this store is supported.
|
|
|
|
if (!TLI.allowsMisalignedMemoryAccesses(
|
2015-07-09 10:09:04 +08:00
|
|
|
TLI.getValueType(DL, ST->getValueOperand()->getType()), AS,
|
|
|
|
Align)) {
|
|
|
|
// If this is not supported, there is no way we can combine
|
|
|
|
// the extract with the store.
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
// The scalar chain of computation has to pay for the transition
|
|
|
|
// from vector to scalar.
|
|
|
|
// The vector chain has to account for the combining cost.
|
|
|
|
uint64_t ScalarCost =
|
|
|
|
TTI.getVectorInstrCost(Transition->getOpcode(), PromotedType, Index);
|
|
|
|
uint64_t VectorCost = StoreExtractCombineCost;
|
|
|
|
for (const auto &Inst : InstsToBePromoted) {
|
|
|
|
// Compute the cost.
|
|
|
|
// By construction, all instructions being promoted are arithmetic ones.
|
|
|
|
// Moreover, one argument is a constant that can be viewed as a splat
|
|
|
|
// constant.
|
|
|
|
Value *Arg0 = Inst->getOperand(0);
|
|
|
|
bool IsArg0Constant = isa<UndefValue>(Arg0) || isa<ConstantInt>(Arg0) ||
|
|
|
|
isa<ConstantFP>(Arg0);
|
|
|
|
TargetTransformInfo::OperandValueKind Arg0OVK =
|
|
|
|
IsArg0Constant ? TargetTransformInfo::OK_UniformConstantValue
|
|
|
|
: TargetTransformInfo::OK_AnyValue;
|
|
|
|
TargetTransformInfo::OperandValueKind Arg1OVK =
|
|
|
|
!IsArg0Constant ? TargetTransformInfo::OK_UniformConstantValue
|
|
|
|
: TargetTransformInfo::OK_AnyValue;
|
|
|
|
ScalarCost += TTI.getArithmeticInstrCost(
|
|
|
|
Inst->getOpcode(), Inst->getType(), Arg0OVK, Arg1OVK);
|
|
|
|
VectorCost += TTI.getArithmeticInstrCost(Inst->getOpcode(), PromotedType,
|
|
|
|
Arg0OVK, Arg1OVK);
|
|
|
|
}
|
|
|
|
DEBUG(dbgs() << "Estimated cost of computation to be promoted:\nScalar: "
|
|
|
|
<< ScalarCost << "\nVector: " << VectorCost << '\n');
|
|
|
|
return ScalarCost > VectorCost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Generate a constant vector with \p Val with the same
|
|
|
|
/// number of elements as the transition.
|
|
|
|
/// \p UseSplat defines whether or not \p Val should be replicated
|
|
|
|
/// across the whole vector.
|
|
|
|
/// In other words, if UseSplat == true, we generate <Val, Val, ..., Val>,
|
|
|
|
/// otherwise we generate a vector with as many undef as possible:
|
|
|
|
/// <undef, ..., undef, Val, undef, ..., undef> where \p Val is only
|
|
|
|
/// used at the index of the extract.
|
|
|
|
Value *getConstantVector(Constant *Val, bool UseSplat) const {
|
|
|
|
unsigned ExtractIdx = UINT_MAX;
|
|
|
|
if (!UseSplat) {
|
|
|
|
// If we cannot determine where the constant must be, we have to
|
|
|
|
// use a splat constant.
|
|
|
|
Value *ValExtractIdx = Transition->getOperand(getTransitionIdx());
|
|
|
|
if (ConstantInt *CstVal = dyn_cast<ConstantInt>(ValExtractIdx))
|
|
|
|
ExtractIdx = CstVal->getSExtValue();
|
|
|
|
else
|
|
|
|
UseSplat = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
unsigned End = getTransitionType()->getVectorNumElements();
|
|
|
|
if (UseSplat)
|
|
|
|
return ConstantVector::getSplat(End, Val);
|
|
|
|
|
|
|
|
SmallVector<Constant *, 4> ConstVec;
|
|
|
|
UndefValue *UndefVal = UndefValue::get(Val->getType());
|
|
|
|
for (unsigned Idx = 0; Idx != End; ++Idx) {
|
|
|
|
if (Idx == ExtractIdx)
|
|
|
|
ConstVec.push_back(Val);
|
|
|
|
else
|
|
|
|
ConstVec.push_back(UndefVal);
|
|
|
|
}
|
|
|
|
return ConstantVector::get(ConstVec);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Check if promoting to a vector type an operand at \p OperandIdx
|
|
|
|
/// in \p Use can trigger undefined behavior.
|
|
|
|
static bool canCauseUndefinedBehavior(const Instruction *Use,
|
|
|
|
unsigned OperandIdx) {
|
|
|
|
// It is not safe to introduce undef when the operand is on
|
|
|
|
// the right hand side of a division-like instruction.
|
|
|
|
if (OperandIdx != 1)
|
|
|
|
return false;
|
|
|
|
switch (Use->getOpcode()) {
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
case Instruction::SDiv:
|
|
|
|
case Instruction::UDiv:
|
|
|
|
case Instruction::SRem:
|
|
|
|
case Instruction::URem:
|
|
|
|
return true;
|
|
|
|
case Instruction::FDiv:
|
|
|
|
case Instruction::FRem:
|
|
|
|
return !Use->hasNoNaNs();
|
|
|
|
}
|
|
|
|
llvm_unreachable(nullptr);
|
|
|
|
}
|
|
|
|
|
|
|
|
public:
|
|
|
|
VectorPromoteHelper(const DataLayout &DL, const TargetLowering &TLI,
|
|
|
|
const TargetTransformInfo &TTI, Instruction *Transition,
|
|
|
|
unsigned CombineCost)
|
|
|
|
: DL(DL), TLI(TLI), TTI(TTI), Transition(Transition),
|
|
|
|
StoreExtractCombineCost(CombineCost), CombineInst(nullptr) {
|
|
|
|
assert(Transition && "Do not know how to promote null");
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Check if we can promote \p ToBePromoted to the transition type.
|
|
|
|
bool canPromote(const Instruction *ToBePromoted) const {
|
|
|
|
// We could support CastInst too.
|
|
|
|
return isa<BinaryOperator>(ToBePromoted);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Check if it is profitable to promote \p ToBePromoted
|
|
|
|
/// by moving the transition downward through it.
|
|
|
|
bool shouldPromote(const Instruction *ToBePromoted) const {
|
|
|
|
// Promote only if all the operands can be statically expanded.
|
|
|
|
// Indeed, we do not want to introduce any new kind of transitions.
|
|
|
|
for (const Use &U : ToBePromoted->operands()) {
|
|
|
|
const Value *Val = U.get();
|
|
|
|
if (Val == getEndOfTransition()) {
|
|
|
|
// If the use is a division and the transition is on the rhs,
|
|
|
|
// we cannot promote the operation, otherwise we may create a
|
|
|
|
// division by zero.
|
|
|
|
if (canCauseUndefinedBehavior(ToBePromoted, U.getOperandNo()))
|
|
|
|
return false;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (!isa<ConstantInt>(Val) && !isa<UndefValue>(Val) &&
|
|
|
|
!isa<ConstantFP>(Val))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
// Check that the resulting operation is legal.
|
|
|
|
int ISDOpcode = TLI.InstructionOpcodeToISD(ToBePromoted->getOpcode());
|
|
|
|
if (!ISDOpcode)
|
|
|
|
return false;
|
|
|
|
return StressStoreExtract ||
|
|
|
|
TLI.isOperationLegalOrCustom(
|
|
|
|
ISDOpcode, TLI.getValueType(DL, getTransitionType(), true));
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Check whether or not \p Use can be combined
|
|
|
|
/// with the transition.
|
|
|
|
/// I.e., is it possible to do Use(Transition) => AnotherUse?
|
|
|
|
bool canCombine(const Instruction *Use) { return isa<StoreInst>(Use); }
|
|
|
|
|
|
|
|
/// \brief Record \p ToBePromoted as part of the chain to be promoted.
|
|
|
|
void enqueueForPromotion(Instruction *ToBePromoted) {
|
|
|
|
InstsToBePromoted.push_back(ToBePromoted);
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Set the instruction that will be combined with the transition.
|
|
|
|
void recordCombineInstruction(Instruction *ToBeCombined) {
|
|
|
|
assert(canCombine(ToBeCombined) && "Unsupported instruction to combine");
|
|
|
|
CombineInst = ToBeCombined;
|
|
|
|
}
|
|
|
|
|
|
|
|
/// \brief Promote all the instructions enqueued for promotion if it is
|
|
|
|
/// profitable.
|
|
|
|
/// \return True if the promotion happened, false otherwise.
|
|
|
|
bool promote() {
|
|
|
|
// Check if there is something to promote.
|
|
|
|
// Right now, if we do not have anything to combine with,
|
|
|
|
// we assume the promotion is not profitable.
|
|
|
|
if (InstsToBePromoted.empty() || !CombineInst)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// Check cost.
|
|
|
|
if (!StressStoreExtract && !isProfitableToPromote())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// Promote.
|
|
|
|
for (auto &ToBePromoted : InstsToBePromoted)
|
|
|
|
promoteImpl(ToBePromoted);
|
|
|
|
InstsToBePromoted.clear();
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
} // End of anonymous namespace.
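// Illustration only, not part of the pass: a minimal standalone sketch of the
// cost comparison performed by isProfitableToPromote() above. OpCost,
// TransitionCost, CombineCost, and isPromotionProfitable are hypothetical
// names introduced for this sketch; the pass itself obtains these numbers from
// TargetTransformInfo and from canCombineStoreAndExtract.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one arithmetic instruction in the chain: the
// cost a target might report for its scalar form and for its vector form.
struct OpCost {
  uint64_t Scalar;
  uint64_t Vector;
};

// The scalar chain starts with the cost of the extract (TransitionCost); the
// vector chain starts with the cost of the combined store (CombineCost).
// Each instruction in the chain then adds its cost to both sides, and
// promotion is chosen only when the vector chain is strictly cheaper.
bool isPromotionProfitable(uint64_t TransitionCost, uint64_t CombineCost,
                           const std::vector<OpCost> &Chain) {
  uint64_t ScalarCost = TransitionCost;
  uint64_t VectorCost = CombineCost;
  for (const OpCost &Op : Chain) {
    ScalarCost += Op.Scalar;
    VectorCost += Op.Vector;
  }
  return ScalarCost > VectorCost;
}
```

// With made-up numbers (extract cost 20, combine cost 1, one instruction that
// costs 1 in both domains), the scalar chain costs 21 and the vector chain
// costs 2, so the sketch reports the promotion as profitable.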
|
|
|
|
|
|
|
|
void VectorPromoteHelper::promoteImpl(Instruction *ToBePromoted) {
|
|
|
|
// At this point, we know that all the operands of ToBePromoted but Def
|
|
|
|
// can be statically promoted.
|
|
|
|
// For Def, we need to use its parameter in ToBePromoted:
|
|
|
|
// b = ToBePromoted ty1 a
|
|
|
|
// Def = Transition ty1 b to ty2
|
|
|
|
// Move the transition down.
|
|
|
|
// 1. Replace all uses of the promoted operation by the transition.
|
|
|
|
// = ... b => = ... Def.
|
|
|
|
assert(ToBePromoted->getType() == Transition->getType() &&
|
|
|
|
"The type of the result of the transition does not match "
|
|
|
|
"the final type");
|
|
|
|
ToBePromoted->replaceAllUsesWith(Transition);
|
|
|
|
// 2. Update the type of the uses.
|
|
|
|
// b = ToBePromoted ty2 Def => b = ToBePromoted ty1 Def.
|
|
|
|
Type *TransitionTy = getTransitionType();
|
|
|
|
ToBePromoted->mutateType(TransitionTy);
|
|
|
|
// 3. Update all the operands of the promoted operation with promoted
|
|
|
|
// operands.
|
|
|
|
// b = ToBePromoted ty1 Def => b = ToBePromoted ty1 a.
|
|
|
|
for (Use &U : ToBePromoted->operands()) {
|
|
|
|
Value *Val = U.get();
|
|
|
|
Value *NewVal = nullptr;
|
|
|
|
if (Val == Transition)
|
|
|
|
NewVal = Transition->getOperand(getTransitionOriginalValueIdx());
|
|
|
|
else if (isa<UndefValue>(Val) || isa<ConstantInt>(Val) ||
|
|
|
|
isa<ConstantFP>(Val)) {
|
|
|
|
// Use a splat constant if it is not safe to use undef.
|
|
|
|
NewVal = getConstantVector(
|
|
|
|
cast<Constant>(Val),
|
|
|
|
isa<UndefValue>(Val) ||
|
|
|
|
canCauseUndefinedBehavior(ToBePromoted, U.getOperandNo()));
|
|
|
|
} else
|
|
|
|
llvm_unreachable("Did you modify shouldPromote and forget to update "
|
|
|
|
"this?");
|
|
|
|
ToBePromoted->setOperand(U.getOperandNo(), NewVal);
|
|
|
|
}
|
|
|
|
Transition->removeFromParent();
|
|
|
|
Transition->insertAfter(ToBePromoted);
|
|
|
|
Transition->setOperand(getTransitionOriginalValueIdx(), ToBePromoted);
|
|
|
|
}
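// Illustration only, not part of the pass: a minimal standalone sketch of how
// a constant operand is widened when an instruction is promoted, following
// the splat-versus-undef choice documented on getConstantVector(). Lane and
// widenConstant are hypothetical names; the pass builds a ConstantVector
// instead.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one vector lane: either undef or a concrete value.
struct Lane {
  bool IsUndef;
  int Value;
};

// Builds the lanes of the widened constant operand. With UseSplat the value
// is replicated in every lane; otherwise only the lane read by the extract
// is defined and every other lane stays undef.
std::vector<Lane> widenConstant(int Val, std::size_t NumElts,
                                std::size_t ExtractIdx, bool UseSplat) {
  std::vector<Lane> Lanes(NumElts, Lane{true, 0});
  for (std::size_t Idx = 0; Idx != NumElts; ++Idx)
    if (UseSplat || Idx == ExtractIdx)
      Lanes[Idx] = Lane{false, Val};
  return Lanes;
}
```

// The splat form matters for the right-hand side of a division-like
// instruction: every lane then holds a defined divisor, so the lanes that
// are never stored cannot divide by undef.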
|
|
|
|
|
|
|
|
/// Some targets can do store(extractelement) with one instruction.
|
|
|
|
/// Try to push the extractelement towards the stores when the target
|
|
|
|
/// has this feature and this is profitable.
|
|
|
|
bool CodeGenPrepare::optimizeExtractElementInst(Instruction *Inst) {
|
|
|
|
unsigned CombineCost = UINT_MAX;
|
|
|
|
if (DisableStoreExtract || !TLI ||
|
|
|
|
(!StressStoreExtract &&
|
|
|
|
!TLI->canCombineStoreAndExtract(Inst->getOperand(0)->getType(),
|
|
|
|
Inst->getOperand(1), CombineCost)))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
// At this point we know that Inst is a vector to scalar transition.
|
|
|
|
// Try to move it down the def-use chain, until:
|
|
|
|
// - We can combine the transition with its single use
|
|
|
|
// => we got rid of the transition.
|
|
|
|
// - We escape the current basic block
|
|
|
|
// => we would need to check that we are moving it to a cheaper place and
|
|
|
|
// we do not do that for now.
|
|
|
|
BasicBlock *Parent = Inst->getParent();
|
|
|
|
DEBUG(dbgs() << "Found an interesting transition: " << *Inst << '\n');
|
|
|
|
VectorPromoteHelper VPH(*DL, *TLI, *TTI, Inst, CombineCost);
|
[CodeGenPrepare] Move extractelement close to store if they can be combined.
This patch adds an optimization in CodeGenPrepare to move an extractelement
right before a store when the target can combine them.
The optimization may promote any scalar operations to vector operations in the
way to make that possible.
** Context **
Some targets use different register files for both vector and scalar operations.
This means that transitioning from one domain to another may incur copy from one
register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR
move is 20 cycles.
** Motivating Example **
Let us consider an example:
define void @foo(<2 x i32>* %addr1, i32* %dest) {
%in1 = load <2 x i32>* %addr1, align 8
%extract = extractelement <2 x i32> %in1, i32 1
%out = or i32 %extract, 1
store i32 %out, i32* %dest, align 4
ret void
}
As it is, this IR generates the following assembly on armv7:
vldr d16, [r0] @vector load
vmov.32 r0, d16[1] @ cross-register-file copy: 20 cycles
orr r0, r0, #1 @ scalar bitwise or
str r0, [r1] @ scalar store
bx lr
Whereas we could generate much faster code:
vldr d16, [r0] @ vector load
vorr.i32 d16, #0x1 @ vector bitwise or
vst1.32 {d16[1]}, [r1:32] @ vector extract + store
bx lr
Half of the computation made in the vector is useless, but this allows to get
rid of the expensive cross-register-file copy.
** Proposed Solution **
To avoid this cross-register-copy penalty, we promote the scalar operations to
vector operations. The penalty will be removed if we manage to promote the whole
chain of computation in the vector domain.
Currently, we do that only when the chain of computation ends by a store and the
target is able to combine an extract with a store.
Stores are the most likely candidates, because other instructions produce values
that would need to be promoted and so, extracted at some point[1]. Moreover,
this is customary that targets feature stores that perform a vector extract (see
AArch64 and X86 for instance).
The proposed implementation relies on the TargetTransformInfo to decide whether
or not it is beneficial to promote a chain of computation in the vector domain.
Unfortunately, this interface is rather inaccurate for this level of details and
although this optimization may be beneficial for X86 and AArch64, the inaccuracy
will lead to the optimization being too aggressive.
Basically in TargetTransformInfo, everything that is legal has a cost of 1,
whereas, even if a vector type is legal, usually a vector operation is slightly
more expensive than its scalar counterpart. That will lead to too many
promotions that may not be counter balanced by the saving of the
cross-register-file copy. For instance, on AArch64 this penalty is just 4
cycles.
For now, the optimization is just enabled for ARM prior to v8, since those
processors have a larger penalty on cross-register-file copies, and the scope is
limited to basic blocks. Because of these two factors, we limit the effects of
the inaccuracy. Indeed, I did not want to build up a fancy cost model with block
frequency and everything on top of that.
[1] We can imagine targets that can combine an extractelement with other
instructions than just stores. If we want to go into that direction, the current
interfaces must be augmented and, moreover, I think this becomes a global isel
problem.
Differential Revision: http://reviews.llvm.org/D5921
<rdar://problem/14170854>
llvm-svn: 220978
2014-11-01 01:52:53 +08:00
|
|
|
// If the transition has more than one use, assume this is not going to be
|
|
|
|
// beneficial.
|
|
|
|
while (Inst->hasOneUse()) {
|
|
|
|
Instruction *ToBePromoted = cast<Instruction>(*Inst->user_begin());
|
|
|
|
DEBUG(dbgs() << "Use: " << *ToBePromoted << '\n');
|
|
|
|
|
|
|
|
if (ToBePromoted->getParent() != Parent) {
|
|
|
|
DEBUG(dbgs() << "Instruction to promote is in a different block ("
|
|
|
|
<< ToBePromoted->getParent()->getName()
|
|
|
|
<< ") than the transition (" << Parent->getName() << ").\n");
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (VPH.canCombine(ToBePromoted)) {
|
|
|
|
DEBUG(dbgs() << "Assume " << *Inst << '\n'
|
|
|
|
<< "will be combined with: " << *ToBePromoted << '\n');
|
|
|
|
VPH.recordCombineInstruction(ToBePromoted);
|
|
|
|
bool Changed = VPH.promote();
|
|
|
|
NumStoreExtractExposed += Changed;
|
|
|
|
return Changed;
|
|
|
|
}
|
|
|
|
|
|
|
|
DEBUG(dbgs() << "Try promoting.\n");
|
|
|
|
if (!VPH.canPromote(ToBePromoted) || !VPH.shouldPromote(ToBePromoted))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
DEBUG(dbgs() << "Promoting is possible... Enqueue for promotion!\n");
|
|
|
|
|
|
|
|
VPH.enqueueForPromotion(ToBePromoted);
|
|
|
|
Inst = ToBePromoted;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
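
The walk performed by the loop above can be illustrated outside of LLVM. The following is a minimal sketch, not the pass's implementation: `ToyInst` and `walkSingleUseChain` are hypothetical names, and the `IsStore` and `Promotable` flags stand in for what `VPH.canCombine` and `VPH.canPromote`/`VPH.shouldPromote` decide in the real code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR instruction; this is not an LLVM type.
struct ToyInst {
  std::string Name;
  int Block;                     // id of the containing basic block
  bool IsStore;                  // assumed: target can fold an extract into it
  bool Promotable;               // assumed: legal and profitable as a vector op
  std::vector<ToyInst *> Users;  // def-use edges
};

// Walks down the single-use chain that starts at the transition.
// Returns true when the chain reaches a combinable store in the same block;
// Chain then holds the instructions that would be promoted. The contents of
// Chain are only meaningful when the function returns true.
bool walkSingleUseChain(ToyInst *Transition, std::vector<ToyInst *> &Chain) {
  const int Parent = Transition->Block;
  ToyInst *Inst = Transition;
  while (Inst->Users.size() == 1) {
    ToyInst *User = Inst->Users.front();
    if (User->Block != Parent)
      return false;          // escaped the basic block: give up
    if (User->IsStore)
      return true;           // the transition can be combined with the store
    if (!User->Promotable)
      return false;          // the chain cannot stay in the vector domain
    Chain.push_back(User);   // enqueue for promotion and keep walking
    Inst = User;
  }
  return false;              // zero or several uses: assume not beneficial
}
```

With a chain `extract -> or -> store` in one block, the sketch returns true and enqueues only the `or`, which mirrors the motivating example in the commit message.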
|
|
|
|
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeInst(Instruction *I, bool& ModifiedDT) {
|
2015-06-18 04:44:32 +08:00
|
|
|
// Bail out if we inserted the instruction to prevent optimizations from
|
|
|
|
// stepping on each other's toes.
|
|
|
|
if (InsertedInsts.count(I))
|
|
|
|
return false;
|
|
|
|
|
2011-01-06 10:37:26 +08:00
|
|
|
if (PHINode *P = dyn_cast<PHINode>(I)) {
|
|
|
|
// It is possible for very late stage optimizations (such as SimplifyCFG)
|
|
|
|
// to introduce PHI nodes too late to be cleaned up. If we detect such a
|
|
|
|
// trivial PHI, go ahead and zap it here.
|
2015-07-08 02:45:17 +08:00
|
|
|
if (Value *V = SimplifyInstruction(P, *DL, TLInfo, nullptr)) {
|
2011-01-06 10:37:26 +08:00
|
|
|
P->replaceAllUsesWith(V);
|
|
|
|
P->eraseFromParent();
|
|
|
|
++NumPHIsElim;
|
2011-01-15 15:29:01 +08:00
|
|
|
return true;
|
2011-01-06 10:37:26 +08:00
|
|
|
}
|
2011-01-15 15:29:01 +08:00
|
|
|
return false;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (CastInst *CI = dyn_cast<CastInst>(I)) {
|
2011-01-06 10:37:26 +08:00
|
|
|
// If the source of the cast is a constant, then this should have
|
|
|
|
// already been constant folded. The only reason NOT to constant fold
|
|
|
|
// it is if something (e.g. LSR) was careful to place the constant
|
|
|
|
// evaluation in a block other than the one that uses it (e.g. to hoist
|
|
|
|
// the address of globals out of a loop). If this is the case, we don't
|
|
|
|
// want to forward-subst the cast.
|
|
|
|
if (isa<Constant>(CI->getOperand(0)))
|
|
|
|
return false;
|
|
|
|
|
2015-07-09 10:09:04 +08:00
|
|
|
if (TLI && OptimizeNoopCopyExpression(CI, *TLI, *DL))
|
2011-01-15 15:29:01 +08:00
|
|
|
return true;
|
2011-01-06 10:37:26 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (isa<ZExtInst>(I) || isa<SExtInst>(I)) {
|
2014-03-13 21:36:25 +08:00
|
|
|
/// Sink a zext or sext into its user blocks if the target type doesn't
|
|
|
|
/// fit in one register
|
2015-07-09 10:09:04 +08:00
|
|
|
if (TLI &&
|
|
|
|
TLI->getTypeAction(CI->getContext(),
|
|
|
|
TLI->getValueType(*DL, CI->getType())) ==
|
|
|
|
TargetLowering::TypeExpandInteger) {
|
2014-03-13 21:36:25 +08:00
|
|
|
return SinkCast(CI);
|
|
|
|
} else {
|
2015-09-22 07:03:16 +08:00
|
|
|
bool MadeChange = moveExtToFormExtLoad(I);
|
|
|
|
return MadeChange | optimizeExtUses(I);
|
2014-03-13 21:36:25 +08:00
|
|
|
}
|
2011-01-06 10:37:26 +08:00
|
|
|
}
|
2011-01-15 15:29:01 +08:00
|
|
|
return false;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (CmpInst *CI = dyn_cast<CmpInst>(I))
|
2014-01-03 05:13:43 +08:00
|
|
|
if (!TLI || !TLI->hasMultipleConditionRegisters())
|
|
|
|
return OptimizeCmpExpression(CI);
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
|
2015-09-16 02:32:14 +08:00
|
|
|
stripInvariantGroupMetadata(*LI);
|
2015-06-05 00:17:38 +08:00
|
|
|
if (TLI) {
|
|
|
|
unsigned AS = LI->getPointerAddressSpace();
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeMemoryInst(I, I->getOperand(0), LI->getType(), AS);
|
2015-06-05 00:17:38 +08:00
|
|
|
}
|
2012-10-30 19:23:25 +08:00
|
|
|
return false;
|
2011-01-15 15:29:01 +08:00
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (StoreInst *SI = dyn_cast<StoreInst>(I)) {
|
2015-09-16 02:32:14 +08:00
|
|
|
stripInvariantGroupMetadata(*SI);
|
2015-06-05 00:17:38 +08:00
|
|
|
if (TLI) {
|
|
|
|
unsigned AS = SI->getPointerAddressSpace();
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeMemoryInst(I, SI->getOperand(1),
|
2015-06-05 00:17:38 +08:00
|
|
|
SI->getOperand(0)->getType(), AS);
|
|
|
|
}
|
2011-01-15 15:29:01 +08:00
|
|
|
return false;
|
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2014-04-22 03:34:27 +08:00
|
|
|
BinaryOperator *BinOp = dyn_cast<BinaryOperator>(I);
|
|
|
|
|
|
|
|
if (BinOp && (BinOp->getOpcode() == Instruction::AShr ||
|
|
|
|
BinOp->getOpcode() == Instruction::LShr)) {
|
|
|
|
ConstantInt *CI = dyn_cast<ConstantInt>(BinOp->getOperand(1));
|
|
|
|
if (TLI && CI && TLI->hasExtractBitsInsn())
|
2015-07-09 10:09:04 +08:00
|
|
|
return OptimizeExtractBits(BinOp, CI, *TLI, *DL);
|
2014-04-22 03:34:27 +08:00
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(I)) {
|
2011-01-06 10:44:52 +08:00
|
|
|
if (GEPI->hasAllZeroIndices()) {
|
|
|
|
/// The GEP operand must be a pointer, so must its result -> BitCast
|
|
|
|
Instruction *NC = new BitCastInst(GEPI->getOperand(0), GEPI->getType(),
|
|
|
|
GEPI->getName(), GEPI);
|
|
|
|
GEPI->replaceAllUsesWith(NC);
|
|
|
|
GEPI->eraseFromParent();
|
|
|
|
++NumGEPsElim;
|
2015-09-22 07:03:16 +08:00
|
|
|
optimizeInst(NC, ModifiedDT);
|
2011-01-15 15:29:01 +08:00
|
|
|
return true;
|
2011-01-06 10:44:52 +08:00
|
|
|
}
|
2011-01-15 15:29:01 +08:00
|
|
|
return false;
|
2011-01-06 10:37:26 +08:00
|
|
|
}
|
2012-07-24 18:51:42 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
if (CallInst *CI = dyn_cast<CallInst>(I))
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeCallInst(CI, ModifiedDT);
|
2011-01-06 10:37:26 +08:00
|
|
|
|
CodeGenPrepare: Add a transform to turn selects into branches in some cases.
This came up when a change in block placement formed a cmov and slowed down a
hot loop by 50%:
ucomisd (%rdi), %xmm0
cmovbel %edx, %esi
cmov is a really bad choice in this context because it doesn't get branch
prediction. If we emit it as a branch, an out-of-order CPU can do a better job
(if the branch is predicted right) and avoid waiting for the slow load+compare
instruction to finish. Of course it won't help if the branch is unpredictable,
but those are really rare in practice.
This patch uses a dumb conservative heuristic, it turns all cmovs that have one
use and a direct memory operand into branches. cmovs usually save some code
size, so we disable the transform in -Os mode. In-Order architectures are
unlikely to benefit as well, those are included in the
"predictableSelectIsExpensive" flag.
It would be better to reuse branch probability info here, but BPI doesn't
support select instructions currently. It would make sense to use the same
heuristics as the if-converter pass, which does the opposite direction of this
transform.
Test suite shows a small improvement here and there on corei7-level machines,
but the actual results depend a lot on the used microarchitecture. The
transformation is currently disabled by default and available by passing the
-enable-cgp-select2branch flag to the code generator.
Thanks to Chandler for the initial test case to him and Evan Cheng for providing
me with comments and test-suite numbers that were more stable than mine :)
llvm-svn: 156234
2012-05-05 20:49:22 +08:00
|
|
|
if (SelectInst *SI = dyn_cast<SelectInst>(I))
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeSelectInst(SI);
|
2012-05-05 20:49:22 +08:00
|
|
|
|
2014-02-19 18:02:43 +08:00
|
|
|
if (ShuffleVectorInst *SVI = dyn_cast<ShuffleVectorInst>(I))
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeShuffleVectorInst(SVI);
|
2014-02-19 18:02:43 +08:00
|
|
|
|
2014-11-01 01:52:53 +08:00
|
|
|
if (isa<ExtractElementInst>(I))
|
2015-09-22 07:03:16 +08:00
|
|
|
return optimizeExtractElementInst(I);
|
2014-11-01 01:52:53 +08:00
|
|
|
|
2011-01-15 15:29:01 +08:00
|
|
|
return false;
|
2011-01-06 10:37:26 +08:00
|
|
|
}
|
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
// In this pass we look for GEP and cast instructions that are used
|
|
|
|
// across basic blocks and rewrite them to improve basic-block-at-a-time
|
|
|
|
// selection.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::optimizeBlock(BasicBlock &BB, bool& ModifiedDT) {
|
2011-01-06 08:42:50 +08:00
|
|
|
SunkAddrs.clear();
|
2011-03-02 11:31:46 +08:00
|
|
|
bool MadeChange = false;
|
2008-09-24 13:32:41 +08:00
|
|
|
|
2011-01-15 15:14:54 +08:00
|
|
|
CurInstIterator = BB.begin();
|
2014-12-28 16:54:45 +08:00
|
|
|
while (CurInstIterator != BB.end()) {
|
2015-10-10 02:44:40 +08:00
|
|
|
MadeChange |= optimizeInst(&*CurInstIterator++, ModifiedDT);
|
2014-12-28 16:54:45 +08:00
|
|
|
if (ModifiedDT)
|
|
|
|
return true;
|
|
|
|
}
|
2015-09-22 07:03:16 +08:00
|
|
|
MadeChange |= dupRetToEnableTailCallOpts(&BB);
|
2012-11-24 03:17:06 +08:00
|
|
|
|
2007-03-31 12:06:36 +08:00
|
|
|
return MadeChange;
|
|
|
|
}
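
The loop above advances `CurInstIterator` before the current instruction is handed to `optimizeInst`, which presumably keeps the iterator valid if that instruction is erased during processing. A minimal sketch of the same idiom on a `std::list`, assuming nothing about LLVM's own containers (`visitAndErase` is a hypothetical name):

```cpp
#include <cassert>
#include <list>

// Removes every even value from Block while visiting each element once.
// Returns true if anything was removed.
bool visitAndErase(std::list<int> &Block) {
  bool MadeChange = false;
  std::list<int>::iterator Cur = Block.begin();
  while (Cur != Block.end()) {
    // Post-increment: Cur already points at the next element before the
    // current one is processed, so erasing the current one is safe.
    std::list<int>::iterator Visited = Cur++;
    if (*Visited % 2 == 0) {
      Block.erase(Visited);
      MadeChange = true;
    }
  }
  return MadeChange;
}
```

Erasing through `Visited` never touches `Cur`, because `std::list::erase` only invalidates iterators to the erased element.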
|
2011-08-18 08:50:51 +08:00
|
|
|
|
|
|
|
// If llvm.dbg.value is far away from the value then iSel may not be able
|
2012-07-24 18:51:42 +08:00
|
|
|
// to handle it properly. iSel will drop llvm.dbg.value if it cannot
|
2011-08-18 08:50:51 +08:00
|
|
|
// find a node corresponding to the value.
|
2015-09-22 07:03:16 +08:00
|
|
|
bool CodeGenPrepare::placeDbgValues(Function &F) {
|
2011-08-18 08:50:51 +08:00
|
|
|
bool MadeChange = false;
|
2015-01-09 04:44:33 +08:00
|
|
|
for (BasicBlock &BB : F) {
|
2014-04-14 08:51:57 +08:00
|
|
|
Instruction *PrevNonDbgInst = nullptr;
|
2015-01-09 04:44:33 +08:00
|
|
|
for (BasicBlock::iterator BI = BB.begin(), BE = BB.end(); BI != BE;) {
|
2015-10-10 02:44:40 +08:00
|
|
|
Instruction *Insn = &*BI++;
|
2011-08-18 08:50:51 +08:00
|
|
|
DbgValueInst *DVI = dyn_cast<DbgValueInst>(Insn);
|
2014-04-26 04:49:25 +08:00
|
|
|
// Leave dbg.values that refer to an alloca alone. These
|
|
|
|
// intrinsics describe the address of a variable (= the alloca)
|
|
|
|
// being taken. They should not be moved next to the alloca
|
|
|
|
// (and to the beginning of the scope), but rather stay close to
|
|
|
|
// where said address is used.
|
|
|
|
if (!DVI || (DVI->getValue() && isa<AllocaInst>(DVI->getValue()))) {
|
2011-08-18 08:50:51 +08:00
|
|
|
PrevNonDbgInst = Insn;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
Instruction *VI = dyn_cast_or_null<Instruction>(DVI->getValue());
|
|
|
|
if (VI && VI != PrevNonDbgInst && !VI->isTerminator()) {
|
|
|
|
DEBUG(dbgs() << "Moving Debug Value before :\n" << *DVI << ' ' << *VI);
|
|
|
|
DVI->removeFromParent();
|
|
|
|
if (isa<PHINode>(VI))
|
2015-10-10 02:44:40 +08:00
|
|
|
DVI->insertBefore(&*VI->getParent()->getFirstInsertionPt());
|
2011-08-18 08:50:51 +08:00
|
|
|
else
|
|
|
|
DVI->insertAfter(VI);
|
|
|
|
MadeChange = true;
|
|
|
|
++NumDbgValueMoved;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return MadeChange;
|
|
|
|
}
|
2014-03-29 16:22:29 +08:00
|
|
|
|
|
|
|
// If there is a sequence that branches based on comparing a single bit
|
|
|
|
// against zero that can be combined into a single instruction, and the
|
|
|
|
// target supports folding these into a single instruction, sink the
|
|
|
|
// mask and compare into the branch uses. Do this before optimizeBlock ->
|
|
|
|
// optimizeInst -> OptimizeCmpExpression, which perturbs the pattern being
|
|
|
|
// searched for.
|
|
|
|
bool CodeGenPrepare::sinkAndCmp(Function &F) {
|
|
|
|
if (!EnableAndCmpSinking)
|
|
|
|
return false;
|
|
|
|
if (!TLI || !TLI->isMaskAndBranchFoldingLegal())
|
|
|
|
return false;
|
|
|
|
bool MadeChange = false;
|
|
|
|
for (Function::iterator I = F.begin(), E = F.end(); I != E; ) {
|
2015-10-10 02:44:40 +08:00
|
|
|
BasicBlock *BB = &*I++;
|
2014-03-29 16:22:29 +08:00
|
|
|
|
|
|
|
// Does this BB end with the following?
|
|
|
|
// %andVal = and %val, #single-bit-set
|
|
|
|
// %icmpVal = icmp %andVal, 0
|
|
|
|
// br i1 %icmpVal, label %dest1, label %dest2
|
|
|
|
BranchInst *Brcc = dyn_cast<BranchInst>(BB->getTerminator());
|
|
|
|
if (!Brcc || !Brcc->isConditional())
|
|
|
|
continue;
|
|
|
|
ICmpInst *Cmp = dyn_cast<ICmpInst>(Brcc->getOperand(0));
|
|
|
|
if (!Cmp || Cmp->getParent() != BB)
|
|
|
|
continue;
|
|
|
|
ConstantInt *Zero = dyn_cast<ConstantInt>(Cmp->getOperand(1));
|
|
|
|
if (!Zero || !Zero->isZero())
|
|
|
|
continue;
|
|
|
|
Instruction *And = dyn_cast<Instruction>(Cmp->getOperand(0));
|
|
|
|
if (!And || And->getOpcode() != Instruction::And || And->getParent() != BB)
|
|
|
|
continue;
|
|
|
|
ConstantInt* Mask = dyn_cast<ConstantInt>(And->getOperand(1));
|
|
|
|
if (!Mask || !Mask->getUniqueInteger().isPowerOf2())
|
|
|
|
continue;
|
|
|
|
DEBUG(dbgs() << "found and; icmp ?,0; brcc\n"); DEBUG(BB->dump());
|
|
|
|
|
|
|
|
// Push the "and; icmp" for any users that are conditional branches.
|
|
|
|
// Since there can only be one branch use per BB, we don't need to keep
|
|
|
|
// track of which BBs we insert into.
|
|
|
|
for (Value::use_iterator UI = Cmp->use_begin(), E = Cmp->use_end();
|
|
|
|
UI != E; ) {
|
|
|
|
Use &TheUse = *UI;
|
|
|
|
// Find brcc use.
|
|
|
|
BranchInst *BrccUser = dyn_cast<BranchInst>(TheUse.getUser());
|
|
|
|
++UI;
|
|
|
|
if (!BrccUser || !BrccUser->isConditional())
|
|
|
|
continue;
|
|
|
|
BasicBlock *UserBB = BrccUser->getParent();
|
|
|
|
if (UserBB == BB) continue;
|
|
|
|
DEBUG(dbgs() << "found Brcc use\n");
|
|
|
|
|
|
|
|
// Sink the "and; icmp" to use.
|
|
|
|
MadeChange = true;
|
|
|
|
BinaryOperator *NewAnd =
|
|
|
|
BinaryOperator::CreateAnd(And->getOperand(0), And->getOperand(1), "",
|
|
|
|
BrccUser);
|
|
|
|
CmpInst *NewCmp =
|
|
|
|
CmpInst::Create(Cmp->getOpcode(), Cmp->getPredicate(), NewAnd, Zero,
|
|
|
|
"", BrccUser);
|
|
|
|
TheUse = NewCmp;
|
|
|
|
++NumAndCmpsMoved;
|
|
|
|
DEBUG(BrccUser->getParent()->dump());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return MadeChange;
|
|
|
|
}
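
The pattern matched above requires the `and` mask to have exactly one bit set, so that the `and` plus compare-against-zero amounts to a single-bit test. A minimal sketch of that condition on plain integers (both function names are hypothetical; the pass itself uses the power-of-two query on the constant's value):

```cpp
#include <cassert>
#include <cstdint>

// True when exactly one bit of Mask is set, i.e. Mask is a power of two.
bool isSingleBitMask(uint64_t Mask) {
  return Mask != 0 && (Mask & (Mask - 1)) == 0;
}

// Models "and %val, Mask" followed by a compare against zero: reports
// whether the masked bit is set in Val.
bool testBit(uint64_t Val, uint64_t Mask) {
  return (Val & Mask) != 0;
}
```

A mask such as 6 is rejected because it covers two bits, and a target's test-bit-and-branch instruction can typically examine only one.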
|
[CodeGenPrepare] Split branch conditions into multiple conditional branches.
This optimization transforms code like:
bb1:
%0 = icmp ne i32 %a, 0
%1 = icmp ne i32 %b, 0
%or.cond = or i1 %0, %1
br i1 %or.cond, label %TrueBB, label %FalseBB
into multiple branch instructions like:
bb1:
%0 = icmp ne i32 %a, 0
br i1 %0, label %TrueBB, label %bb2
bb2:
%1 = icmp ne i32 %b, 0
br i1 %1, label %TrueBB, label %FalseBB
This optimization is already performed by SelectionDAG, but not by FastISel.
FastISel cannot perform this optimization, because it cannot generate new
MachineBasicBlocks.
Performing this optimization at CodeGenPrepare time makes it available to both -
SelectionDAG and FastISel - and the implementation in SelectionDAG could be
removed. There are currently a few differences in codegen for X86 and PPC, so
this commit only enables it for FastISel.
Reviewed by Jim Grosbach
This fixes rdar://problem/19034919.
llvm-svn: 223786
2014-12-10 00:36:13 +08:00
|
|
|
|
2014-12-10 01:32:12 +08:00
|
|
|
/// \brief Retrieve the probabilities of a conditional branch. Returns true on
|
|
|
|
/// success, or returns false if no or invalid metadata was found.
|
|
|
|
static bool extractBranchMetadata(BranchInst *BI,
|
|
|
|
uint64_t &ProbTrue, uint64_t &ProbFalse) {
|
|
|
|
assert(BI->isConditional() &&
|
|
|
|
"Looking for probabilities on unconditional branch?");
|
|
|
|
auto *ProfileData = BI->getMetadata(LLVMContext::MD_prof);
|
|
|
|
if (!ProfileData || ProfileData->getNumOperands() != 3)
|
|
|
|
return false;
|
|
|
|
|
IR: Split Metadata from Value
Split `Metadata` away from the `Value` class hierarchy, as part of
PR21532. Assembly and bitcode changes are in the wings, but this is the
bulk of the change for the IR C++ API.
I have a follow-up patch prepared for `clang`. If this breaks other
sub-projects, I apologize in advance :(. Help me compile it on Darwin
I'll try to fix it. FWIW, the errors should be easy to fix, so it may
be simpler to just fix it yourself.
This breaks the build for all metadata-related code that's out-of-tree.
Rest assured the transition is mechanical and the compiler should catch
almost all of the problems.
Here's a quick guide for updating your code:
- `Metadata` is the root of a class hierarchy with three main classes:
`MDNode`, `MDString`, and `ValueAsMetadata`. It is distinct from
the `Value` class hierarchy. It is typeless -- i.e., instances do
*not* have a `Type`.
- `MDNode`'s operands are all `Metadata *` (instead of `Value *`).
- `TrackingVH<MDNode>` and `WeakVH` referring to metadata can be
replaced with `TrackingMDNodeRef` and `TrackingMDRef`, respectively.
If you're referring solely to resolved `MDNode`s -- post graph
construction -- just use `MDNode*`.
- `MDNode` (and the rest of `Metadata`) have only limited support for
`replaceAllUsesWith()`.
As long as an `MDNode` is pointing at a forward declaration -- the
result of `MDNode::getTemporary()` -- it maintains a side map of its
uses and can RAUW itself. Once the forward declarations are fully
resolved RAUW support is dropped on the ground. This means that
uniquing collisions on changing operands cause nodes to become
"distinct". (This already happened fairly commonly, whenever an
operand went to null.)
If you're constructing complex (non self-reference) `MDNode` cycles,
you need to call `MDNode::resolveCycles()` on each node (or on a
top-level node that somehow references all of the nodes). Also,
don't do that. Metadata cycles (and the RAUW machinery needed to
construct them) are expensive.
- An `MDNode` can only refer to a `Constant` through a bridge called
`ConstantAsMetadata` (one of the subclasses of `ValueAsMetadata`).
As a side effect, accessing an operand of an `MDNode` that is known
to be, e.g., `ConstantInt`, takes three steps: first, cast from
`Metadata` to `ConstantAsMetadata`; second, extract the `Constant`;
third, cast down to `ConstantInt`.
The eventual goal is to introduce `MDInt`/`MDFloat`/etc. and have
metadata schema owners transition away from using `Constant`s when
the type isn't important (and they don't care about referring to
`GlobalValue`s).
In the meantime, I've added transitional API to the `mdconst`
namespace that matches semantics with the old code, in order to
avoid adding the error-prone three-step equivalent to every call
site. If your old code was:
MDNode *N = foo();
bar(isa <ConstantInt>(N->getOperand(0)));
baz(cast <ConstantInt>(N->getOperand(1)));
bak(cast_or_null <ConstantInt>(N->getOperand(2)));
bat(dyn_cast <ConstantInt>(N->getOperand(3)));
bay(dyn_cast_or_null<ConstantInt>(N->getOperand(4)));
you can trivially match its semantics with:
MDNode *N = foo();
bar(mdconst::hasa <ConstantInt>(N->getOperand(0)));
baz(mdconst::extract <ConstantInt>(N->getOperand(1)));
bak(mdconst::extract_or_null <ConstantInt>(N->getOperand(2)));
bat(mdconst::dyn_extract <ConstantInt>(N->getOperand(3)));
bay(mdconst::dyn_extract_or_null<ConstantInt>(N->getOperand(4)));
and when you transition your metadata schema to `MDInt`:
MDNode *N = foo();
bar(isa <MDInt>(N->getOperand(0)));
baz(cast <MDInt>(N->getOperand(1)));
bak(cast_or_null <MDInt>(N->getOperand(2)));
bat(dyn_cast <MDInt>(N->getOperand(3)));
bay(dyn_cast_or_null<MDInt>(N->getOperand(4)));
- A `CallInst` -- specifically, intrinsic instructions -- can refer to
metadata through a bridge called `MetadataAsValue`. This is a
subclass of `Value` where `getType()->isMetadataTy()`.
`MetadataAsValue` is the *only* class that can legally refer to a
`LocalAsMetadata`, which is a bridged form of non-`Constant` values
like `Argument` and `Instruction`. It can also refer to any other
`Metadata` subclass.
(I'll break all your testcases in a follow-up commit, when I propagate
this change to assembly.)
llvm-svn: 223802
2014-12-10 02:38:53 +08:00
|
|
|
const auto *CITrue =
|
|
|
|
mdconst::dyn_extract<ConstantInt>(ProfileData->getOperand(1));
|
|
|
|
const auto *CIFalse =
|
|
|
|
mdconst::dyn_extract<ConstantInt>(ProfileData->getOperand(2));
|
2014-12-10 01:32:12 +08:00
|
|
|
if (!CITrue || !CIFalse)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
ProbTrue = CITrue->getValue().getZExtValue();
|
|
|
|
ProbFalse = CIFalse->getValue().getZExtValue();
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
[CodeGenPrepare] Split branch conditions into multiple conditional branches.
This optimization transforms code like:
bb1:
%0 = icmp ne i32 %a, 0
%1 = icmp ne i32 %b, 0
%or.cond = or i1 %0, %1
br i1 %or.cond, label %TrueBB, label %FalseBB
into a multiple branch instructions like:
bb1:
%0 = icmp ne i32 %a, 0
br i1 %0, label %TrueBB, label %bb2
bb2:
%1 = icmp ne i32 %b, 0
br i1 %1, label %TrueBB, label %FalseBB
This optimization is already performed by SelectionDAG, but not by FastISel.
FastISel cannot perform this optimization, because it cannot generate new
MachineBasicBlocks.
Performing this optimization at CodeGenPrepare time makes it available to both
SelectionDAG and FastISel, and the implementation in SelectionDAG could be
removed. There are currently a few differences in codegen for X86 and PPC, so
this commit only enables it for FastISel.
Reviewed by Jim Grosbach
This fixes rdar://problem/19034919.
llvm-svn: 223786
2014-12-10 00:36:13 +08:00
/// \brief Scale down both weights to fit into uint32_t.
static void scaleWeights(uint64_t &NewTrue, uint64_t &NewFalse) {
  uint64_t NewMax = (NewTrue > NewFalse) ? NewTrue : NewFalse;
  uint32_t Scale = (NewMax / UINT32_MAX) + 1;
  NewTrue = NewTrue / Scale;
  NewFalse = NewFalse / Scale;
}
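
The scaling arithmetic above can be tried in isolation. The following is a minimal sketch that assumes only the C++ standard library; the name `scaleWeightsSketch` is hypothetical and is used here purely for illustration, while the body mirrors `scaleWeights`.

```cpp
#include <cstdint>

// Hypothetical standalone copy of the scaleWeights arithmetic (illustration
// only, not part of CodeGenPrepare). Both weights are divided by the same
// factor so that the larger one fits into uint32_t while the ratio between
// them is roughly preserved. Assumes the larger weight is small enough that
// the factor itself fits into 32 bits, which holds for sums of a few 32-bit
// branch weights.
static void scaleWeightsSketch(uint64_t &NewTrue, uint64_t &NewFalse) {
  uint64_t NewMax = (NewTrue > NewFalse) ? NewTrue : NewFalse;
  uint32_t Scale = (NewMax / UINT32_MAX) + 1;
  NewTrue = NewTrue / Scale;
  NewFalse = NewFalse / Scale;
}
```

For instance, weights 6000000000 and 3 yield a scale factor of 2 and become 3000000000 and 1. Weights that already fit into 32 bits get a factor of 1 and are left unchanged.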

/// \brief Some targets prefer to split a conditional branch like:
/// \code
///   %0 = icmp ne i32 %a, 0
///   %1 = icmp ne i32 %b, 0
///   %or.cond = or i1 %0, %1
///   br i1 %or.cond, label %TrueBB, label %FalseBB
/// \endcode
/// into multiple branch instructions like:
/// \code
///   bb1:
///     %0 = icmp ne i32 %a, 0
///     br i1 %0, label %TrueBB, label %bb2
///   bb2:
///     %1 = icmp ne i32 %b, 0
///     br i1 %1, label %TrueBB, label %FalseBB
/// \endcode
/// This usually allows instruction selection to do even further optimizations
/// and combine the compare with the branch instruction. Currently this is
/// applied for targets which have "cheap" jump instructions.
///
/// FIXME: Remove the (equivalent?) implementation in SelectionDAG.
///
bool CodeGenPrepare::splitBranchCondition(Function &F) {
2015-03-09 09:57:13 +08:00
  if (!TM || !TM->Options.EnableFastISel || !TLI || TLI->isJumpExpensive())
2014-12-10 00:36:13 +08:00
    return false;

  bool MadeChange = false;
  for (auto &BB : F) {
    // Does this BB end with the following?
    //   %cond1 = icmp|fcmp|binary instruction ...
    //   %cond2 = icmp|fcmp|binary instruction ...
    //   %cond.or = or|and i1 %cond1, %cond2
    //   br i1 %cond.or, label %dest1, label %dest2
    BinaryOperator *LogicOp;
    BasicBlock *TBB, *FBB;
    if (!match(BB.getTerminator(), m_Br(m_OneUse(m_BinOp(LogicOp)), TBB, FBB)))
      continue;

2015-09-03 03:23:23 +08:00
    auto *Br1 = cast<BranchInst>(BB.getTerminator());
    if (Br1->getMetadata(LLVMContext::MD_unpredictable))
      continue;

2014-12-10 00:36:13 +08:00
    unsigned Opc;
2014-12-10 01:50:10 +08:00
    Value *Cond1, *Cond2;
    if (match(LogicOp, m_And(m_OneUse(m_Value(Cond1)),
                             m_OneUse(m_Value(Cond2)))))
2014-12-10 00:36:13 +08:00
      Opc = Instruction::And;
2014-12-10 01:50:10 +08:00
    else if (match(LogicOp, m_Or(m_OneUse(m_Value(Cond1)),
                                 m_OneUse(m_Value(Cond2)))))
2014-12-10 00:36:13 +08:00
      Opc = Instruction::Or;
    else
      continue;

    if (!match(Cond1, m_CombineOr(m_Cmp(), m_BinOp())) ||
        !match(Cond2, m_CombineOr(m_Cmp(), m_BinOp())))
      continue;

    DEBUG(dbgs() << "Before branch condition splitting\n"; BB.dump());

    // Create a new BB.
    auto *InsertBefore = std::next(Function::iterator(BB))
        .getNodePtrUnchecked();
    auto TmpBB = BasicBlock::Create(BB.getContext(),
                                    BB.getName() + ".cond.split",
                                    BB.getParent(), InsertBefore);

    // Update the original basic block: use the first condition directly in
    // the branch instruction and remove the no longer needed and/or
    // instruction.
    Br1->setCondition(Cond1);
    LogicOp->eraseFromParent();
2014-12-10 01:50:10 +08:00
2014-12-10 00:36:13 +08:00
    // Depending on the condition we have to either replace the true or the
    // false successor of the original branch instruction.
    if (Opc == Instruction::And)
      Br1->setSuccessor(0, TmpBB);
    else
      Br1->setSuccessor(1, TmpBB);

    // Fill in the new basic block.
    auto *Br2 = IRBuilder<>(TmpBB).CreateCondBr(Cond2, TBB, FBB);
2014-12-10 01:50:10 +08:00
    if (auto *I = dyn_cast<Instruction>(Cond2)) {
      I->removeFromParent();
      I->insertBefore(Br2);
    }
2014-12-10 00:36:13 +08:00

    // Update PHI nodes in both successors. The original BB needs to be
    // replaced in one successor's PHI nodes, because the branch now comes from
    // the newly generated BB (TmpBB). In the other successor we need to add one
    // incoming edge to the PHI nodes, because both branch instructions now
    // target the same successor. Depending on the original branch condition
    // (and/or) we have to swap the successors (TBB, FBB), so that
    // we perform the correct update for the PHI nodes.
    // This doesn't change the successor order of the just created branch
    // instruction (or any other instruction).
    if (Opc == Instruction::Or)
      std::swap(TBB, FBB);

    // Replace the old BB with the new BB.
    for (auto &I : *TBB) {
      PHINode *PN = dyn_cast<PHINode>(&I);
      if (!PN)
        break;
      int i;
      while ((i = PN->getBasicBlockIndex(&BB)) >= 0)
        PN->setIncomingBlock(i, TmpBB);
    }

    // Add another incoming edge from the new BB.
    for (auto &I : *FBB) {
      PHINode *PN = dyn_cast<PHINode>(&I);
      if (!PN)
        break;
      auto *Val = PN->getIncomingValueForBlock(&BB);
      PN->addIncoming(Val, TmpBB);
    }

    // Update the branch weights (from SelectionDAGBuilder::
    // FindMergedConditions).
    if (Opc == Instruction::Or) {
      // Codegen X | Y as:
      // BB1:
      //   jmp_if_X TBB
      //   jmp TmpBB
      // TmpBB:
      //   jmp_if_Y TBB
      //   jmp FBB
      //

      // We have flexibility in setting Prob for BB1 and Prob for TmpBB.
      // The requirement is that
      //   TrueProb for BB1 + (FalseProb for BB1 * TrueProb for TmpBB)
      //     = TrueProb for original BB.
      // Assuming the original weights are A and B, one choice is to set BB1's
      // weights to A and A+2B, and set TmpBB's weights to A and 2B. This choice
      // assumes that
      //   TrueProb for BB1 == FalseProb for BB1 * TrueProb for TmpBB.
      // Another choice is to assume TrueProb for BB1 is equal to TrueProb for
      // TmpBB, but the math is more complicated.
      uint64_t TrueWeight, FalseWeight;
2014-12-10 01:32:12 +08:00
      if (extractBranchMetadata(Br1, TrueWeight, FalseWeight)) {
2014-12-10 00:36:13 +08:00
        uint64_t NewTrueWeight = TrueWeight;
        uint64_t NewFalseWeight = TrueWeight + 2 * FalseWeight;
        scaleWeights(NewTrueWeight, NewFalseWeight);
        // Attach the rescaled weights computed above, not the original ones.
        Br1->setMetadata(LLVMContext::MD_prof, MDBuilder(Br1->getContext())
                         .createBranchWeights(NewTrueWeight, NewFalseWeight));

        NewTrueWeight = TrueWeight;
        NewFalseWeight = 2 * FalseWeight;
        scaleWeights(NewTrueWeight, NewFalseWeight);
        Br2->setMetadata(LLVMContext::MD_prof, MDBuilder(Br2->getContext())
                         .createBranchWeights(NewTrueWeight, NewFalseWeight));
      }
    } else {
      // Codegen X & Y as:
      // BB1:
      //   jmp_if_X TmpBB
      //   jmp FBB
      // TmpBB:
      //   jmp_if_Y TBB
      //   jmp FBB
      //
      //  This requires creation of TmpBB after CurBB.

      // We have flexibility in setting Prob for BB1 and Prob for TmpBB.
      // The requirement is that
      //   FalseProb for BB1 + (TrueProb for BB1 * FalseProb for TmpBB)
      //     = FalseProb for original BB.
      // Assuming the original weights are A and B, one choice is to set BB1's
      // weights to 2A+B and B, and set TmpBB's weights to 2A and B. This choice
      // assumes that
      //   FalseProb for BB1 == TrueProb for BB1 * FalseProb for TmpBB.
      uint64_t TrueWeight, FalseWeight;
2014-12-10 01:32:12 +08:00
      if (extractBranchMetadata(Br1, TrueWeight, FalseWeight)) {
2014-12-10 00:36:13 +08:00
        uint64_t NewTrueWeight = 2 * TrueWeight + FalseWeight;
        uint64_t NewFalseWeight = FalseWeight;
        scaleWeights(NewTrueWeight, NewFalseWeight);
        // Attach the rescaled weights computed above, not the original ones.
        Br1->setMetadata(LLVMContext::MD_prof, MDBuilder(Br1->getContext())
                         .createBranchWeights(NewTrueWeight, NewFalseWeight));

        NewTrueWeight = 2 * TrueWeight;
        NewFalseWeight = FalseWeight;
        scaleWeights(NewTrueWeight, NewFalseWeight);
        Br2->setMetadata(LLVMContext::MD_prof, MDBuilder(Br2->getContext())
                         .createBranchWeights(NewTrueWeight, NewFalseWeight));
      }
    }

    // Note: No point in getting fancy here, since the DT info is never
2015-03-19 07:17:28 +08:00
    // available to CodeGenPrepare.
2014-12-10 00:36:13 +08:00
    ModifiedDT = true;

    MadeChange = true;

    DEBUG(dbgs() << "After branch condition splitting\n"; BB.dump();
          TmpBB->dump());
  }
  return MadeChange;
}
2015-09-16 02:32:14 +08:00
void CodeGenPrepare::stripInvariantGroupMetadata(Instruction &I) {
2015-09-18 04:25:07 +08:00
  if (auto *InvariantMD = I.getMetadata(LLVMContext::MD_invariant_group))
2015-09-16 02:32:14 +08:00
    I.dropUnknownNonDebugMetadata(InvariantMD->getMetadataID());
}