As noted in the code comment, transforming this in the other direction might require
a separate transform here in CGP given the block-at-a-time DAG constraint.
Besides that theoretical motivation, there are 2 practical motivations for the
subtract-of-cmps form:
1. The codegen for both x86 and PPC is better for this IR (though PPC could be better still).
There is discussion about canonicalizing IR to the select form
( http://lists.llvm.org/pipermail/llvm-dev/2017-July/114885.html ),
so we probably need to add DAG transforms for those patterns anyway, but this improves the
memcmp output without waiting for that step.
2. If we allow vector-sized chunks for the load and compare, x86 is better prepared to convert
that to optimal code when using subtract-of-cmps, so another prerequisite patch is avoided
if we choose to enable that.
Differential Revision: https://reviews.llvm.org/D34904
llvm-svn: 309597
Summary:
Since r293359, most dump() function are only defined when
`!defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)` holds. print() functions
only used by dump() functions are now unused in release builds,
generating lots of warnings. This patch only defines some print()
functions if they are used.
Reviewers: MatzeB
Reviewed By: MatzeB
Subscribers: arsenm, mzolotukhin, nhaehnle, llvm-commits
Differential Revision: https://reviews.llvm.org/D35949
llvm-svn: 309553
This avoids excessive compile time. The case I'm looking at is
Function.cpp from an old version of LLVM that still had the giant memcmp
string matcher in it. Before r308322 this compiled in about 2 minutes,
after it, clang takes infinite* time to compile it. With this patch
we're at 5 min, which is still bad but this is a pathological case.
The cut off at 20 uses was chosen by looking at other cut-offs in LLVM
for user scanning. It's probably too high, but does the job and is very
unlikely to regress anything.
Fixes PR33900.
* I'm impatient and aborted after 15 minutes, on the bug report it was
killed after 2h.
llvm-svn: 308891
Allowing cycles in Phi traversal increases the scope of optimize memory instruction
in case we are in loop.
The added test shows an example of enabling optimization inside a loop.
Reviewers: loladiro, spatel, efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D35294
llvm-svn: 308419
optimizeMemoryInst contains a vector AddrModeInsts.
The only use of this vector is to check that all instructions are in the same
block as memory instruction. This check is guarded by PhiSeen flag,
so if we traversed through phi node then we do not need to keep information
in AddrModeInsts. AddModeInsts is set first time we found some addressing mode
and updated if we found new one later.
We can find next addressing mode only if we traverse phi node so all code
related to update of AddModeInsts can be safely removed.
Reviewers: loladiro, spatel, efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D35291
llvm-svn: 308265
Now, getUserCost() only checks the src and dst types of EXT to decide it is free
or not. This change first checks the types, then calls isExtFreeImpl(), and
check if EXT can form ExtLoad at last. Currently, only AArch64 has customized
implementation of isExtFreeImpl() to check if EXT can be folded into its use.
Differential Revision: https://reviews.llvm.org/D34458
llvm-svn: 308076
When we fail to sink an instruction, we must make sure not to modify
the function; otherwise, we end up in an infinite loop because
CodeGenPrepare iterates until it doesn't make any changes.
Fixes https://bugs.llvm.org/show_bug.cgi?id=33608 .
llvm-svn: 307866
CodeGenPrepare::optimizeMemoryInst contains a check that we do nothing
if all instructions combining the address for memory instruction is in the same
block as memory instruction itself.
However if any of these instruction are placed after memory instruction then
address calculation will not be folded to memory instruction.
The added test case shows an example.
Reviewers: loladiro, spatel, efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D34862
llvm-svn: 307628
Summary:
Arguably non-integral pointers probably shouldn't show up here at all,
but since the backend doesn't complain and this takes valid (according
to the Verifier) IR and makes it invalid, make sure not to introduce
any inttoptr instructions if we're dealing with non-integral pointers.
Reviewed By: sanjoy
Differential Revision: https://reviews.llvm.org/D33110
llvm-svn: 306737
As noted in D34071, there are some IR optimization opportunities that could be
handled by normal IR passes if this expansion wasn't happening so late in CGP.
Regardless of that, it seems wasteful to knowingly produce suboptimal IR here,
so I'm proposing this change:
%s = sub i32 %x, %y
%r = icmp ne %s, 0
=>
%r = icmp ne %x, %y
Changing the predicate to 'eq' mimics what InstCombine would do, so that's just
an efficiency improvement if we decide this expansion should happen sooner.
The fact that the PowerPC backend doesn't eliminate the 'subf.' might be
something for PPC folks to investigate separately.
Differential Revision: https://reviews.llvm.org/D34416
llvm-svn: 306471
This is the last step needed to avoid regressions for x86 before we flip the switch to allow
expansion of the smallest set of memcpy() via CGP. The DAG version checks for constant strings,
so we need to do that here too.
FWIW, the 2 constant test is not handled by LibCallSimplifier::optimizeMemCmp() because that
code is limited to 8-bit constant arrays. LibCallSimplifier will also fail to optimize some 1
constant tests because its alignment requirements are too strict (shouldn't require alignment
for a constant operand).
Differential Revision: https://reviews.llvm.org/D34071
llvm-svn: 305734
Summary:
At present, `-profile-guided-section-prefix` is a `cl::Optional` option, which means it demands to be passed exactly zero or one times. Our build system makes it pretty tricky to guarantee this. We often accidentally pass the flag more than once (but always with the same "false" value) which results in an error, after which compilation fails:
```
clang (LLVM option parsing): for the -profile-guided-section-prefix option: may only occur zero or one times!
```
While we work on improving our build system, it also seems reasonable just to allow `-profile-guided-section-prefix` to be passed more than once, by to `cl::ZeroOrMore`. Quoting [[ http://llvm.org/docs/CommandLine.html#controlling-the-number-of-occurrences-required-and-allowed | the documentation ]]:
> The cl::ZeroOrMore modifier ... indicates that your program will allow the option to be specified zero or more times.
> ...
> If an option is specified multiple times for an option of the cl::opt class, only the last value will be retained.
Reviewers: danielcdh
Reviewed By: danielcdh
Subscribers: twoh, david2050, llvm-commits
Differential Revision: https://reviews.llvm.org/D34219
llvm-svn: 305413
We're currently passing endian-ness around as a param (and not uniformly),
so this eliminates the need for that. I'd like to add a constant fold
call too, and that requires a DL.
llvm-svn: 305129
The test diff for PowerPC shows we can better optimize if this case is one block.
For x86, there's would be a substantial difference if CGP expansion was enabled because branches are assumed
cheap and SDAG can't optimize across blocks.
Instead of this:
_cmp_eq8:
movq (%rdi), %rax
cmpq (%rsi), %rax
je LBB23_1
## BB#2: ## %res_block
movl $1, %ecx
jmp LBB23_3
LBB23_1:
xorl %ecx, %ecx
LBB23_3: ## %endblock
xorl %eax, %eax
testl %ecx, %ecx
sete %al
retq
We get this:
cmp_eq8:
movq (%rdi), %rcx
xorl %eax, %eax
cmpq (%rsi), %rcx
sete %al
retq
And that matches the optimal codegen that we get from the current expansion in SelectionDAGBuilder::visitMemCmpCall().
If this looks right, then I just need to confirm that vector-sized expansion will work from here, and we can enable
CGP memcmp() expansion for x86. Ie, we'll bypass the power-of-2 special cases currently optimized in SDAG because we
can lower the IR produced here optimally.
Differential Revision: https://reviews.llvm.org/D34005
llvm-svn: 304987
This could be viewed as another shortcoming of the DAGCombiner:
when both operands of a compare are zexted from the same source
type, we should be able to compare the original types.
The effect on PowerPC perf is likely unnoticeable, but there's a
visible regression for x86 if we feed the suboptimal IR for memcmp
expansion to the DAG:
_cmp_eq4_zexted_to_i64:
movl (%rdi), %ecx
movl (%rsi), %edx
xorl %eax, %eax
cmpq %rdx, %rcx
sete %al
_cmp_eq4_better:
movl (%rdi), %ecx
xorl %eax, %eax
cmpl (%rsi), %ecx
sete %al
llvm-svn: 304923
In the special (but also the likely common) case, we can avoid
the multi-block complexity of the general algorithm, so moving
this part off on its own will make it re-usable.
llvm-svn: 304908
I'd like to enable CGP memcmp expansion for x86, but the output from CGP would regress the
special cases (memcmp(x,y,N) != 0 for N=1,2,4,8,16,32 bytes) that we already handle.
I'm not sure if we'll actually be able to produce the optimal code given the block-at-a-time
limitation in the DAG. We might have to just avoid those special-cases here in CGP. But
regardless of that, I think this is a win for the more general cases.
http://rise4fun.com/Alive/cbQ
Differential Revision: https://reviews.llvm.org/D33963
llvm-svn: 304849
I did this a long time ago with a janky python script, but now
clang-format has built-in support for this. I fed clang-format every
line with a #include and let it re-sort things according to the precise
LLVM rules for include ordering baked into clang-format these days.
I've reverted a number of files where the results of sorting includes
isn't healthy. Either places where we have legacy code relying on
particular include ordering (where possible, I'll fix these separately)
or where we have particular formatting around #include lines that
I didn't want to disturb in this patch.
This patch is *entirely* mechanical. If you get merge conflicts or
anything, just ignore the changes in this patch and run clang-format
over your #include lines in the files.
Sorry for any noise here, but it is important to keep these things
stable. I was seeing an increasing number of patches with irrelevant
re-ordering of #include lines because clang-format was used. This patch
at least isolates that churn, makes it easy to skip when resolving
conflicts, and gets us to a clean baseline (again).
llvm-svn: 304787
This patch does an inline expansion of memcmp.
It changes the memcmp library call into an inline expansion when the size is
known at compile time and is under a target specified threshold.
This expansion is implemented in CodeGenPrepare and expands into straight line
code. The target specifies a maximum load size and the expansion works by using
this size to load the two sources, compare, and exit early if a difference is
found. It also has a special case when the memcmp result is used in a compare
to zero equality.
Differential Revision: https://reviews.llvm.org/D28637
llvm-svn: 304313
Rename the DEBUG_TYPE to match the names of corresponding passes where
it makes sense. Also establish the pattern of simply referencing
DEBUG_TYPE instead of repeating the passname where possible.
llvm-svn: 303921
Summary:
Implements PR889
Removing the virtual table pointer from Value saves 1% of RSS when doing
LTO of llc on Linux. The impact on time was positive, but too noisy to
conclusively say that performance improved. Here is a link to the
spreadsheet with the original data:
https://docs.google.com/spreadsheets/d/1F4FHir0qYnV0MEp2sYYp_BuvnJgWlWPhWOwZ6LbW7W4/edit?usp=sharing
This change makes it invalid to directly delete a Value, User, or
Instruction pointer. Instead, such code can be rewritten to a null check
and a call Value::deleteValue(). Value objects tend to have their
lifetimes managed through iplist, so for the most part, this isn't a big
deal. However, there are some places where LLVM deletes values, and
those places had to be migrated to deleteValue. I have also created
llvm::unique_value, which has a custom deleter, so it can be used in
place of std::unique_ptr<Value>.
I had to add the "DerivedUser" Deleter escape hatch for MemorySSA, which
derives from User outside of lib/IR. Code in IR cannot include MemorySSA
headers or call the MemoryAccess object destructors without introducing
a circular dependency, so we need some level of indirection.
Unfortunately, no class derived from User may have any virtual methods,
because adding a virtual method would break User::getHungOffOperands(),
which assumes that it can find the use list immediately prior to the
User object. I've added a static_assert to the appropriate OperandTraits
templates to help people avoid this trap.
Reviewers: chandlerc, mehdi_amini, pete, dberlin, george.burgess.iv
Reviewed By: chandlerc
Subscribers: krytarowski, eraman, george.burgess.iv, mzolotukhin, Prazek, nlewycky, hans, inglorion, pcc, tejohnson, dberlin, llvm-commits
Differential Revision: https://reviews.llvm.org/D31261
llvm-svn: 303362
This provides a new way to access the TargetMachine through
TargetPassConfig, as a dependency.
The patterns replaced here are:
* Passes handling a null TargetMachine call
`getAnalysisIfAvailable<TargetPassConfig>`.
* Passes not handling a null TargetMachine
`addRequired<TargetPassConfig>` and call
`getAnalysis<TargetPassConfig>`.
* MachineFunctionPasses now use MF.getTarget().
* Remove all the TargetMachine constructors.
* Remove INITIALIZE_TM_PASS.
This fixes a crash when running `llc -start-before prologepilog`.
PEI needs StackProtector, which gets constructed without a TargetMachine
by the pass manager. The StackProtector pass doesn't handle the case
where there is no TargetMachine, so it segfaults.
Related to PR30324.
Differential Revision: https://reviews.llvm.org/D33222
llvm-svn: 303360
Currently, when masked load, store, gather or scatter intrinsics are used, we check in CodeGenPrepare pass if the subtarget support these intrinsics, if not we replace them with scalar code - this is a functional transformation not an optimization (not optional).
CodeGenPrepare pass does not run when the optimization level is set to CodeGenOpt::None (-O0).
Functional transformation should run with all optimization levels, so here I created a new pass which runs on all optimization levels and does no more than this transformation.
Differential Revision: https://reviews.llvm.org/D32487
llvm-svn: 303050
Summary:
r284533 added hot and cold section prefixes based on profile
information, to enable grouping of hot/cold functions at link time.
However, it used "cold" as the prefix for cold sections, but gold only
recognizes "unlikely" (which is used by gcc for cold sections).
Therefore, cold sections were not properly being grouped. Switch to
using "unlikely"
Reviewers: danielcdh, davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D32983
llvm-svn: 302502
Commits were:
"Use WeakVH instead of WeakTrackingVH in AliasSetTracker's UnkownInsts"
"Add a new WeakVH value handle; NFC"
"Rename WeakVH to WeakTrackingVH; NFC"
The changes assumed pointers are 8 byte aligned on all architectures.
llvm-svn: 301429
Summary:
I plan to use WeakVH to mean "nulls itself out on deletion, but does
not track RAUW" in a subsequent commit.
Reviewers: dblaikie, davide
Reviewed By: davide
Subscribers: arsenm, mehdi_amini, mcrosier, mzolotukhin, jfb, llvm-commits, nhaehnle
Differential Revision: https://reviews.llvm.org/D32266
llvm-svn: 301424
This patch uses lshrInPlace to replace code where the object that lshr is called on is being overwritten with the result.
This adds an lshrInPlace(const APInt &) version as well.
Differential Revision: https://reviews.llvm.org/D32155
llvm-svn: 300566
The splitIndirectCriticalEdges function generates and invalid CFG when the
'Target' basic block is a loop to itself. When this occurs, the code that
updates the predecessor terminator needs to update the terminator in the split
basic block.
This occurs when there is an edge from block D back to D. Since D is split in
to D0 and D1, the code needs to update the terminator in D1. But D1 is not in
the OtherPreds vector, so it was not getting updated.
Differential Revision: https://reviews.llvm.org/D32126
llvm-svn: 300480
and to expose a handle to represent the actual case rather than having
the iterator return a reference to itself.
All of this allows the iterator to be used with common STL facilities,
standard algorithms, etc.
Doing this exposed some missing facilities in the iterator facade that
I've fixed and required some work to the actual iterator to fully
support the necessary API.
Differential Revision: https://reviews.llvm.org/D31548
llvm-svn: 300032
The new codepath has been in the tree for years, and there isn't any
reason to use two codepaths here.
Differential Revision: https://reviews.llvm.org/D30596
llvm-svn: 299723
Summary:
Move the aarch64-type-promotion pass within the existing type promotion framework in CGP.
This change also support forking sexts when a new sext is required for promotion.
Note that change is based on D27853 and I am submitting this out early to provide a better idea on D27853.
Reviewers: jmolloy, mcrosier, javed.absar, qcolombet
Reviewed By: qcolombet
Subscribers: llvm-commits, aemerson, rengolin, mcrosier
Differential Revision: https://reviews.llvm.org/D28680
llvm-svn: 299379
This moves the isMask and isShiftedMask functions to be class methods. They now use the MathExtras.h function for single word size and leading/trailing zeros/ones or countPopulation for the multiword size. The previous implementation made multiple temorary memory allocations to do the bitwise arithmetic operations to match the MathExtras.h implementation.
Differential Revision: https://reviews.llvm.org/D31565
llvm-svn: 299362
Summary: The current prefix based function layout algorithm only looks at function's entry count, which is not sufficient. A function should be grouped together if its entry count or any call edge count is hot.
Reviewers: davidxl, eraman
Reviewed By: eraman
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D31225
llvm-svn: 298656
Summary:
This class is a list of AttributeSetNodes corresponding the function
prototype of a call or function declaration. This class used to be
called ParamAttrListPtr, then AttrListPtr, then AttributeSet. It is
typically accessed by parameter and return value index, so
"AttributeList" seems like a more intuitive name.
Rename AttributeSetImpl to AttributeListImpl to follow suit.
It's useful to rename this class so that we can rename AttributeSetNode
to AttributeSet later. AttributeSet is the set of attributes that apply
to a single function, argument, or return value.
Reviewers: sanjoy, javed.absar, chandlerc, pete
Reviewed By: pete
Subscribers: pete, jholewinski, arsenm, dschuff, mehdi_amini, jfb, nhaehnle, sbc100, void, llvm-commits
Differential Revision: https://reviews.llvm.org/D31102
llvm-svn: 298393
Summary:
Instead of just looking for a load which is mergable with Ext to form ExtLoad, trying to promote Exts as long as the cost is acceptable. This change is not a NFC as it continue promoting Exts even after finding a load during promotions; the change in arm64-codegen-prepare-extload.ll described in 2.b might show the case.
This change was motivated from D26524. Based on this change, I will move the transformation performed in aarch64-type-promotion into CGP.
Reviewers: jmolloy, qcolombet, mcrosier, javed.absar
Reviewed By: qcolombet
Subscribers: rengolin, llvm-commits, aemerson
Differential Revision: https://reviews.llvm.org/D27853
llvm-svn: 298114
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296416
When we construct addressing modes, we use isNoopAddrSpaceCast to ignore
addrspacecast instructions. Make sure we insert the correct addrspacecast
when we reconstruct the addressing mode.
Differential Revision: https://reviews.llvm.org/D30114
llvm-svn: 296167
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296149
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296060
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used. One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, mcrosier, sebpop, llvm-commits
Differential Revision: https://reviews.llvm.org/D28813
llvm-svn: 295746
Summary: This change prevent the signed value of cost from being negative as the value is passed as an unsigned argument.
Reviewers: mcrosier, jmolloy, qcolombet, javed.absar
Reviewed By: mcrosier, qcolombet
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28871
llvm-svn: 293307
This is a succeeding patch of https://reviews.llvm.org/D22840 to address the
issue when a value to be merged into an int64 pair is in a different BB. Redoing
the store splitting in CodeGenPrepare so we can match the pattern across multiple
BBs and move some instructions into the same BB. We still keep the code in dag
combine so that we can catch cases that show up after DAG combining runs.
Differential Revision: https://reviews.llvm.org/D25914
llvm-svn: 290365
We're currently doing nearly the same thing for @llvm.objectsize in
three different places: two of them are missing checks for overflow,
and one of them could subtly break if InstCombine gets much smarter
about removing alloc sites. Seems like a good idea to not do that.
llvm-svn: 290214
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block and unit test failures in AVR and WebAssembly :
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289988
`dropUnknownNonDebugMetadata` takes a list of "known" metadata IDs. The
only reason it worked at all is that `getMetadataID` returns something
unrelated -- it returns the subclass ID of the receiver (which is used
in `dyn_cast` etc.). That does not numerically match
`LLVMContext::MD_invariant_group` and ends up dropping `invariant_group`
along with every other metadata that does not numerically match
`LLVMContext::MD_invariant_group`.
llvm-svn: 289973
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block:
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289951
Instead, expose whether the current type is an array or a struct, if an array
what the upper bound is, and if a struct the struct type itself. This is
in preparation for a later change which will make PointerType derive from
Type rather than SequentialType.
Differential Revision: https://reviews.llvm.org/D26594
llvm-svn: 288458
Summary:
Previously, CGP would unconditionally sink addrspacecast instructions,
even going so far as to sink them into a loop.
Now we check that the cast is "cheap", as defined by TLI.
We introduce a new "is-cheap" function to TLI rather than using
isNopAddrSpaceCast because some GPU platforms want the ability to ask
for non-nop casts to be sunk.
Reviewers: arsenm, tra
Subscribers: jholewinski, wdng, llvm-commits
Differential Revision: https://reviews.llvm.org/D26923
llvm-svn: 287591
Summary: Merging an empty case block into the header block of switch could cause
ISel to add COPY instructions in the header of switch, instead of the case
block, if the case block is used as an incoming block of a PHI. This could
potentially increase dynamic instructions, especially when the switch is in a
loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, davidxl
Subscribers: qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 287553
of which that is hidden inside a separate function call) and helpfully
before building expensive transaction infrastructure. This will avoid
crashing when running CGP in a generic mode if we ever managed to hit
this case.
Note that I spent some time looking at alternatives. CGP is actually
used without a TM or TLI in order to do some target-independent testing.
Further, all of the neighboring optimization techniques actually have
some paths that are effective even in the absence of TLI so this seemed
the correct scope at which to check and bypass logic. It still isn't
clear that long-term support for missing TM/TLI is the right
cost/benefit tradeoff for CGP -- we seem to get relatively little for it
and the code is just littered with checks (and assumptions which
I suspect are still missing some checks).
This at least fixes the potential bug in this code spotted by
PVS-Studio, so we've got that going for us. ;]
llvm-svn: 285987
Summary:
The original implementation is in r261607, which was reverted in r269726 to accomendate the ProfileSummaryInfo analysis pass. The new implementation:
1. add a new metadata for function section prefix
2. query against ProfileSummaryInfo in CGP to set the correct section prefix for each function
3. output the section prefix set by CGP
Reviewers: davidxl, eraman
Subscribers: vsk, llvm-commits
Differential Revision: https://reviews.llvm.org/D24989
llvm-svn: 284533
CodeGenPrepare knows how to move a zext of a load into the same basic block
where the load lives. The goal is to help ISel match a zero-extending load
instead of two separated instructions.
CGP attempts to move a zext computation even if it lives in a basic block that
does not post-dominate the load's basic block. That means, the hoisted zext may
be speculated. Preserving the zext location would hurt the debugging experience
and the quality of sample pgo.
With this patch, when moving a zext near to its associated load, CGP no longer
propagates the zext's debug location. Instead, CGP conservatively reuses the
same debug location for the load and the zext.
An alternative approach would be to assign an artificial line-0 location to the
zext. However we don't want to over-use the 'line-0' for this particular case
because it would have a size cost in the line-table section for no additional
benefit.
Differential Revision: https://reviews.llvm.org/D25611
llvm-svn: 284377
Summary: If consecutive select instructions are lowered separately in CGP, it will introduce redundant condition check and branches that cannot be removed by later optimization phases. This patch lowers all consecutive select instructions at the same to to avoid inefficent code as demonstrated in https://llvm.org/bugs/show_bug.cgi?id=29095
Reviewers: davidxl
Subscribers: vsk, llvm-commits
Differential Revision: https://reviews.llvm.org/D24147
llvm-svn: 281252
CGP tail-duplicates rets into blocks that end with a call that feed the ret.
This puts the call in tail position, potentially allowing the DAG builder to
lower it as a tail call. To avoid tail duplication in cases where we won't
form the tail call, CGP tried to predict whether this is going to be possible,
and avoids doing it when lowering as a tail call will definitely fail.
However, it was being too conservative by always throwing away calls to
functions with a signext/zeroext attribute on the return type.
Instead, we can use the same logic the builder uses to determine whether the
attributes work out.
Differential Revision: https://reviews.llvm.org/D24315
llvm-svn: 280894
CGP currently drops select's MD_prof profile data when
generating conditional branch which can lead to bad
code layout. The patch fixes the issue.
Differential Revision: http://reviews.llvm.org/D24169
llvm-svn: 280600
Elsewhere (particularly computeKnownBits) we assume that a global will be
aligned to the value returned by Value::getPointerAlignment. This is used to
boost the alignment on memcpy/memset, so any target-specific request can only
increase that value.
llvm-svn: 275866
Also, rename recognizeBitReverseOrBSwapIdiom to recognizeBSwapOrBitReverseIdiom,
so the ordering of the MatchBSwaps and MatchBitReversals arguments are
consistent with the function name.
llvm-svn: 270715
The sink cast machinery is supposed to sink casts as close to their user
as possible. However, an EH pad is the first instruction in it's basic
block. Don't sink if the user is an EH pad.
This fixes PR27536.
llvm-svn: 267767
This is part of solving PR27344:
https://llvm.org/bugs/show_bug.cgi?id=27344
CGP should undo the SimplifyCFG transform for the same reason that earlier patches have used this
same mechanism: it's possible that passes between SimplifyCFG and CGP may be able to optimize the
IR further with a select in place.
For the TLI hook default, >99% taken or not taken is chosen as the default threshold for a highly
predictable branch. Even the most limited HW branch predictors will be correct on this branch almost
all the time, so even a massive mispredict penalty perf loss would be overcome by the win from all
the times the branch was predicted correctly.
As a follow-up, we could make the default target hook less conservative by using the SchedMachineModel's
MispredictPenalty. Or we could just let targets override the default by implementing the hook with that
and other target-specific options. Note that trying to statically determine mispredict rates for
close-to-balanced profile weight data is generally impossible if the HW is sufficiently advanced. Ie,
50/50 taken/not-taken might still be 100% predictable.
Finally, note that this patch as-is will not solve PR27344 because the current __builtin_unpredictable()
branch weight default values are 4 and 64. A proposal to change that is in D19435.
Differential Revision: http://reviews.llvm.org/D19488
llvm-svn: 267572
The original commit was reverted because of a buildbot problem with LazyCallGraph::SCC handling (not related to the OptBisect handling).
Differential Revision: http://reviews.llvm.org/D19172
llvm-svn: 267231
This patch implements a optimization bisect feature, which will allow optimizations to be selectively disabled at compile time in order to track down test failures that are caused by incorrect optimizations.
The bisection is enabled using a new command line option (-opt-bisect-limit). Individual passes that may be skipped call the OptBisect object (via an LLVMContext) to see if they should be skipped based on the bisect limit. A finer level of control (disabling individual transformations) can be managed through an addition OptBisect method, but this is not yet used.
The skip checking in this implementation is based on (and replaces) the skipOptnoneFunction check. Where that check was being called, a new call has been inserted in its place which checks the bisect limit and the optnone attribute. A new function call has been added for module and SCC passes that behaves in a similar way.
Differential Revision: http://reviews.llvm.org/D19172
llvm-svn: 267022
This patch fixes calculating of builtin_object_size if it depends on a
condition. Before this patch compiler did not know how to calculate the
object size when it finds a condition that cannot be eliminated.
This patch enables calculating of builtin_object_size even in case when
condition cannot be eliminated by choosing minimum or maximum value as a
result from condition. Choosing minimum or maximum value from condition
is based on the second argument of __builtin_object_size function.
Patch by Strahinja Petrovic.
Differential Revision: http://reviews.llvm.org/D18438
llvm-svn: 266193
Presently, CodeGenPrepare deletes all nearly empty (only phi and branch)
basic blocks. This pass can delete loop preheaders which frequently creates
critical edges. A preheader can be a convenient place to spill registers to
the stack. If the entrance to a loop body is a critical edge, then spills
may occur in the loop body rather than immediately before it. This patch
protects loop preheaders from deletion in CodeGenPrepare even if they are
nearly empty.
Since the patch alters the CFG, it affects a large number of test cases.
In most cases, the changes are merely cosmetic (basic blocks have different
names or instruction orders change slightly). I am somewhat concerned about
the test/CodeGen/Mips/brdelayslot.ll test case. If the loop preheader is not
deleted, then the MIPS backend does not take advantage of a branch delay
slot. Consequently, I would like some close review by a MIPS expert.
The patch also partially subsumes D16893 from George Burgess IV. George
correctly notes that CodeGenPrepare does not actually preserve the dominator
tree. I think the dominator tree was usually not valid when CodeGenPrepare
ran, but I am using LoopInfo to mark preheaders, so the dominator tree is
now always valid before CodeGenPrepare.
Author: Tom Jablin (tjablin)
Reviewers: hfinkel george.burgess.iv vkalintiris dsanders kbarton cycheng
http://reviews.llvm.org/D16984
llvm-svn: 265397
Sinking comparisons in CGP can undo the job of hoisting them done
earlier by LICM, and soft-FP makes this an expensive mistake.
A common pattern that produces floating point comparisons uniform
over a loop is an explicit check for division by zero. If the divisor
is hoisted out of the loop, the comparison can also be, but hoisting
the function that unwinds is never legal, since it may cause side
effects in the loop body prior to the unwinding to not be executed.
Differential Revision: http://reviews.llvm.org/D18744
llvm-svn: 265264
CGP modifies the domtree in some cases, so saying that it preserves the
domtree is a lie. We'll be able to selectively preserve it with the new
pass manager.
Differential Revision: http://reviews.llvm.org/D16893
llvm-svn: 264099
This patch teaches CGP to duplicate addressing mode computations into cold paths (detected via explicit cold attribute on calls) if required to let addressing mode be safely sunk into the basic block containing each load and store.
In general, duplicating code into cold blocks may result in code growth, but should not effect performance. In this case, it's better to duplicate some code than to put extra pressure on the register allocator by making it keep the address through the entirely of the fast path.
This patch only handles addressing computations, but in principal, we could implement a more general cold cold scheduling heuristic which tries to reduce register pressure in the fast path by duplicating code into the cold path. Getting the profitability of the general case right seemed likely to be challenging, so I stuck to the existing case (addressing computation) we already had.
Differential Revision: http://reviews.llvm.org/D17652
llvm-svn: 263074
Summary:
Both the hardware and LLVM have changed since 2012.
Now, load-based heuristic don't show big differences any more on OoO cores.
There is no notable regressons and improvements on spec2000/2006. (Cortex-A57, Core i5).
Reviewers: spatel, zansari
Differential Revision: http://reviews.llvm.org/D16836
llvm-svn: 261809
Stop using `getNodePtrUnchecked()` when building IR. Eventually a
dereference will be required to get at the downcast node, since the
iterator will only store an `ilist_node_base` of some sort.
This should have no functionality change for now, but is a path towards
removing some more UB from ilist.
llvm-svn: 261495
`ilist_iterator<NodeTy>::getNodePtrUnchecked()` is documented as being
for internal use only, but CodeGenPrepare was using it anyway. This
code relies on pulling out the `Value*` pointer even after the lifetime
of the iterator is over. But having this pointer available in
ilist_iterator depends on UB in the first place.
Instead, safely pull out the `Value*` when the iterator is alive and
stop using the internal-only API.
There should be no functionality change here.
llvm-svn: 261493
platforms.
With ELF, the alignment of a global variable in a shared library will
get copied into an executables linked against it, if the executable even
accesss the variable. So, it's not possible to implicitly increase
alignment based on access patterns, or you'll break existing binaries.
This happened to affect libc++'s std::cout symbol, for example. See
thread: http://thread.gmane.org/gmane.comp.compilers.clang.devel/45311
(This is a re-commit of r257719, without the bug reported in
PR26144. I've tweaked the code to not assert-fail in
enforceKnownAlignment when computeKnownBits doesn't recurse far enough
to find the underlying Alloca/GlobalObject value.)
Differential Revision: http://reviews.llvm.org/D16145
llvm-svn: 257902
There are several requirements that ended up with this design;
1. Matching bitreversals is too heavyweight for InstCombine and doesn't really need to be done so early.
2. Bitreversals and byteswaps are very related in their matching logic.
3. We want to implement support for matching more advanced bswap/bitreverse patterns like partial bswaps/bitreverses.
4. Bswaps are best matched early in InstCombine.
The result of these is that a new utility function is created in Transforms/Utils/Local.h that can be configured to search for bswaps, bitreverses or both. InstCombine uses it to find only bswaps, CGP uses it to find only bitreversals.
We can then extend the matching logic in one place only.
llvm-svn: 257875
platforms.
With ELF, the alignment of a global variable in a shared library will
get copied into an executables linked against it, if the executable even
accesss the variable. So, it's not possible to implicitly increase
alignment based on access patterns, or you'll break existing binaries.
This happened to affect libc++'s std::cout symbol, for example. See
thread: http://thread.gmane.org/gmane.comp.compilers.clang.devel/45311
llvm-svn: 257719
Summary:
This commit renames GCRelocateOperands to GCRelocateInst and makes it an
intrinsic wrapper, similar to e.g. MemCpyInst. Also, all users of
GCRelocateOperands were changed to use the new intrinsic wrapper instead.
Reviewers: sanjoy, reames
Subscribers: reames, sanjoy, llvm-commits
Differential Revision: http://reviews.llvm.org/D15762
llvm-svn: 256811
Update some comments to be more explicit.
Change bypassSlowDivision and the functions it calls so that they take
BasicBlock*s and Instruction*s, rather than Function::iterator&s and
BasicBlock::iterator&s.
Change the APIs so that the caller is responsible for updating the
iterator, rather than the callee. This makes control flow much easier
to follow.
Patch by Justin Lebar!
llvm-svn: 256789
Summary:
Add and instructions immediately after loads that only have their low
bits used, assuming that the (and (load x) c) will be matched as a
extload and the ands/truncs fed by the extload will be removed by isel.
Reviewers: mcrosier, qcolombet, ab
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D14584
llvm-svn: 253722
This is another step towards allowing SimplifyCFG to speculate harder, but then have
CGP clean things up if the target doesn't like it.
Previous patches in this series:
http://reviews.llvm.org/D12882http://reviews.llvm.org/D13297
D13297 should catch most expensive ops, but speculation of cttz/ctlz requires special
handling because of weirdness in the intrinsic definition for handling a zero input
(that definition can probably be blamed on x86).
For example, if we have the usual speculated-by-select expensive op pattern like this:
%tobool = icmp eq i64 %A, 0
%0 = tail call i64 @llvm.cttz.i64(i64 %A, i1 true) ; is_zero_undef == true
%cond = select i1 %tobool, i64 64, i64 %0
ret i64 %cond
There's an instcombine that will turn it into:
%0 = tail call i64 @llvm.cttz.i64(i64 %A, i1 false) ; is_zero_undef == false
This CGP patch is looking for that case and despeculating it back into:
entry:
%tobool = icmp eq i64 %A, 0
br i1 %tobool, label %cond.end, label %cond.true
cond.true:
%0 = tail call i64 @llvm.cttz.i64(i64 %A, i1 true) ; is_zero_undef == true
br label %cond.end
cond.end:
%cond = phi i64 [ %0, %cond.true ], [ 64, %entry ]
ret i64 %cond
This unfortunately may lead to poorer codegen (see the changes in the existing x86 test),
but if we increase speculation in SimplifyCFG (the next step in this patch series), then
we should avoid those kinds of cases in the first place.
The need for this patch was originally mentioned here:
http://reviews.llvm.org/D7506
with follow-up here:
http://reviews.llvm.org/D7554
Differential Revision: http://reviews.llvm.org/D14630
llvm-svn: 253573
Note, this was reviewed (and more details are in) http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151109/312083.html
These intrinsics currently have an explicit alignment argument which is
required to be a constant integer. It represents the alignment of the
source and dest, and so must be the minimum of those.
This change allows source and dest to each have their own alignments
by using the alignment attribute on their arguments. The alignment
argument itself is removed.
There are a few places in the code for which the code needs to be
checked by an expert as to whether using only src/dest alignment is
safe. For those places, they currently take the minimum of src/dest
alignments which matches the current behaviour.
For example, code which used to read:
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %src, i32 500, i32 8, i1 false)
will now read:
call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 8 %dest, i8* align 8 %src, i32 500, i1 false)
For out of tree owners, I was able to strip alignment from calls using sed by replacing:
(call.*llvm\.memset.*)i32\ [0-9]*\,\ i1 false\)
with:
$1i1 false)
and similarly for memmove and memcpy.
I then added back in alignment to test cases which needed it.
A similar commit will be made to clang which actually has many differences in alignment as now
IRBuilder can generate different source/dest alignments on calls.
In IRBuilder itself, a new argument was added. Instead of calling:
CreateMemCpy(Dst, Src, getInt64(Size), DstAlign, /* isVolatile */ false)
you now call
CreateMemCpy(Dst, Src, getInt64(Size), DstAlign, SrcAlign, /* isVolatile */ false)
There is a temporary class (IntegerAlignment) which takes the source alignment and rejects
implicit conversion from bool. This is to prevent isVolatile here from passing its default
parameter to the source alignment.
Note, changes in future can now be made to codegen. I didn't change anything here, but this
change should enable better memcpy code sequences.
Reviewed by Hal Finkel.
llvm-svn: 253511
This is a redo of r251849 except the tests have been split into arch-specific folders
to hopefully make the bots happy.
This is a follow-up from the discussion in D12965. The block-at-a-time limitation of
SelectionDAG also came up in D13297.
Without the InstCombine change from D12965, I don't expect this patch to make any
difference in the real world because InstCombine does not shrink cases like this in
visitSwitchInst(). But we need to have this CGP safety harness in place before
proceeding with any shrinkage in D12965, so we won't generate extra extends for compares.
I've opted for IR regression tests in the patch because that seems like a clearer way to
test the transform, but PowerPC CodeGen for an i16 widening test is shown below. x86
will need more work to solve: https://llvm.org/bugs/show_bug.cgi?id=22473
Before:
BB#0:
mr 4, 3
extsh. 3, 4
ble 0, .LBB0_5
BB#1:
cmpwi 3, 99
bgt 0, .LBB0_9
BB#2:
rlwinm 4, 4, 0, 16, 31 <--- 32-bit mask/extend
li 3, 0
cmplwi 4, 1
beqlr 0
BB#3:
cmplwi 4, 10
bne 0, .LBB0_12
BB#4:
li 3, 1
blr
.LBB0_5:
rlwinm 3, 4, 0, 16, 31 <--- 32-bit mask/extend
cmplwi 3, 65436
beq 0, .LBB0_13
BB#6:
cmplwi 3, 65526
beq 0, .LBB0_15
BB#7:
cmplwi 3, 65535
bne 0, .LBB0_12
BB#8:
li 3, 4
blr
.LBB0_9:
rlwinm 3, 4, 0, 16, 31 <--- 32-bit mask/extend
cmplwi 3, 100
beq 0, .LBB0_14
...
After:
BB#0:
rlwinm 4, 3, 0, 16, 31 <--- mask/extend to 32-bit and then use that for comparisons
cmpwi 4, 999
ble 0, .LBB0_5
BB#1:
lis 3, 0
ori 3, 3, 65525
cmpw 4, 3
bgt 0, .LBB0_9
BB#2:
cmplwi 4, 1000
beq 0, .LBB0_14
BB#3:
cmplwi 4, 65436
bne 0, .LBB0_13
BB#4:
li 3, 6
blr
.LBB0_5:
li 3, 0
cmplwi 4, 1
beqlr 0
BB#6:
cmplwi 4, 10
beq 0, .LBB0_12
BB#7:
cmplwi 4, 100
bne 0, .LBB0_13
BB#8:
li 3, 2
blr
.LBB0_9:
cmplwi 4, 65526
beq 0, .LBB0_15
BB#10:
cmplwi 4, 65535
bne 0, .LBB0_13
...
Differential Revision: http://reviews.llvm.org/D13532
llvm-svn: 251857
This is a follow-up from the discussion in D12965. The block-at-a-time limitation of
SelectionDAG also came up in D13297.
Without the InstCombine change from D12965, I don't expect this patch to make any
difference in the real world because InstCombine does not shrink cases like this in
visitSwitchInst(). But we need to have this CGP safety harness in place before
proceeding with any shrinkage in D12965, so we won't generate extra extends for compares.
I've opted for IR regression tests in the patch because that seems like a clearer way to
test the transform, but PowerPC CodeGen for an i16 widening test is shown below. x86
will need more work to solve: https://llvm.org/bugs/show_bug.cgi?id=22473
Before:
BB#0:
mr 4, 3
extsh. 3, 4
ble 0, .LBB0_5
BB#1:
cmpwi 3, 99
bgt 0, .LBB0_9
BB#2:
rlwinm 4, 4, 0, 16, 31 <--- 32-bit mask/extend
li 3, 0
cmplwi 4, 1
beqlr 0
BB#3:
cmplwi 4, 10
bne 0, .LBB0_12
BB#4:
li 3, 1
blr
.LBB0_5:
rlwinm 3, 4, 0, 16, 31 <--- 32-bit mask/extend
cmplwi 3, 65436
beq 0, .LBB0_13
BB#6:
cmplwi 3, 65526
beq 0, .LBB0_15
BB#7:
cmplwi 3, 65535
bne 0, .LBB0_12
BB#8:
li 3, 4
blr
.LBB0_9:
rlwinm 3, 4, 0, 16, 31 <--- 32-bit mask/extend
cmplwi 3, 100
beq 0, .LBB0_14
...
After:
BB#0:
rlwinm 4, 3, 0, 16, 31 <--- mask/extend to 32-bit and then use that for comparisons
cmpwi 4, 999
ble 0, .LBB0_5
BB#1:
lis 3, 0
ori 3, 3, 65525
cmpw 4, 3
bgt 0, .LBB0_9
BB#2:
cmplwi 4, 1000
beq 0, .LBB0_14
BB#3:
cmplwi 4, 65436
bne 0, .LBB0_13
BB#4:
li 3, 6
blr
.LBB0_5:
li 3, 0
cmplwi 4, 1
beqlr 0
BB#6:
cmplwi 4, 10
beq 0, .LBB0_12
BB#7:
cmplwi 4, 100
bne 0, .LBB0_13
BB#8:
li 3, 2
blr
.LBB0_9:
cmplwi 4, 65526
beq 0, .LBB0_15
BB#10:
cmplwi 4, 65535
bne 0, .LBB0_13
...
Differential Revision: http://reviews.llvm.org/D13532
llvm-svn: 251849
When the target does not support these intrinsics they should be converted to a chain of scalar load or store operations.
If the mask is not constant, the scalarizer will build a chain of conditional basic blocks.
I added isLegalMaskedGather() isLegalMaskedScatter() APIs.
Differential Revision: http://reviews.llvm.org/D13722
llvm-svn: 251237
When we have to convert the masked.load, masked.store to scalar code, we generate a chain of conditional basic blocks.
I added optimization for constant mask vector.
Differential Revision: http://reviews.llvm.org/D13855
llvm-svn: 250893
This was originally checked in at r250527, but reverted at r250570 because of PR25222.
There were at least 2 problems:
1. The cost check was checking for an instruction with an exact cost of TCC_Expensive;
that should have been >=.
2. The cause of the clang stage 1 failures was illegally sinking 'call' instructions;
we can't sink instructions that may have side effects / are not safe to execute speculatively.
Fixed those conditions in sinkSelectOperand() and added test cases.
Original commit message:
This is a follow-up to the discussion in D12882.
Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands
are expensive (as defined by the TTI cost model) because that may expose further optimizations.
However, we would then like a later pass like CodeGenPrepare to undo that transformation if the
target would likely benefit from not speculatively executing an expensive op (this patch).
Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its
select-formation behavior that changed with r248439.
Differential Revision: http://reviews.llvm.org/D13297
llvm-svn: 250743
Originally I planned to use the same interface for masked gather/scatter and set isConsecutive to "false" in this case.
Now I'm implementing masked gather/scatter and see that the interface is inconvenient. I want to add interfaces isLegalMaskedGather() / isLegalMaskedScatter() instead of using the "Consecutive" parameter in the existing interfaces.
Differential Revision: http://reviews.llvm.org/D13850
llvm-svn: 250686
Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands
are expensive (as defined by the TTI cost model) because that may expose further optimizations.
However, we would then like a later pass like CodeGenPrepare to undo that transformation if the
target would likely benefit from not speculatively executing an expensive op (this patch).
Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its
select-formation behavior that changed with r248439.
Differential Revision: http://reviews.llvm.org/D13297
llvm-svn: 250527