This pass is like DeadCodeEliminationPass, but only does one pass
through a function instead of iterating on users of eliminated
instructions.
DeadCodeEliminationPass should be used in all cases.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D87933
The implementation of gather() should be reduced too,
but this change by itself makes things a little clearer:
we don't try to gather to a different type or
number-of-values than whatever is passed in as the value
list itself.
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D45263
LSR claims to MemorySSA, but we also have to make sure it is preserved
when splitting critical edges. This can be done by passing MSSAU to
SplitCriticalEdge.
Fixes PR47557.
We do similar factorization folds in SimplifyUsingDistributiveLaws,
but that drops no-wrap properties. Propagating those optimally may
help solve:
https://llvm.org/PR47430
The propagation is all-or-nothing for these patterns: when all
3 incoming ops have nsw or nuw, the 2 new ops should have the
same no-wrap property:
https://alive2.llvm.org/ce/z/Dv8wsU
This also solves:
https://llvm.org/PR47584
The test (currently crashing) is reduced from the example provided
in the post-commit discussion in D87149.
Differential Revision: https://reviews.llvm.org/D87965
I want to export this function, and the current API was a bit
weird: It took an additional Alignment argument that didn't really
have anything to do with what the function does. Drop it, and
perform a max at the callsite.
Also rename it to tryEnforceAlignment().
Currently SCEVExpander creates inttoptr for non-integral pointers if the
base is a null constant for example. This results in invalid IR.
This patch changes InsertNoopCastOfTo to emit a GEP & bitcast to convert
to a non-integral pointer. First, a GEP of i8* null is generated and the
integral value is used as index. The GEP is then bitcasted to the target
type.
This was exposed by D71539.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87827
When address sanitizing a function, stack unpinsoning code is inserted before each ret instruction. However if the ret instruciton is preceded by a musttail call, such transformation broke the musttail call contract and generates invalid IR.
This patch fixes the issue by moving the insertion point prior to the musttail call if there is one.
Differential Revision: https://reviews.llvm.org/D87777
We cannot iterate on scalable vector, the number of elements is unknown at compile-time.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87918
If some leaves have the same instructions to be vectorized, we may
incorrectly evaluate the best order for the root node (it is built for the
vector of instructions without repeated instructions and, thus, has less
elements than the root node). In this case we just can not try to reorder
the tree + we may calculate the wrong number of nodes that requre the
same reordering.
For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
be reordered, the best order will be \<1, 0\>. We need to extend this
order for the root node. For the root node this order should look like
\<3, 0, 1, 2\>. This patch allows extension of the orders of the nodes
with the reused instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D45263
This switches to using DSE + MemorySSA by default again, after
fixing the issues reported after the first commit.
Notable fixes fc82006331, a0017c2bc2.
This reverts commit 3a59628f3c.
Enable canonicalization of SPF_ABS and SPF_NABS to the abs intrinsic.
To be conservative, the one-use check on the comparison is retained,
this may be relaxed if all goes well.
It's pretty likely that this will uncover places that missing
handling for the abs() intrinsic. Please report any seen performance
regressions.
Differential Revision: https://reviews.llvm.org/D87188
Summary: Allow unroll and jam loops forced by user.
LoopUnrollAndJamPass is still disabled by default in the NPM pipeline,
and can be controlled by -enable-npm-unroll-and-jam.
Reviewed By: Meinersbur, dmgreen
Differential Revision: https://reviews.llvm.org/D87786
As discussed in:
https://llvm.org/PR47558
...there are several potential fixes/follow-ups visible
in the test case, but this is the quickest and safest
fix of the perf regression.
When a spill definition is before CoroBegin, we cannot spill it to the frame immediately after the definition. We have to spill it after the frame is ready.
The current implementation handles it properly for any other kinds of instructions except for PhINode and InvokeInst, which could also be defined before CoroBegin.
This patch fixes it by moving the CoroBegin dominance check earlier, so that it covers all cases.
Added a test.
Differential Revision: https://reviews.llvm.org/D87810
This is one (small) part of improving PR41312:
https://llvm.org/PR41312
As shown there and in the smaller tests here, if we have some member of the
reduction values that does not match the others, we want to push it to the
end (bring the matching members forward and together).
In the regression tests, we have 5 candidates for the 4 slots of the reduction.
If the one "wrong" compare is grouped with the others, it prevents forming the
ideal v4i1 compare reduction.
Differential Revision: https://reviews.llvm.org/D87772
Reapply after fixing SimplifyWithOpReplaced() to never return
the original value, which would lead to an infinite loop in this
transform.
-----
For selects of the type X == Y ? A : B, check if we can simplify A
by using the X == Y equality and replace the operand if that's
possible. We already try to do this in InstSimplify, but will only
fold if the result of the simplification is the same as B, in which
case the select can be dropped entirely. Here the select will be
retained, just one operand simplified.
As we are performing an actual replacement here, we don't have
problems with refinement / poison values.
Differential Revision: https://reviews.llvm.org/D87480
https://bugs.llvm.org/show_bug.cgi?id=45932
assert(OutlinedFunctionCost >= Cloner.OutlinedRegionCost && "Outlined function cost should be no less than the outlined region") getting triggered in computeBBInlineCost.
Intrinsics like "assume" are considered regular function calls while computing costs.
This patch enables computeBBInlineCost to queries TTI for intrinsic call cost.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D87132
D65060 was reverted because it introduced non-determinism by using BFI counts from already freed blocks. The parent of this revision fixes that by using a VH callback on blocks to prevent this from happening and makes sure BFI data is passed correctly in LoopStandardAnalysisResults.
This re-introduces the previous optimization of using BFI data to prevent LICM from hoisting/sinking if the instruction will end up moving to a colder block.
Internally at Facebook this change results in a ~7% win in a CPU related metric in one of our big services by preventing hoisting cold code into a hot pre-header like the added test case demonstrates.
Testing:
ninja check
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D87551
~~D65060 uncovered that trying to use BFI in loop passes can lead to non-deterministic behavior when blocks are re-used while retaining old BFI data.~~
~~To make sure BFI is preserved through loop passes a Value Handle (VH) callback is registered on blocks themselves. When a block is freed it now also wipes out the accompanying BFI entry such that stale BFI data can no longer persist resolving the determinism issue. ~~
~~An optimistic approach would be to incrementally update BFI information throughout the loop passes rather than only invalidating them on removed blocks. The issues with that are:~~
~~1. It is not clear how BFI information should be incrementally updated: If a block is duplicated does its BFI information come with? How about if it's split/modified/moved around? ~~
~~2. Assuming we can address these problems the implementation here will be a massive undertaking. ~~
~~There's a known need of BFI in LICM analysis which requires correct but not incrementally updated BFI data. A follow-up change can register BFI in all loop passes so this preserved but potentially lossy data is available to any loop pass that wants it.~~
See: D75341 for an identical implementation of preserving BFI via VH callbacks. The previous statements do still apply but this change no longer has to be in this diff because it's already upstream 😄 .
This diff also moves BFI to be a part of LoopStandardAnalysisResults since the previous method using getCachedResults now (correctly!) statically asserts (D72893) that this data isn't static through the loop passes.
Testing
Ninja check
Reviewed By: asbirlea, nikic
Differential Revision: https://reviews.llvm.org/D86156
Call instructions with musttail tag must be optimized as a tailcall, otherwise could lead to incorrect program behavior.
When TSAN is instrumenting functions, it broke the contract by adding a call to the tsan exit function inbetween the musttail call and return instruction, and also inserted exception handling code.
This happend throguh EscapeEnumerator, which adds exception handling code and returns ret instructions as the place to insert instrumentation calls.
This becomes especially problematic for coroutines, because coroutines rely on tail calls to do symmetric transfers properly.
To fix this, this patch moves the location to insert instrumentation calls prior to the musttail call for ret instructions that are following musttail calls, and also does not handle exception for musttail calls.
Differential Revision: https://reviews.llvm.org/D87620
For scalable type, the aggregated size is unknown at compile-time.
Skip instructions with scalable type to ensure the list of instructions
for vectorizeSimpleInstructions does not contains any scalable-vector instructions.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D87550
Invalid IR in unreachable code is technically valid IR. In this case,
the address space of the value was never inferred, and we tried to
rewrite it with an invalid address space value which would assert.
This patch is a first draft of a new pass that adds a more flexible way
to eliminate compares based on more complex constraints collected from
dominating conditions.
In particular, it aims at simplifying conditions of the forms below
using a forward propagation approach, rather than instcomine-style
ad-hoc backwards walking of def-use chains.
if (x < y)
if (y < z)
if (x < z) <- simplify
or
if (x + 2 < y)
if (x + 1 < y) <- simplify assuming no wraps
The general approach is to collect conditions and blocks, sort them by
dominance and then iterate over the sorted list. Conditions are turned
into a linear inequality and add it to a system containing the linear
inequalities that hold on entry to the block. For blocks, we check each
compare against the system and see if it is implied by the constraints
in the system.
We also keep a stack of processed conditions and remove conditions from
the stack and the constraint system once they go out-of-scope (= do not
dominate the current block any longer).
Currently there still are the least the following areas for improvements
* Currently large unsigned constants cannot be added to the system
(coefficients must be represented as integers)
* The way constraints are managed currently is not very optimized.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D84547
Similar to the tsan suppression in
`Utils/VNCoercion.cpp:getLoadLoadClobberFullWidthSize` (rL175034; load widening used by GVN),
the D81766 optimization should be suppressed under tsan due to potential
spurious data race reports:
struct A {
int i;
const short s; // the load cannot be vectorized because
int modify; // it overlaps with bytes being concurrently modified
long pad1, pad2;
};
// __tsan_read16 does not know that some bytes are undef and accessing is safe
Similarly, under asan, users can mark memory regions with
`__asan_poison_memory_region`. A widened load can lead to a spurious
use-after-poison error. hwasan/memtag should be similarly suppressed.
`mustSuppressSpeculation` suppresses asan/hwasan/tsan but not memtag, so
we need to exclude memtag in `vectorizeLoadInsert`.
Note, memtag suppression can be relaxed if the load is aligned to the
its granule (usually 16), but that is out of scope of this patch.
Reviewed By: spatel, vitalybuka
Differential Revision: https://reviews.llvm.org/D87538
Forward declare AAResults instead of the (old) AliasAnalysis type.
Remove includes from SLPVectorizer.cpp that are already included in SLPVectorizer.h.
The FailureReason input parameter maybe null, we check this in all other cases in the method but this one was missed somehow.
Fixes clang-tidy warning.
The "takeName" logic in ScalarizerVisitor::gather did not consider
that the value vector could refer to non-instructions, such as
global variables. This patch make sure that we avoid changing the
name of a value if it isn't an instruction.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D87685
I've amended the isLoadInvariantInLoop function to bail out for
scalable vectors for now since the invariant.start intrinsic is only
ever generated by the clang frontend for thread locals or struct
and class constructors, neither of which support sizeless types.
In addition, the intrinsic itself does not currently support the
concept of a scaled size, which makes it impossible to compare
the sizes of different scalable objects, e.g. <vscale x 32 x i8>
and <vscale x 16 x i8>.
Added new tests here:
Transforms/LICM/AArch64/sve-load-hoist.ll
Transforms/LICM/hoisting.ll
Differential Revision: https://reviews.llvm.org/D87227
As to not conflict with the legacy PM example passes under
llvm/lib/Transforms/Hello, this is under HelloNew. This makes the
CMakeLists.txt and general directory structure less confusing for people
following the example.
Much of the doc structure was taken from WritinAnLLVMPass.rst.
This adds a HelloWorld pass which simply prints out each function name.
More will follow after this, e.g. passes over different units of IR, analyses.
https://llvm.org/docs/WritingAnLLVMPass.html contains a lot more.
Relanded with missing "Support" dependency in LLVMBuild.txt.
Reviewed By: ychen, asbirlea
Differential Revision: https://reviews.llvm.org/D86979
As to not conflict with the legacy PM example passes under
llvm/lib/Transforms/Hello, this is under HelloNew. This makes the
CMakeLists.txt and general directory structure less confusing for people
following the example.
Much of the doc structure was taken from WritinAnLLVMPass.rst.
This adds a HelloWorld pass which simply prints out each function name.
More will follow after this, e.g. passes over different units of IR, analyses.
https://llvm.org/docs/WritingAnLLVMPass.html contains a lot more.
Reviewed By: ychen, asbirlea
Differential Revision: https://reviews.llvm.org/D86979
This is consistent with the clang option added in
7ed8124d46, and the comments on the
runtime patch in D87120.
Differential Revision: https://reviews.llvm.org/D87622
For selects of the type X == Y ? A : B, check if we can simplify A
by using the X == Y equality and replace the operand if that's
possible. We already try to do this in InstSimplify, but will only
fold if the result of the simplification is the same as B, in which
case the select can be dropped entirely. Here the select will be
retained, just one operand simplified.
As we are performing an actual replacement here, we don't have
problems with refinement / poison values.
Differential Revision: https://reviews.llvm.org/D87480
AliasAnalysis/MemoryLocation does not account for loops. Two
MemoryLocation can be must-overwrite, even if the first one writes
multiple locations in a loop.
This patch prevents removing such stores, by only considering candidates
that are known to be loop invariant, or executed in the same BB.
Currently the invariant check is quite conservative and only considers
Alloca and Alloca-like instructions and arguments as invariant base pointers.
It also considers GEPs with all constant indices and invariant bases as
invariant.
This can be improved in the future, but the current implementation has
only minor impact on the total number of stores eliminated (25903 vs
26047 for the baseline). There are some 2-10% swings for some individual
benchmarks. In roughly half of the cases, the number of stores removed
increases actually, because we skip candidates that are unlikely to be
valid candidates early.
The function LoopIdiomRecognize::isLegalStore looks for stores in loops
that could be transformed into memset or memcpy. However, the algorithm
currently requires that we know how big the store is at runtime, i.e.
that the store size will not overflow an unsigned integer. For scalable
vectors we cannot guarantee this so I have changed the code to bail out
for now. In addition, even if we add a way to query the maximum value of
vscale in future we will still need to update the algorithm to cope with
non-constant strides. The additional cost associated with calculating
the memset and memcpy arguments will need to be taken into account as
well.
This patch also fixes up an implicit TypeSize -> uint64_t cast,
thereby removing a warning. I've added tests here showing a fixed
width vector loop being transformed into memcpy, and a scalable
vector loop remaining unchanged:
Transforms/LoopIdiom/memcpy-vectors.ll
Differential Revision: https://reviews.llvm.org/D87439
This fixes PR47297.
When ProcessBlock() was able to constant fold the terminator's
condition, but not do any more transformations, the function would
return false, which would lead to the JumpThreading pass returning an
incorrect modified status. This patch makes so that ProcessBlock()
returns true in such cases. This will trigger an unnecessary invocation
of ProcessBlock() in such cases, but this should be rare to occur.
This was caught using the check introduced by D80916.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87392
Turns out this was use-after-move of function_ref, which is trivially
copyable and movable, so the move did nothing and use after move was
safe.
But since this function_ref is being copied into a std::function, change
the function_ref to be std::function to avoid extra layers of type
erasure indirection - and then it's a real use after move, and fix that
by referring to the moved-to member variable rather than the moved-from
parameter.
gcov is an "Edge Profiling with Edge Counters" application according to
Optimally Profiling and Tracing Programs (1994).
The minimum number of counters necessary is |E|-(|V|-1). The unmeasured edges
form a spanning tree. Both GCC --coverage and clang -fprofile-generate leverage
this optimization. This patch implements the optimization for clang --coverage.
The produced .gcda files are much smaller now.
i.e. change the work flow from
* .gcno for function A
* .gcno for function B
* .gcno for function C
* .gcda for function A
* .gcda for function B
* .gcda for function C
to
* .gcno for function A
* .gcda for function A
* .gcno for function B
* .gcda for function B
* .gcno for function C
* .gcda for function C
Currently there is duplicate logic in .gcno & .gcda processing: how functions
are filtered, which edges are instrumented, etc. This refactor enables simplification.
Since we always process .gcno, in -fprofile-arcs -fno-test-coverage mode,
__llvm_internal_gcov_emit_function_args.0 will have non-zero checksums.
When deleting stores at the end of a function, we have to do PHI
translation, otherwise we might miss reads in different iterations of a
loop. See multiblock-loop-carried-dependence.ll for details.
This fixes a mis-compile and surprisingly also increases the number of
eliminated stores from 26047 to 26572 for MultiSource/SPEC2000/SPEC2006
on X86 with -O3 -flto. This is most likely because we save budget by not
exploring through MemoryPhis, which are less likely to result in valid
candidates for elimination.
The issue was reported post-commit for fb109c42d9.
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.
For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.
Differential Revision: https://reviews.llvm.org/D75512
NOTE: There is a mailing list discussion on this: http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
Complemantary to the assumption outliner prototype in D71692, this patch
shows how we could simplify the code emitted for an alignemnt
assumption. The generated code is smaller, less fragile, and it makes it
easier to recognize the additional use as a "assumption use".
As mentioned in D71692 and on the mailing list, we could adopt this
scheme, and similar schemes for other patterns, without adopting the
assumption outlining.
This is a followup to D86834, which partially fixed this issue in
InstSimplify. However, InstCombine repeats the same transform while
dropping poison flags -- which does not cover cases where poison is
introduced in some other way.
The fix here is a bit more comprehensive, because things are quite
entangled, and it's hard to only partially address it without
regressing optimization. There are really two changes here:
* Export the SimplifyWithOpReplaced API from InstSimplify, with an
added AllowRefinement flag. For replacements inside the TrueVal
we don't actually care whether refinement occurs or not, the
replacement is always legal. This part of the transform is now
done in InstSimplify only. (It should be noted that the current
AllowRefinement check is not sufficient -- that's an issue we
need to address separately.)
* Change the InstCombine fold to work by temporarily dropping
poison generating flags, running the fold and then restoring the
flags if it didn't work out. This will ensure that the InstCombine
fold is correct as long as the InstSimplify fold is correct.
Differential Revision: https://reviews.llvm.org/D87445
The test example based on PR47450 shows that we can
match non-byte-sized shifts, but those won't ever be
bswap opportunities. This isn't a full fix (we'd still
match if the shifts were by 8-bits for example), but
this should be enough until there's evidence that we
need to do more (this is a borderline case for
vectorization in the first place).
When inlining functions containing allocas of scalable vectors we
cannot specify the size in the lifetime markers, since we don't
know this at compile time.
Added new test here:
test/Transforms/Inline/AArch64/sve-alloca-merge.ll
Differential Revision: https://reviews.llvm.org/D87139
In addition to calculate hash consistently by swapping SELECT's
operands, we also need to inverse the select pattern favor to match the
original logic.
[EarlyCSE] Equivalent SELECTs should hash equally
DenseMap<SimpleValue> assumes that, if its isEqual method returns true
for two elements, then its getHashValue method must return the same value
for them. This invariant is broken when one SELECT node is a min/max
operation, and the other can be transformed into an equivalent min/max by
inverting its predicate and swapping its operands. This patch fixes an
assertion failure that would occur intermittently while compiling the
following IR:
define i32 @t(i32 %i) {
%cmp = icmp sle i32 0, %i
%twin1 = select i1 %cmp, i32 %i, i32 0
%cmpinv = icmp sgt i32 0, %i
%twin2 = select i1 %cmpinv, i32 0, i32 %i
%sink = add i32 %twin1, %twin2
ret i32 %sink
}
Differential Revision: https://reviews.llvm.org/D86843
The tests have been updated and I plan to move them from the MSSA
directory up.
Some end-to-end tests needed small adjustments. One difference to the
legacy DSE is that legacy DSE also deletes trivially dead instructions
that are unrelated to memory operations. Because MemorySSA-backed DSE
just walks the MemorySSA, we only visit/check memory instructions. But
removing unrelated dead instructions is not really DSE's job and other
passes will clean up.
One noteworthy change is in llvm/test/Transforms/Coroutines/ArgAddr.ll,
but I think this comes down to legacy DSE not handling instructions that
may throw correctly in that case. To cover this with MemorySSA-backed
DSE, we need an update to llvm.coro.begin to treat it's return value to
belong to the same underlying object as the passed pointer.
There are some minor cases MemorySSA-backed DSE currently misses, e.g. related
to atomic operations, but I think those can be implemented after the switch.
This has been discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-August/144417.html
For the MultiSource/SPEC2000/SPEC2006 the number of eliminated stores
goes from ~17500 (legayc DSE) to ~26300 (MemorySSA-backed). More numbers
and details in the thread on llvm-dev.
Impact on CTMark:
```
Legacy Pass Manager
exec instrs size-text
O3 + 0.60% - 0.27%
ReleaseThinLTO + 1.00% - 0.42%
ReleaseLTO-g. + 0.77% - 0.33%
RelThinLTO (link only) + 0.87% - 0.42%
RelLO-g (link only) + 0.78% - 0.33%
```
http://llvm-compile-time-tracker.com/compare.php?from=3f22e96d95c71ded906c67067d75278efb0a2525&to=ae8be4642533ff03803967ee9d7017c0d73b0ee0&stat=instructions
```
New Pass Manager
exec instrs. size-text
O3 + 0.95% - 0.25%
ReleaseThinLTO + 1.34% - 0.41%
ReleaseLTO-g. + 1.71% - 0.35%
RelThinLTO (link only) + 0.96% - 0.41%
RelLO-g (link only) + 2.21% - 0.35%
```
http://195.201.131.214:8000/compare.php?from=3f22e96d95c71ded906c67067d75278efb0a2525&to=ae8be4642533ff03803967ee9d7017c0d73b0ee0&stat=instructions
Reviewed By: asbirlea, xbolva00, nikic
Differential Revision: https://reviews.llvm.org/D87163
DenseMap<SimpleValue> assumes that, if its isEqual method returns true
for two elements, then its getHashValue method must return the same value
for them. This invariant is broken when one SELECT node is a min/max
operation, and the other can be transformed into an equivalent min/max by
inverting its predicate and swapping its operands. This patch fixes an
assertion failure that would occur intermittently while compiling the
following IR:
define i32 @t(i32 %i) {
%cmp = icmp sle i32 0, %i
%twin1 = select i1 %cmp, i32 %i, i32 0
%cmpinv = icmp sgt i32 0, %i
%twin2 = select i1 %cmpinv, i32 0, i32 %i
%sink = add i32 %twin1, %twin2
ret i32 %sink
}
Differential Revision: https://reviews.llvm.org/D86843
Bail from maskIsAllZeroOrUndef and maskIsAllOneOrUndef prior to iterating over the number of
elements for scalable vectors.
Assert that the mask type is not scalable in possiblyDemandedEltsInMask .
Assert that the types are correct in all three functions.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87424
Previously we could match fcmp+select to a reduction if the fcmp had
the nonans fast math flag. But if the select had the nonans fast
math flag, InstCombine would turn it into a fminnum/fmaxnum intrinsic
before SLP gets to it. Seems fairly likely that if one of the
fcmp+select pair have the fast math flag, they both would.
My plan is to start vectorizing the fmaxnum/fminnum version soon,
but I wanted to get this code out as it had some of the strangest
fast math flag behaviors.
PGOInstrumentation runs `SplitIndirectBrCriticalEdges` but some IndirectBrInst
critical edge cannot be split. `getInstrBB` will crash when calling `SplitCriticalEdge`, e.g.
int foo(char *p) {
void *targets[2];
targets[0] = &&indirect;
targets[1] = &&end;
for (;; p++)
if (*p == 7) {
indirect:
goto *targets[p[1]]; // the self loop is critical in -O
}
end:
return 0;
}
Skip such critical edges to prevent a crash.
Reviewed By: davidxl, lebedev.ri
Differential Revision: https://reviews.llvm.org/D87435
The argument promotion pass currently fails to copy function annotations
over to the modified function after promoting arguments.
This patch copies the original function annotation to the new function.
Reviewed By: fhann
Differential Revision: https://reviews.llvm.org/D86630
See discussion in D87149. Dropping volatile stores here is legal
per LLVM semantics, but causes issues for real code and may result
in a change to LLVM volatile semantics. Temporarily treat volatile
stores as "not guaranteed to transfer execution" in just this place,
until this issue has been resolved.
MemoryLocation has been taught about memcpy.inline, which means we can
get the memory locations read and written by it. This means DSE can
handle memcpy.inline