This attempts to teach the cost model in Arm that code such as:
%s = shl i32 %a, 3
%a = and i32 %s, %b
Can under Arm or Thumb2 become:
and r0, r1, r2, lsl #3
So the cost of the shift can essentially be free. To do this without
trying to artificially adjust the cost of the "and" instruction, it
needs to get the users of the shl and check if they are a type of
instruction that the shift can be folded into. And so it needs to have
access to the actual instruction in getArithmeticInstrCost, which if
available is added as an extra parameter much like getCastInstrCost.
We otherwise limit it to shifts with a single user, which should
hopefully handle most of the cases. The list of instruction that the
shift can be folded into include ADC, ADD, AND, BIC, CMP, EOR, MVN, ORR,
ORN, RSB, SBC and SUB. This translates to Add, Sub, And, Or, Xor and
ICmp.
Differential Revision: https://reviews.llvm.org/D70966
Currently we fail to pick the right insertion point when
PreviousLastPart of a first-order-recurrence is a PHI node not in the
LoopVectorBody. This can happen when PreviousLastPart is produce in a
predicated block. In that case, we should pick the insertion point in
the BB the PHI is in.
Fixes PR44020.
Reviewers: hsaito, fhahn, Ayal, dorit
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D71071
AssumptionCache can be null in SimplifyCFGOptions. However, FoldCondBranchOnPHI() was not properly handling that when passing a null AssumptionCache to simplifyCFG.
Patch by Rodrigo Caetano Rocha <rcor.cs@gmail.com>
Reviewers: fhahn, lebedev.ri, spatel
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D69963
The file is intended to gather various VPlan transformations, not only
CFG related transforms. Actually, the only transformation there is not
CFG related.
Reviewers: Ayal, gilr, hsaito, rengolin
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D70732
Summary:
Sample profile loader of AutoFDO tries to replay previous inlining using context sensitive profile. The replay only repeats inlining if the call site block is hot. As a result it punts inlining of small functions, some of which can be beneficial for size, and will still be inlined by CSGCC inliner later. The oscillation between sample profile loader's inlining and regular CGSSC inlining cause unnecessary loss of context-sensitive profile. It doesn't have much impact for inline decision itself, but it negatively affects post-inline profile quality as CGSCC inliner have to scale counts which is not as accurate as the original context sensitive profile, and bad post-inline profile can misguide code layout.
This change added regular Inline Cost calculation for sample profile loader, so we can inline small functions upfront under switch -sample-profile-inline-size. In addition -sample-profile-cold-inline-threshold is added so we can tune the separate size threshold - currently the default is chosen to be the same as regular inliner's cold call-site threshold.
Reviewers: wmi, davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70750
InnerLoopVectorizer's code called during VPlan execution still relies on
original IR's def-use relations to decide which vector code to generate,
limiting VPlan transformations ability to modify def-use relations and still
have ILV generate the vector code.
This commit moves GEP operand queries controlling how GEPs are widened to a
dedicated recipe and extracts GEP widening code to its own ILV method taking
those recorded decisions as arguments. This reduces ingredient def-use usage by
ILV as a step towards full VPlan-based def-use relations.
Differential revision: https://reviews.llvm.org/D69067
In general ValueHandleBase::ValueIsRAUWd shouldn't be called when not
all uses of the value were actually replaced, though, currently
formLCSSAForInstructions calls it when it inserts LCSSA-phis.
Calls of ValueHandleBase::ValueIsRAUWd were added to LCSSA specifically
to update/invalidate SCEV. In the best case these calls duplicate some
of the work already done by SE->forgetValue, though in case when SCEV of
the value is SCEVUnknown, SCEV replaces the underlying value of
SCEVUnknown with the new value (i.e. acts like LCSSA-phi actually fully
replaces the value it is created for), which leads to SCEV being
corrupted because LCSSA-phi rarely dominates all uses of its inputs.
Fixes bug https://bugs.llvm.org/show_bug.cgi?id=44058.
Reviewers: fhahn, efriedma, reames, sanjoy.google
Reviewed By: fhahn
Subscribers: hiraditya, javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70593
Summary:
Add an option to allow the attribute propagation on the index to be
disabled, to allow a workaround for issues (such as that fixed by
D70977).
Also move the setting of the WithAttributePropagation flag on the index
into propagateAttributes(), and remove some old stale code that predated
this flag and cleared the maybe read/write only bits when we need to
disable the propagation (previously only when importing disabled, now
also when the new option disables it).
Reviewers: evgeny777, steven_wu
Subscribers: mehdi_amini, inglorion, hiraditya, dexonsmith, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70984
Summary:
AutoFDO's sample profile loader processes function in arbitrary source code order, so if I change the order of two functions in source code, the inline decision can change. This also prevented the use of context-sensitive profile to do specialization while inlining. This commit enforces SCC top-down order for sample profile loader. With this change, we can now do specialization, as illustrated by the added test case:
Say if we have A->B->C and D->B->C call path, we want to inline C into B when root inliner is B, but not when root inliner is A or D, this is not possible without enforcing top-down order. E.g. Once C is inlined into B, A and D can only choose to inline (B->C) as a whole or nothing, but what we want is only inline B into A and D, not its recursive callee C. If we process functions in top-down order, this is no longer a problem, which is what this commit is doing.
This change is guarded with a new switch "-sample-profile-top-down-load" for tuning, and it depends on D70653. Eventually, top-down can be the default order for sample profile loader.
Reviewers: wmi, davidxl
Subscribers: hiraditya, llvm-commits, tejohnson
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70655
Summary:
When sample profile loader decides not to inline a previously inlined call-site, we adjust the profile of outlined function simply by scaling up its profile counts by call-site count. This means the context-sensitive profile of that inlined instance will be thrown away. This commit try to keep context-sensitive profile for such cases:
- Instead of scaling outlined function's profile, we now properly merge the FunctionSamples of inlined instance into outlined function, including all recursively inlined profile.
- Instead of adjusting the profile for negative inline decision at the end of the sample profile loader pass, we do the profile merge right after processing each function. This change paired with top-down ordering of annotation/inline-replay (a separate diff) will make sure we recursively merge profile back before the profile is used for annotation and inline replay.
A new switch -sample-profile-merge-inlinee is added to enable the new profile merge for tuning. It should be the default behavior eventually.
Reviewers: wmi, davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70653
Summary:
Emit a value debug intrinsic (with OP_deref) when an alloca address is
passed to a function call after going through a bitcast.
This generates an FP or SP-relative location for the local variable in
the following case:
int x;
use((void *)&x;
Reviewers: aprantl, vsk, pcc
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70752
Summary:
This reverts commit c3b06d0c39.
Reason for revert: Caused miscompiles when inserting assume for undef.
Also adds a test to prevent similar breakage in future.
Fixes PR44154.
Reviewers: rnk, jdoerfert, efriedma, xbolva00
Reviewed By: rnk
Subscribers: thakis, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70933
Summary:
D68408 proposes to greatly improve our negation sinking abilities.
But in current canonicalization, we produce `sub A, zext(B)`,
which we will consider non-canonical and try to sink that negation,
undoing the existing canonicalization.
So unless we explicitly stop producing previous canonicalization,
we will have two conflicting folds, and will end up endlessly looping.
This inverts canonicalization, and adds back the obvious fold
that we'd miss:
* `sub [nsw] Op0, sext/zext (bool Y) -> add [nsw] Op0, zext/sext (bool Y)`
https://rise4fun.com/Alive/xx4
* `sext(bool) + C -> bool ? C - 1 : C`
https://rise4fun.com/Alive/fBl
It is obvious that `@ossfuzz_9880()` / `@lshr_out_of_range()`/`@ashr_out_of_range()`
(oss-fuzz 4871) are no longer folded as much, though those aren't really worrying.
Reviewers: spatel, efriedma, t.p.northover, hfinkel
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71064
The patch makes sure that the LastThrowing pointer does not point to any instruction deleted by call to DeleteDeadInstruction.
While iterating through the instructions the pass maintains a pointer to the lastThrowing Instruction. A call to deleteDeadInstruction deletes a dead store and other instructions feeding the original dead instruction which also become dead. The instruction pointed by the lastThrowing pointer could also be deleted by the call to DeleteDeadInstruction and thus it becomes a dangling pointer. Because of this, we see an error in the next iteration.
In the patch, we maintain a list of throwing instructions encountered previously and use the last non deleted throwing instruction from the container.
Patch by Ankit <quic_aankit@quicinc.com>
Reviewers: fhahn, bcahoon, efriedma
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D65326
This makes no difference currently because we don't apply FMF
to FP casts, but that may change.
This could also be a place to add a fold for select with fptrunc,
so it will make that patch easier/smaller.
Summary:
D69561/dde5893 enabled importing of readonly variables with references,
however, it introduced a bug relating to importing/internalization of
writeonly variables with references.
A fix for this was added in D70006/7f92d66. But this didn't work in
distributed ThinLTO mode. The reason is that the fix (importing the
writeonly var with a zeroinitializer) was only applied when there were
references on the writeonly var summary. In distributed ThinLTO mode,
where we only have a small slice of the index, we will not have the
references on the importing side if we are not importing those
referenced values. Rather than changing this handshaking (which will
require a lot of other changes, since that's how we know what to import
in the distributed backend clang invocation), we can simply always give
the writeonly variable a zero initializer.
Reviewers: evgeny777, steven_wu
Subscribers: mehdi_amini, inglorion, hiraditya, dexonsmith, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70977
Revise the coverage mapping format to reduce binary size by:
1. Naming function records and marking them `linkonce_odr`, and
2. Compressing filenames.
This shrinks the size of llc's coverage segment by 82% (334MB -> 62MB)
and speeds up end-to-end single-threaded report generation by 10%. For
reference the compressed name data in llc is 81MB (__llvm_prf_names).
Rationale for changes to the format:
- With the current format, most coverage function records are discarded.
E.g., more than 97% of the records in llc are *duplicate* placeholders
for functions visible-but-not-used in TUs. Placeholders *are* used to
show under-covered functions, but duplicate placeholders waste space.
- We reached general consensus about giving (1) a try at the 2017 code
coverage BoF [1]. The thinking was that using `linkonce_odr` to merge
duplicates is simpler than alternatives like teaching build systems
about a coverage-aware database/module/etc on the side.
- Revising the format is expensive due to the backwards compatibility
requirement, so we might as well compress filenames while we're at it.
This shrinks the encoded filenames in llc by 86% (12MB -> 1.6MB).
See CoverageMappingFormat.rst for the details on what exactly has
changed.
Fixes PR34533 [2], hopefully.
[1] http://lists.llvm.org/pipermail/llvm-dev/2017-October/118428.html
[2] https://bugs.llvm.org/show_bug.cgi?id=34533
Differential Revision: https://reviews.llvm.org/D69471
The PHI node checks for inner loop exits are too permissive currently.
As indicated by an existing comment, we should only allow LCSSA PHI
nodes that are part of reductions or are only used outside of the loop
nest. We ensure this by checking the users of the LCSSA PHIs.
Specifically, it is not safe to use an exiting value from the inner loop in the latch of the outer
loop.
It also moves the inner loop exit check before the outer loop exit
check.
Fixes PR43473.
Reviewers: efriedma, mcrosier
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D68144
Summary:
This is one more prep step necessary before the code gen pass instrumentation
code could go in.
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70988
When basic blocks are killed, either due to being empty or to being an if.then
or if.else block whose complement contains identical instructions, some of the
debug intrinsics in that block are lost. This patch sinks those intrinsics
into the single successor block, setting them Undef if necessary to
prevent debug info from falling out-of-date.
Differential Revision: https://reviews.llvm.org/D70318
SCEV caches the exiting blocks when computing exit counts. In
SimpleLoopUnswitch, we split the exit block of the loop to unswitch.
Currently we only invalidate the loop containing that exit block, but if
that block is the exiting block for a parent loop, we have stale cache
entries. We have to invalidate the top-most loop that contains the exit
block as exiting block. We might also be able to skip invalidating the
loop containing the exit block, if the exit block is not an exiting
block of that loop.
There are also 2 more places in SimpleLoopUnswitch, that use a similar
problematic approach to get the loop to invalidate. If the patch makes
sense, I will also update those places to a similar approach (they deal
with multiple exit blocks, so we cannot directly re-use
getTopMostExitingLoop).
Fixes PR43972.
Reviewers: skatkov, reames, asbirlea, chandlerc
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D70786
Constructor invocations such as `APFloat(APFloat::IEEEdouble(), 0.0)`
may seem like they accept a FP (floating point) value, but the overload
they reach is actually the `integerPart` one, not a `float` or `double`
overload (which only exists when `fltSemantics` isn't passed).
This may lead to possible loss of data, by the conversion from `float`
or `double` to `integerPart`.
To prevent future mistakes, a new constructor overload, which accepts
any FP value and marked with `delete`, to prevent its usage.
Fixes PR34095.
Differential Revision: https://reviews.llvm.org/D70425
This reverts these two commits
[InstCombine] Turn (extractelement <1 x i64/double> (bitcast (x86_mmx))) into a single bitcast from x86_mmx to i64/double.
[InstCombine] Don't transform bitcasts between x86_mmx and v1i64 into insertelement/extractelement
We're seeing at least one internal test failure related to a
bitcast that was previously before an inline assembly block
containing emms being placed after it. This leads to the mmx
state ending up not empty after the emms. IR has no way to
make any specific guarantees about this. Reverting these patches
to get back to previous behavior which at least worked for this
test.
Fix PR40816: avoid considering scalar-with-predication instructions as also
uniform-after-vectorization.
Instructions identified as "scalar with predication" will be "vectorized" using
a replicating region. If such instructions are also optimized as "uniform after
vectorization", namely when only the first of VF lanes is used, such a
replicating region becomes erroneous - only the first instance of the region can
and should be formed. Fix such cases by not considering such instructions as
"uniform after vectorization".
Differential Revision: https://reviews.llvm.org/D70298
Summary:
Make SLPVectorize to recognize homogeneous aggregates like
`{<2 x float>, <2 x float>}`, `{{float, float}, {float, float}}`,
`[2 x {float, float}]` and so on.
It's a follow-up of https://reviews.llvm.org/D70068.
Merged `findBuildVector()` and `findBuildAggregate()` to
one `findBuildAggregate()` function making it recursive
to recognize multidimensional aggregates. Aggregates required
to be homogeneous.
Reviewers: RKSimon, ABataev, dtemirbulatov, spatel, vporpo
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70587
This adds a dump() function to VPlan, which uses the existing
operator<<.
This method provides a convenient way to dump a VPlan while debugging,
e.g. from lldb.
Reviewers: hsaito, Ayal, gilr, rengolin
Reviewed By: hsaito
Differential Revision: https://reviews.llvm.org/D70920
Summary:
This fixes https://llvm.org/PR26673
"Wrong debugging information with -fsanitize=address"
where asan instrumentation causes the prologue end to be computed
incorrectly: findPrologueEndLoc, looks for the first instruction
with a debug location to determine the prologue end. Since the asan
instrumentation instructions had debug locations, that prologue end was
at some instruction, where the stack frame is still being set up.
There seems to be no good reason for extra debug locations for the
asan instrumentations that set up the frame; they don't have a natural
source location. In the debugger they are simply located at the start
of the function.
For certain other instrumentations like -fsanitize-coverage=trace-pc-guard
the same problem persists - that might be more work to fix, since it
looks like they rely on locations of the tracee functions.
This partly reverts aaf4bb2394
"[asan] Set debug location in ASan function prologue"
whose motivation was to give debug location info to the coverage callback.
Its test only ensures that the call to @__sanitizer_cov_trace_pc_guard is
given the correct source location; as the debug location is still set in
ModuleSanitizerCoverage::InjectCoverageAtBlock, the test does not break.
So -fsanitize-coverage is hopefully unaffected - I don't think it should
rely on the debug locations of asan-generated allocas.
Related revision: 3c6c14d14b
"ASAN: Provide reliable debug info for local variables at -O0."
Below is how the X86 assembly version of the added test case changes.
We get rid of some .loc lines and put prologue_end where the user code starts.
```diff
--- 2.master.s 2019-12-02 12:32:38.982959053 +0100
+++ 2.patch.s 2019-12-02 12:32:41.106246674 +0100
@@ -45,8 +45,6 @@
.cfi_offset %rbx, -24
xorl %eax, %eax
movl %eax, %ecx
- .Ltmp2:
- .loc 1 3 0 prologue_end # 2.c:3:0
cmpl $0, __asan_option_detect_stack_use_after_return
movl %edi, 92(%rbx) # 4-byte Spill
movq %rsi, 80(%rbx) # 8-byte Spill
@@ -57,9 +55,7 @@
callq __asan_stack_malloc_0
movq %rax, 72(%rbx) # 8-byte Spill
.LBB1_2:
- .loc 1 0 0 is_stmt 0 # 2.c:0:0
movq 72(%rbx), %rax # 8-byte Reload
- .loc 1 3 0 # 2.c:3:0
cmpq $0, %rax
movq %rax, %rcx
movq %rax, 64(%rbx) # 8-byte Spill
@@ -72,9 +68,7 @@
movq %rax, %rsp
movq %rax, 56(%rbx) # 8-byte Spill
.LBB1_4:
- .loc 1 0 0 # 2.c:0:0
movq 56(%rbx), %rax # 8-byte Reload
- .loc 1 3 0 # 2.c:3:0
movq %rax, 120(%rbx)
movq %rax, %rcx
addq $32, %rcx
@@ -99,7 +93,6 @@
movb %r8b, 31(%rbx) # 1-byte Spill
je .LBB1_7
# %bb.5:
- .loc 1 0 0 # 2.c:0:0
movq 40(%rbx), %rax # 8-byte Reload
andq $7, %rax
addq $3, %rax
@@ -118,7 +111,8 @@
movl %ecx, (%rax)
movq 80(%rbx), %rdx # 8-byte Reload
movq %rdx, 128(%rbx)
- .loc 1 4 3 is_stmt 1 # 2.c:4:3
+.Ltmp2:
+ .loc 1 4 3 prologue_end # 2.c:4:3
movq %rax, %rdi
callq f
movq 48(%rbx), %rax # 8-byte Reload
```
Reviewers: eugenis, aprantl
Reviewed By: eugenis
Subscribers: ormris, aprantl, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70894
Summary:
This cropped up in the Linux kernel where cold code was placed in an
incompatible section.
Reviewers: compnerd, vsk, tejohnson
Reviewed By: vsk
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70925
Summary:
In case of a need to distinguish different query sites for gradual commit or
debugging of PGSO. NFC.
Reviewers: davidxl
Subscribers: hiraditya, zzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70510
By defining the graph traits right after the VPBlockBase definitions, we
can make use of them earlier in the file.
Reviewers: hsaito, Ayal, gilr
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D70733
As described here:
https://bugs.llvm.org/show_bug.cgi?id=44186
The match() code safely allows undef values, but we can't safely
propagate a vector constant that contains an undef to the new
compare instruction.
Summary:
If a user writing C code using the ACLE MVE intrinsics generates a
predicate and then complements it, then the resulting IR will use the
`pred_v2i` IR intrinsic to turn some `<n x i1>` vector into a 16-bit
integer; complement that integer; and convert back. This will generate
machine code that moves the predicate out of the `P0` register,
complements it in an integer GPR, and moves it back in again.
This InstCombine rule replaces `i2v(~v2i(x))` with a direct complement
of the original predicate vector, which we can already instruction-
select as the VPNOT instruction which complements P0 in place.
Reviewers: ostannard, MarkMurrayARM, dmgreen
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70484
rL341831 moved one-use check higher up, restricting a few folds
that produced a single instruction from two instructions to the case
where the inner instruction would go away.
Original commit message:
> InstCombine: move hasOneUse check to the top of foldICmpAddConstant
>
> There were two combines not covered by the check before now,
> neither of which actually differed from normal in the benefit analysis.
>
> The most recent seems to be because it was just added at the top of the
> function (naturally). The older is from way back in 2008 (r46687)
> when we just didn't put those checks in so routinely, and has been
> diligently maintained since.
From the commit message alone, there doesn't seem to be a
deeper motivation, deeper problem that was trying to solve,
other than 'fixing the wrong one-use check'.
As i have briefly discusses in IRC with Tim, the original motivation
can no longer be recovered, too much time has passed.
However i believe that the original fold was doing the right thing,
we should be performing such a transformation even if the inner `add`
will not go away - that will still unchain the comparison from `add`,
it will no longer need to wait for `add` to compute.
Doing so doesn't seem to break any particular idioms,
as least as far as i can see.
References https://bugs.llvm.org/show_bug.cgi?id=44100
If the sign of the sign argument is known (this could be extended to use ValueTracking),
then we can use fneg+fabs to clear/set the sign bit of the magnitude argument.
http://llvm.org/docs/LangRef.html#llvm-copysign-intrinsic
This transform is already done in DAGCombiner, but we can do it sooner in IR as
suggested in PR44153:
https://bugs.llvm.org/show_bug.cgi?id=44153
We have effectively no analysis for copysign in IR, so we are taking the unusual step
of increasing the number of IR instructions for the negative constant case.
Differential Revision: https://reviews.llvm.org/D70792