Port the wide immediate case from AArch64DAGToDAGISel::SelectAddrModeXRO.
If we have a wide immediate which can't be represented in an add, we can end up
with code like this:
```
mov x0, imm
add x1, base, x0
ldr x2, [x1, 0]
```
If we use the [base, xN] addressing mode instead, we can produce this:
```
mov x0, imm
ldr x2, [base, x0]
```
This saves 0.4% code size on 7zip at -O3, and gives a geomean code size
improvement of 0.1% on CTMark.
Differential Revision: https://reviews.llvm.org/D84784
I never completed the work on the patches referenced by
f8bf7d7f42, but this was intended to
avoid folding immediate writes into m0 which the coalescer doesn't
understand very well. Relax this to allow simple SGPR immediates to
fold directly into VGPR copies. This pattern shows up routinely in
current GlobalISel code since nothing is smart enough to emit VGPR
constants yet.
When it was first created, CFGSort only made sure BBs in each
`MachineLoop` are sorted together. After we added exception support,
CFGSort now also sorts BBs in each `WebAssemblyException`, which
represents a `catch` block, together, and
`Region` class was introduced to be a thin wrapper for both
`MachineLoop` and `WebAssemblyException`.
But how we compute those loops and exceptions is different.
`MachineLoopInfo` is constructed using the standard loop computation
algorithm in LLVM; the definition of loop is "a set of BBs that are
dominated by a loop header and have a path back to the loop header". So
even if some BBs are semantically contained by a loop in the original
program, or in other words dominated by a loop header, if they don't
have a path back to the loop header, they are not considered a part of
the loop. For example, if a BB is dominated by a loop header but
contains `call abort()` or `rethrow`, it wouldn't have a path back to
the header, so it is not included in the loop.
But `WebAssemblyException` is wasm-specific data structure, and its
algorithm is simple: a `WebAssemblyException` consists of an EH pad and
all BBs dominated by the EH pad. So this scenario is possible: (This is
also the situation in the newly added test in cfg-stackify-eh.ll)
```
Loop L: header, A, ehpad, latch
Exception E: ehpad, latch, B
```
(B contains `abort()`, so it does not have a path back to the loop
header, so it is not included in L.)
And it is sorted in this order:
```
header
A
ehpad
latch
B
```
And when CFGStackify places `end_loop` or `end_try` markers, it
previously used `WebAssembly::getBottom()`, which returns the latest BB
in the sorted order, and placed the marker there. So in this case the
marker placements will be like this:
```
loop
header
try
A
catch
ehpad
latch
end_loop <-- misplaced!
B
end_try
```
in which nesting between the loop and the exception is not correct.
`end_loop` marker has to be placed after `B`, and also after `end_try`.
Maybe the fundamental way to solve this problem is to come up with our
own algorithm for computing loop region too, in which we include all BBs
dominated by a loop header in a loop. But this takes a lot more effort.
The only thing we need to fix is actually, `getBottom()`. If we make it
return the right BB, which means in case of a loop, the latest BB of the
loop itself and all exceptions contained in there, we are good.
This renames `Region` and `RegionInfo` to `SortRegion` and
`SortRegionInfo` and extracts them into their own file. And add
`getBottom` to `SortRegionInfo` class, from which it can access
`WebAssemblyExceptionInfo`, so that it can compute a correct bottom
block for loops.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D84724
Currently, `target create` has no --platform option. However,
TargetList::CreateTargetInternal which is called under the hood, will
return an error when either no platform or multiple matching platforms
are found, saying that a platform should be specified with --platform.
This patch adds the platform option, but that doesn't solve either of
these errors.
- If more than one platform matches, specifying the platform isn't
going to fix that. The current code will only look at the
architecture instead. I've updated the error message to ask the user
to specify an architecture.
- If no architecture is found, specifying a new one via platform isn't
going to change that either because we already try to find one that
matches the given architecture.
Differential revision: https://reviews.llvm.org/D84809
-fno-lto is in SANITIZER_COMMON_CFLAGS but not here.
Don't use SANITIZER_COMMON_CFLAGS because of performance issues.
See https://bugs.llvm.org/show_bug.cgi?id=46838.
Fixes
$ ninja TScudoCUnitTest-i386-Test
on an LLVM build with -DLLVM_ENABLE_LTO=Thin.
check-scudo now passes.
Reviewed By: cryptoad
Differential Revision: https://reviews.llvm.org/D84805
This is to avoid the need to update a bunch of test files when the PGO
instrumentation function hashing changes.
Split off of D84782.
Differential Revision: https://reviews.llvm.org/D84865
This matches the behavior of simplify calls for regular opcodes -
rely on ConstantFolding before spending time on folds with variables.
I am not aware of any diffs from this re-ordering currently, but there was
potential for unintended behavior from the min/max intrinsics because that
code is implicitly assuming that only 1 of the input operands is constant.
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
Summary:
PPC only supports the instruction selection for v16i8, v8i16, v4i32,
v2i64, v4f32 and v2f64 for ISD::SETCC, don't support the v1i128, so
v1i128 for ISD::SETCC will crash.
This patch is to set v1i128 to expand to avoid crash.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D84238
This builds on 3da1a96 on the path towards supporting invokes and cross block relocations. The actual change attempts to be NFC, but does fail in one corner-case explained below.
The change itself is fairly mechanical. Rather than remember SDValues - which are inherently block local - immediately produce a virtual register copy and remember that.
Once this lands, we'll update the FunctionLoweringInfo::StatepointSpillMap map to allow register based lowerings, delete VirtRegs from StatepointLowering, and drop the restriction against cross block relocations. I deliberately separate the semantic part into it's own change for easy of understanding and fault isolation.
The corner-case which isn't quite NFC is that the old implementation implicitly CSEd gc.relocates of the same SDValue regardless of type. The new implementation still only relocates once, but it produces distinct vregs for the bitcast and it's source, whereas SelectionDAG's generic CSE was able to remove the bitcast in the old implementation. Note that the final assembly doesn't change (at least in the test), as our MI level optimizations catch the duplication.
I assert that this is an uninteresting corner-case. It's functionally correct, and if we find a case where this influences performance, we should really be canonicalizing types to i8* at the IR level.
Differential Revision: https://reviews.llvm.org/D84692
This patch implements OpenMP runtime support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
previous patch in this series implements Clang front end support.
Reviewed By: grokos
Differential Revision: https://reviews.llvm.org/D84712
This patch implements Clang front end support for the OpenMP TR8
`present` motion modifier for `omp target update` directives. The
next patch in this series implements OpenMP runtime support.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D84711
All these tests already explicitly test against both legacy PM and NPM.
$ sed -i 's/ -attributor / -attributor -enable-new-pm=0 /g' $(rg --path-separator // -l -- -passes=)
$ sed -i 's/ -attributor-cgscc / -attributor-cgscc -enable-new-pm=0 /g' $(rg --path-separator // -l -- -passes=)
Now all tests in Transforms/Attributor/ pass under NPM.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D84813
Summary:
When doing MachineVerifier for LiveVariables, the MachineVerifier pass
will calculate the LiveVariables, and compares the result with the
result livevars pass gave. If they are different, verifyLiveVariables()
will give error.
But when we calculate the LiveVariables in MachineVerifier, we don't
consider the PHI node, while livevars considers.
This patch is to fix above bug.
Reviewed By: bjope
Differential Revision: https://reviews.llvm.org/D80274
expressions.
Also "fix" the longstanding bug where the computed size depends on the
order of the visitation. We could try to predict the allocation order
used by legalization, but it would never be 100% perfect. Until we
start fixing the addresses somehow (or have a more reliable allocation
scheme later), just try to compute the size based on the worst case
padding.
Handle insertion fix-its when removing incompatible errors by introducting a new EventType `ET_Insert`
This has lower prioirty than End events, but higher than begin.
Idea being If an insert is at the same place as a begin event, the insert should be processed first to reduce unnecessary conflicts.
Likewise if its at the same place as an end event, process the end event first for the same reason.
This also fixes https://bugs.llvm.org/show_bug.cgi?id=46511.
Reviewed By: aaron.ballman
Differential Revision: https://reviews.llvm.org/D82898
In vectorizeChainsInBlock we try to collect chains of PHI nodes
that have the same element type, but the code is relying upon
the implicit conversion from TypeSize -> uint64_t. For now, I have
modified the code to ignore PHI nodes with scalable types.
Differential Revision: https://reviews.llvm.org/D83542
TODO
* PrintIRInstrumentation and TimePassesHandler would be using this new callback.
* "Running pass" logging will also be moved to use this callback.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D84772
It was unclear what `isa` was supposed to mean so we did not provide any
traits for this context selector. With this patch we will allow *any*
string or identifier. We use the target attribute and target info to
determine if the trait matches. In other words, we will check if the
provided value is a target feature that is available (at the call site).
Fixes PR46338
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D83281
In MachineCopyPropagation::BackwardPropagatableCopy(),
a check is added for multiple destination registers.
The copy propagation is avoided if the copied destination register
is the same register as another destination on the same instruction.
A new test is added. This used to fail on ARM like this:
error: unpredictable instruction, RdHi and RdLo must be different
umull r9, r9, lr, r0
Reviewed By: lkail
Differential Revision: https://reviews.llvm.org/D82638
Not a bug that is ever likely to materialise, but still worth fixing
Reviewed By: DmitryPolukhin
Differential Revision: https://reviews.llvm.org/D84850
This commit is part of a greater project which aims to add
full end-to-end support for convolutions inside mlir. The
reason behind having conv ops for each rank rather than
having one generic ConvOp is to enable better optimizations
for every N-D case which reflects memory layout of input/kernel
buffers better and simplifies code as well. We expect plain linalg.conv
to be progressively retired.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D83879
The previous fix for this, https://reviews.llvm.org/D76761, Passed test cases but failed in the real world as std::string has a non trivial destructor so creates a CXXBindTemporaryExpr.
This handles that shortfall and updates the test case std::basic_string implementation to use a non trivial destructor to reflect real world behaviour.
Reviewed By: gribozavr2
Differential Revision: https://reviews.llvm.org/D84831
This patch introduces 2 new address spaces in OpenCL: global_device and global_host
which are a subset of a global address space, so the address space scheme will be
looking like:
```
generic->global->host
->device
->private
->local
constant
```
Justification: USM allocations may be associated with both host and device memory. We
want to give users a way to tell the compiler the allocation type of a USM pointer for
optimization purposes. (Link to the Unified Shared Memory extension:
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_unified_shared_memory.asciidoc)
Before this patch USM pointer could be only in opencl_global
address space, hence a device backend can't tell if a particular pointer
points to host or device memory. On FPGAs at least we can generate more
efficient hardware code if the user tells us where the pointer can point -
being able to distinguish between these types of pointers at compile time
allows us to instantiate simpler load-store units to perform memory
transactions.
Patch by Dmitry Sidorov.
Reviewed By: Anastasia
Differential Revision: https://reviews.llvm.org/D82174
When an instrinsic function is declared in a type declaration statement
we need to set the INTRINSIC attribute and (per 8.2(3)) ignore the
specified type.
To simplify the check, add IsIntrinsic utility to BaseVisitor.
Also, intrinsics and external procedures were getting assigned a size
and offset and they shouldn't be.
Differential Revision: https://reviews.llvm.org/D84702
This patch teaches SCEVExpander to directly preserve LCSSA.
As it is currently, SCEV does not look through PHI nodes in loops,
as it might break LCSSA form. Once SCEVExpander can preserve
LCSSA form, it should be safe for SCEV to look through PHIs.
To preserve LCSSA form, this patch uses formLCSSAForInstructions
on operands of newly created instructions, if the definition is inside
a different loop than the new instruction.
The final value we return from expandCodeFor may also need LCSSA
phis, depending on the insert point. As no user for it exists there yet,
create a temporary instruction at the insert point, which can be passed
to formLCSSAForInstructions. This temporary instruction is removed
after LCSSA construction.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D71538
The lowering does not support all types for its source operations. This change
makes the patterns fail in a well-defined manner.
Differential Revision: https://reviews.llvm.org/D84443
There's a slight difference in functionality with the new CHECK lines:
before, we allowed either -0.0 or 0.0 for maxnum/minnum. That matches
the definition, but we should always get a deterministic result from
constant folding within the compiler, so now we assert that we got
the single expected result in all cases.