EFLAGS copy lowering.
If you have a branch of LLVM, you may want to cherrypick this. It is
extremely unlikely to hit this case empirically, but it will likely
manifest as an "impossible" branch being taken somewhere, and will be
... very hard to debug.
Hitting this requires complex conditions living across complex control
flow combined with some interesting memory (non-stack) initialized with
the results of a comparison. Also, because you have to arrange for an
EFLAGS copy to be in *just* the right place, almost anything you do to
the code will hide the bug. I was unable to reduce anything remotely
resembling a "good" test case from the place where I hit it, and so
instead I have constructed synthetic MIR testing that directly exercises
the bug in question (as well as the good behavior for completeness).
The issue is that we would mistakenly assume any SETcc with a valid
condition and an initial operand that was a register and a virtual
register at that to be a register *defining* SETcc...
It isn't though....
This would in turn cause us to test some other bizarre register,
typically the base pointer of some memory. Now, testing this register
and using that to branch on doesn't make any sense. It even fails the
machine verifier (if you are running it) due to the wrong register
class. But it will make it through LLVM, assemble, and it *looks*
fine... But wow do you get a very unsual and surprising branch taken in
your actual code.
The fix is to actually check what kind of SETcc instruction we're
dealing with. Because there are a bunch of them, I just test the
may-store bit in the instruction. I've also added an assert for sanity
that ensure we are, in fact, *defining* the register operand. =D
llvm-svn: 338481
we aren't incorrectly generating any of it when doing SLH.
There was a bug that only occured with SLH that very much looked like it
could be caused by bad unwind info, and so this was a prime suspect.
Turns out that everything is fine, but this way we'll *see* if we end
up, for example, putting things we shouldn't inside the prolog.
llvm-svn: 338480
Previously we were just visiting the blocks in the function in IR order, which
is rather arbitrary. Therefore we wouldn't always visit defs before uses, but
the translation code relies on this assumption in some places.
Only codegen change seen in tests is an elision of a redundant copy.
Fixes PR38396
llvm-svn: 338476
Disable ARMCodeGenPrepare by default again. It is causing verifier
failues in V8 that look like:
Duplicate integer as switch case
switch i32 %trunc, label %if.end13 [
i32 0, label %cleanup36
i32 0, label %if.then8
], !dbg !4981
i32 0
fatal error: error in backend: Broken function found, compilation aborted!
I will continue reducing the test case and send it along.
llvm-svn: 338452
When lowering calling conventions, prefer to decompose vectors
into the constitute register types. This avoids artifical constraints
to satisfy a wide super-register.
This improves code quality because now optimizations don't need to
deal with the super-register constraint. For example the immediate
folding code doesn't deal with 4 component reg_sequences, so by
breaking the register down earlier the existing immediate folding
code is able to work.
This also avoids the need for the shader input processing code
to manually split vector types.
llvm-svn: 338416
As was done for vector rotations, we can efficiently use ISD::MULHU for vXi8/vXi16 ISD::SRL lowering.
Shift-by-zero cases are still problematic (mainly on v32i8 due to extra AND/ANDN/OR or VPBLENDVB blend masks but v8i16/v16i16 aren't great either if PBLENDW fails) so I've limited this first patch to known non-zero cases if we can't easily use PBLENDW.
Differential Revision: https://reviews.llvm.org/D49562
llvm-svn: 338407
Summary:
Similar to D49636, but for PMADDUBSW. This instruction has the additional complexity that the addition of the two products saturates to 16-bits rather than wrapping around. And one operand is treated as signed and the other as unsigned.
A C example that triggers this pattern
```
static const int N = 128;
int8_t A[2*N];
uint8_t B[2*N];
int16_t C[N];
void foo() {
for (int i = 0; i != N; ++i)
C[i] = MIN(MAX((int16_t)A[2*i]*(int16_t)B[2*i] + (int16_t)A[2*i+1]*(int16_t)B[2*i+1], -32768), 32767);
}
```
Reviewers: RKSimon, spatel, zvi
Reviewed By: RKSimon, zvi
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D49829
llvm-svn: 338402
This commit fixes two issues with the liveness information after the
call:
1) The code always spills RCX and RDX if InProlog == true, which results
in an use of undefined phys reg.
2) FinalReg, JoinReg, RoundedReg, SizeReg are not added as live-ins to
the basic blocks that use them, therefore they are seen undefined.
https://llvm.org/PR38376
Differential Revision: https://reviews.llvm.org/D50020
llvm-svn: 338400
We could choose a free 0 for this, but this
matches the behavior for fmul undef, 1.0. Also,
the NaN use is more useful for folding use operations
although if it's not eliminated it is more expensive
in terms of code size.
llvm-svn: 338376
Since z13, the max group size will be 2 if any μop has more than 3 register
sources.
This has been ignored sofar in the SystemZHazardRecognizer, but is now
handled by recognizing those instructions and adjusting the tracking of
decoding and the cost heuristic for grouping.
Review: Ulrich Weigand
https://reviews.llvm.org/D49847
llvm-svn: 338368
In one place we checked X86Subtarget.slowLEA() to decide if the pass should run. But to decide what the pass should we only check isSLM. This resulted in Goldmont going down the Bonnell path.
llvm-svn: 338342
Also refactors some existing code to materialize addresses for the large code
model so it can be shared between G_GLOBAL_VALUE and G_BLOCK_ADDR.
This implements PR36390.
Differential Revision: https://reviews.llvm.org/D49903
llvm-svn: 338337
This is exchanging a sub-of-1 with add-of-minus-1:
https://rise4fun.com/Alive/plKAH
This is another step towards improving select-of-constants codegen (see D48970).
x86 is the motivating target, and those diffs all appear to be wins. PPC and AArch64 look neutral.
I've limited this to early combining (!LegalOperations) in case a target wants to reverse it, but
I think canonicalizing to 'add' is more likely to produce further transforms because we have more
folds for 'add'.
Differential Revision: https://reviews.llvm.org/D49924
llvm-svn: 338317
This teaches the outliner to save LR to a register rather than the stack when
possible. This allows us to avoid bumping the stack in outlined functions in
some cases. By doing this, in a later patch, we can teach the outliner to do
something like this:
f1:
...
bl OUTLINED_FUNCTION
...
f2:
...
move LR's contents to a register
bl OUTLINED_FUNCTION
move the register's contents back
instead of falling back to saving LR in both cases.
llvm-svn: 338278
Previously, I thought this was a Windows failure. Then I realized it failed on
every bot that used the verifier. This makes it use the verifier always, and
adds that pass to the pipeline checks so that it's consistent across all bots.
llvm-svn: 338272
Summary:
Attempt to extract a shrl from a udiv or a shl from a mul if this allows a rotate to be formed. This targets cases where the input to a rotate pattern was a mul or udiv by a constant and InstCombine merged one of the shifts with the op.
Patch by: sameconrad (Sam Conrad)
Reviewers: RKSimon, craig.topper, spatel, lebedev.ri, javed.absar
Reviewed By: lebedev.ri
Subscribers: efriedma, kparzysz, llvm-commits
Differential Revision: https://reviews.llvm.org/D47681
llvm-svn: 338270
This reapplies commit r338206 reverted by r338214 since the bug that
r338206 uncovered has been fixed in r338268.
Add support for inline assembly with matching input operand that do not
naturally go in the register class it is constrained to (eg. double in a
32-bit GPR). Note that regular input is already handled by existing
code.
llvm-svn: 338269
It seems like the pass pipeline on Windows is slightly different than on Linux
and macOS. As a result, the arm64-opt-remarks-lazy-bfi test has been failing.
This switches a CHECK-NEXT to a CHECK-DAG to try and get this running properly
again.
It'd be nice to switch it back to a CHECK-NEXT if possible, but the CHECK-NEXT
lines following the line we care about (the optimization remark emitter)
do a pretty good job of enforcing the ordering we want.
Hopefully this works, since I don't have a Windows machine. ;)
Example failure: http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-win/builds/11295
llvm-svn: 338267
The machine verifier asserts with:
Assertion failed: (isMBB() && "Wrong MachineOperand accessor"), function getMBB, file ../include/llvm/CodeGen/MachineOperand.h, line 542.
It calls analyzeBranch which tries to call getMBB if the opcode is
JMP_1, but in this case we do:
JMP_1 @OUTLINED_FUNCTION
I believe we have to use TAILJMPd64 instead of JMP_1 since JMP_1 is used
with brtarget8.
Differential Revision: https://reviews.llvm.org/D49299
llvm-svn: 338237
Summary:
These instructions interact with hardware blocks outside the shader core,
and they can have "scalar" side effects even when EXEC = 0. We don't
want these scalar side effects to occur when all lanes want to skip
these instructions, so always add the execz skip branch instruction
for basic blocks that contain them.
Also ensure that we skip scalar stores / atomics, though we don't
code-gen those yet.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48431
Change-Id: Ieaeb58352e2789ffd64745603c14970c60819d44
llvm-svn: 338235
Code in `CC_ARM_AAPCS_Custom_Aggregate()` is responsible for handling
homogeneous aggregates for `CC_ARM_AAPCS_VFP`. When an aggregate ends up
fully on stack, the function tries to pack all resulting items of the
aggregate as tightly as possible according to AAPCS.
Once the first item was laid out, the alignment used for consecutive
items was the size of one item. This logic went wrong for 128-bit
vectors because their alignment is normally only 64 bits, and so could
result in inserting unexpected padding between the first and second
element.
The patch fixes the problem by updating the alignment with the item size
only if this results in reducing it.
Differential Revision: https://reviews.llvm.org/D49720
llvm-svn: 338233
Add support for inline assembly with matching input operand that do not
naturally go in the register class it is constrained to (eg. double in a
32-bit GPR). Note that regular input is already handled by existing
code.
llvm-svn: 338206
SelectionDAGBuilder widens v3i32/v3f32 arguments to
to v4i32/v4f32 which consume an additional register.
In addition to wasting argument space, this produces extra
instructions since now it appears the 4th vector component has
a meaningful value to most combines.
llvm-svn: 338197
Summary:
Moved Explicit Locals pass to last.
Made that pass obligatory.
Made it convert from register to stack based instructions, and removed the registers.
Fixes to related code that was expecting register based instructions.
Added the correct testing flag to all tests, depending on what the
format they were expecting so far.
Translated one test to stack format as example: reg-stackify-stack.ll
tested:
llvm-lit -v `find test -name WebAssembly`
unittests/MC/*
Reviewers: dschuff, sunfish
Subscribers: sbc100, jgravelle-google, eraman, aheejin, llvm-commits
Differential Revision: https://reviews.llvm.org/D49160
llvm-svn: 338164
Fixed the ASAN failure from before in r338148, so recommiting.
This patch enables the MachineOutliner by default in AArch64 under -Oz.
The MachineOutliner offers around a 4.5% improvement on the current -Oz code
size improvements.
We have done work into improving the debuggability of outlined code, so that
users of -Oz won't be surprised by the optimization. We have also been executing
the LLVM test suite and common external tests such as the SPEC suites
continuously with no issue. The outliner has a low compile-time overhead of
roughly 1%. At this point, the outliner would be a really good addition to the
-Oz pass pipeline!
llvm-svn: 338160
The tests with a constant sub operand were added with rL338143,
but the potential transform doesn't have that requirement, so
adding more tests with variable operands.
llvm-svn: 338150
This feature enables the fusion of such operations on Cortex A57 and Cortex
A72, as recommended in their Software Optimisation Guides, sections 4.14 and
4.11, respectively.
Differential revision: https://reviews.llvm.org/D49563
llvm-svn: 338147