Commit Graph

35057 Commits

Author SHA1 Message Date
dfukalov aa77232a63 [NFC][AMDGPU] Improve fused fmul+fadd tests.
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D84903
2020-07-31 04:00:09 +03:00
Eli Friedman 7e88efa7c5 [LegalizeTypes][SVE] Support widen/split legalization for SPLAT_VECTOR
Just the obvious implementation that rewrites the result type. Also fix
a warning from EXTRACT_SUBVECTOR legalization that triggers on the test.

Differential Revision: https://reviews.llvm.org/D84706
2020-07-30 16:17:45 -07:00
Amara Emerson 09f9f7dd1b [AArch64][GlobalISel] Add legalization & selection support for G_INTRINSIC_LRINT.
Differential Revision: https://reviews.llvm.org/D84552
2020-07-30 16:14:56 -07:00
Matt Arsenault e56e9022bc AMDGPU: Fix liveness errors when copying AGPR tuples
Avoid recursively calling copyPhysReg for AGPR handling. The recursion
was dropping the super register implicit defs needed to avoid liveness
verifier errors.
2020-07-30 18:13:04 -04:00
Jon Roelofs afae6d97fa [SelectionDAG] Fix lowering of vector geps
This fixes an assertion failure that was being triggered in
SelectionDAG::getZeroExtendInReg(), where it was trying to extend the <2xi32>
to i64 (which should have been <2xi64>).

Fixes: rdar://66016901

Differential Revision: https://reviews.llvm.org/D84884
2020-07-30 14:56:53 -06:00
Wouter van Oortmerssen ce1eb7af9d [WebAssembly] Fixed 64-bit indices in br_table
LLVM SelectionDAG assumes "switch" indices are pointer sized, which causes problems for our 32-bit br_table. The new function ensures 32-bit operands don't get unnecessarily extended, and 64-bit operands get truncated.
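
A minimal sketch of that normalization, with a hypothetical helper name (the landed function and its exact API use may differ):

```
// Hypothetical helper (illustrative, not the in-tree code): normalize a
// switch index to the i32 that br_table expects, truncating 64-bit
// operands and leaving 32-bit operands alone so no extension is inserted.
static SDValue normalizeBrTableIndex(SelectionDAG &DAG, const SDLoc &DL,
                                     SDValue Idx) {
  if (Idx.getValueType() == MVT::i64)
    return DAG.getNode(ISD::TRUNCATE, DL, MVT::i32, Idx); // i32.wrap_i64
  return Idx; // already 32-bit
}
```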

Note that the changes to the existing test verify exactly that: the addition of -NEXT in 2 places ensures no extension is inserted (which the test previously ignored) and that the wrap is present (previously omitted in wasm64 mode).

Differential Revision: https://reviews.llvm.org/D84705
2020-07-30 10:52:16 -07:00
Stanislav Mekhanoshin 5b32518f96 [AMDGPU] Do not use undef on indirect source
We were using undef on the indirect move source subreg and then
using the implicit super-reg. This creates a problem in RA when
Greedy decides to split the register. It reassigns the implicit
super-reg but does not bother to change the undef source because
it really does not matter. The fix is to stop lying to RA and
drop the undef flag.

This also hit a problem in SIFoldOperands, as it can now fold an
immediate into an indirect move since there is no undef flag
anymore. That resulted in multiple test failures, so a check was
added for this case.

Differential Revision: https://reviews.llvm.org/D84899
2020-07-30 10:41:59 -07:00
hsmahesha 33fd4a18e7 [AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic
Get rid of all fixmes and base the heuristic on `num-clustered-dwords`. The main intuition behind this is as
follows. The existing heuristic can be roughly summarized as below:

* Assume all the mem ops participating in the clustering process load/store the same number of bytes
* If the num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes
* If the num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes
* If the num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes

So, we need to make sure that the new heuristic does not completely deviate from the above one, and that it
properly handles both the sub-word loads and the wide loads.
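
As a rough illustration, the byte budget above can be sketched as a standalone helper (hypothetical, not the in-tree code):

```
// Hypothetical sketch of the budget described above: the wider each mem
// op, the fewer ops get clustered, but the total byte budget grows.
static unsigned maxMemOpsToCluster(unsigned BytesPerMemOp) {
  if (BytesPerMemOp <= 4)
    return 5; // at max 20 bytes
  if (BytesPerMemOp <= 8)
    return 3; // at max 24 bytes
  return 2;   // 16-byte mem ops: at max 32 bytes
}
```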

Reviewed By: arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D84354
2020-07-30 21:41:13 +05:30
Brendon Cahoon 7b114446c3 Align store conditional address
In cases where the alignment of the data type is smaller than
expected by the instruction, the address is aligned. The aligned
address was used for the load, but not for the store
conditional, which resulted in a run-time alignment exception.
2020-07-30 10:42:00 -05:00
Matt Arsenault b8c8d1b309 AMDGPU: Convert some tests to use new buffer intrinsics
The legacy buffer intrinsics (the ones that are neither the struct nor
the raw forms) should now all be consolidated into the tests
specifically for those intrinsics.
2020-07-30 10:30:43 -04:00
Jinsong Ji dab8d6104b [PowerPC][AIX] Move the testcase to proper dir 2020-07-30 14:25:59 +00:00
jasonliu 04dc9691eb [XCOFF][AIX] Enable -ffunction-sections
Summary:
This patch implements -ffunction-sections on AIX.
This patch focuses on assembly generation.
Follow-on patch needs to handle:
1. -ffunction-sections implication for jump table.
2. Object file generation path and associated testing.

Differential Revision: https://reviews.llvm.org/D83875
2020-07-30 13:30:01 +00:00
Simon Pilgrim 2dec72ba5c [X86][SSE] combineExtractWithShuffle - extend extract(truncate(x),0) for any source vector size
As long as we can extract the lowest 128-bit subvector from the pre-truncated source vector, we don't care what size it is.

The next stage will be to support non-zero extraction indices, as long as it's still coming from the lowest 128-bit subvector.
2020-07-30 12:27:49 +01:00
Florian Hahn 44a4ba859d [AArch64] Add machine-combiner tests with instruction level FMFs. 2020-07-30 11:41:09 +01:00
Esme-Yi 141b64a340 [NFC] Add failing cases for some patterns defined in DAGCombiner.cpp 2020-07-30 10:05:04 +00:00
Sam Tebbs 276ed5f7e4 [DAGCombiner] Fold sext_inreg of a masked load into a sign extended masked load
This patch adds a DAG combine fold for a sext(masked_load) into a sign extended masked load.
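
A hedged sketch of the combine's shape (simplified; the in-tree patch performs additional legality and profitability checks):

```
// Sketch: fold sext_inreg (masked_load x) -> sign-extending masked load.
static SDValue foldSextInRegOfMaskedLoad(SDNode *N, SelectionDAG &DAG) {
  SDValue Src = N->getOperand(0);
  EVT InnerVT = cast<VTSDNode>(N->getOperand(1))->getVT();
  auto *MLd = dyn_cast<MaskedLoadSDNode>(Src);
  if (!MLd || !Src.hasOneUse() || MLd->getMemoryVT() != InnerVT)
    return SDValue();
  // Re-emit the load as a sign-extending masked load, which makes the
  // separate sign_extend_inreg node redundant.
  SDValue ExtLd = DAG.getMaskedLoad(
      N->getValueType(0), SDLoc(N), MLd->getChain(), MLd->getBasePtr(),
      MLd->getOffset(), MLd->getMask(), MLd->getPassThru(),
      MLd->getMemoryVT(), MLd->getMemOperand(), MLd->getAddressingMode(),
      ISD::SEXTLOAD, MLd->isExpandingLoad());
  DAG.ReplaceAllUsesOfValueWith(SDValue(MLd, 1), ExtLd.getValue(1));
  return ExtLd;
}
```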

Differential Revision: https://reviews.llvm.org/D84332
2020-07-30 10:34:02 +01:00
Kang Zhang 0037a5f894 [PHIElimination] Fix the killed flag for LowerPHINode()
Summary:
In the phi-node-elimination pass, we set the killed flag incorrectly.
When we eliminate the PHI node, we replace the PHI with a copy of the
incoming value.

Before this patch, we would mark the incoming value as killed at the new
copy (PHICopy) and remove the killed flag from the last use of the
incoming value (OldKill). This is correct only if the new PHICopy comes
after the OldKill.

Reviewed By: bjope

Differential Revision: https://reviews.llvm.org/D80886
2020-07-30 08:18:50 +00:00
David Sherwood 23ad660b5d [SVE][CodeGen] At -O0 fallback to DAG ISel when translating alloca with scalable types
When building code at -O0 we weren't falling back to DAG ISel correctly
when encountering alloca instructions with scalable vector types. This
is because the alloca has no operands that are scalable. I've fixed this by
adding a check in AArch64ISelLowering::fallBackToDAGISel for alloca
instructions with scalable types.
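
The shape of that check, as a hedged standalone sketch (the real code lives inside AArch64TargetLowering::fallBackToDAGISel):

```
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: an alloca's operands are never scalable, so scanning operand
// types misses it; inspect the allocated type directly instead.
static bool allocaNeedsDAGFallback(const Instruction &Inst) {
  if (const auto *AI = dyn_cast<AllocaInst>(&Inst))
    return isa<ScalableVectorType>(AI->getAllocatedType());
  return false;
}
```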

Differential Revision: https://reviews.llvm.org/D84746
2020-07-30 08:40:53 +01:00
Kang Zhang a18953c1c0 [PowerPC] Fix RM operands for some instructions
Summary:
Some instructions set the wrong [RM] flag; this patch fixes them.

Instructions x(v|s)r(d|s)pi[zmp]? and fri[npzm] use fixed rounding
directions without referencing the current rounding mode.

The SETRNDi, SETRND, BCLRn, MTFSFI, MTFSB0, MTFSB1, MTFSFb,
MTFSFI_rec, MTFSF, and MTFSF_rec instructions should also have their
RM flags fixed.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D81360
2020-07-30 02:10:49 +00:00
Matt Arsenault 66c572af55 GlobalISel: Handle assorted no-op intrinsics
SelectionDAGBuilder just drops these, so do the same.
2020-07-29 21:26:20 -04:00
Matt Arsenault 0da582d9b6 GlobalISel: Handle llvm.roundeven
I still think it's highly questionable that we have two intrinsics
with identical behavior that vary only by the name of the libcall used
if they happen to be lowered that way, but try to reduce the feature
delta between SDAG and GlobalISel for recently added intrinsics. I'm
not sure which opcode should be considered the canonical one, but
lower roundeven back to round.
2020-07-29 20:01:12 -04:00
Philip Reames 755f91f12c [Statepoint] Enable cross block relocates w/vreg lowering
This change is mechanical; it just removes the restriction and updates tests.  The key building blocks were submitted in 31342eb and 8fe2abc.

Note that this (and preceding changes) entirely subsumes D83965.  I did include a couple of its tests.

From the codegen changes, an interesting observation: this doesn't actually reduce spilling, it just lets the register allocator do its job.  That results in a slightly different overall result which has both pros and cons over the eager spill lowering.  (i.e. we'll have some perf tuning to do once this is stable.)
2020-07-29 13:32:51 -07:00
Simon Pilgrim a1c9529e60 [X86][AVX] isHorizontalBinOp - relax no-lane-crossing limit for AVX1-only targets.
Instead of never accepting v8f32/v4f64 FHADD/FHSUB if the input shuffle masks cross lanes, perform the matching and determine if the post shuffle mask simplifies to a 'whole lane shuffle' mask - in which case we are guaranteed to cheaply perform this as a VPERM2F128 shuffle.
2020-07-29 20:49:10 +01:00
Philip Reames e980913831 [Tests] Split a file for ease of update 2020-07-29 12:45:04 -07:00
Stanislav Mekhanoshin 13b63be472 [AMDGPU] prefer non-mfma in post-RA schedule
MFMA instructions shall not be scheduled back to back, to avoid an
MAI SIMD stall. Tell the post-RA scheduler that we would
prefer some other instruction instead.

Differential Revision: https://reviews.llvm.org/D84883
2020-07-29 12:17:50 -07:00
Baptiste Saleil 7aaa85627b [PowerPC] Add options to control paired vector memops support
Adds frontend and backend options to enable and disable the
PowerPC paired vector memory operations added in ISA 3.1.
Instructions using these options will be added in subsequent patches.

Differential Revision: https://reviews.llvm.org/D83722
2020-07-29 14:00:53 -05:00
Amara Emerson 0c0e36061a [GlobalISel] Add G_INTRINSIC_LRINT and translate from llvm.lrint
Differential Revision: https://reviews.llvm.org/D84551
2020-07-29 11:51:04 -07:00
Amara Emerson d8ba622209 [AArch64][GlobalISel] Selection support for vector DUP[X]lane instructions.
In future, we'd like to use the perfect-shuffle mechanism to deal with these
shuffle permutations. For now, this improves performance by avoiding the
super-expensive const-pool load + tbl instruction.

Differential Revision: https://reviews.llvm.org/D84866
2020-07-29 11:41:37 -07:00
Matt Arsenault 59fac51ff2 AMDGPU/GlobalISel: Handle llvm.amdgcn.reloc.constant 2020-07-29 14:24:21 -04:00
Matt Arsenault 0b7de7966f GlobalISel: Implement lower for G_EXTRACT_VECTOR_ELT
Use the basic store to stack and reload.
2020-07-29 14:16:28 -04:00
Jessica Paquette 7ff9575594 [AArch64][GlobalISel] Select XRO addressing mode with wide immediates
Port the wide immediate case from AArch64DAGToDAGISel::SelectAddrModeXRO.

If we have a wide immediate which can't be represented in an add, we can end up
with code like this:

```
mov  x0, imm
add x1, base, x0
ldr  x2, [x1, 0]
```

If we use the [base, xN] addressing mode instead, we can produce this:

```
mov  x0, imm
ldr  x2, [base, x0]
```

This saves 0.4% code size on 7zip at -O3, and gives a geomean code size
improvement of 0.1% on CTMark.

Differential Revision: https://reviews.llvm.org/D84784
2020-07-29 11:02:10 -07:00
Matt Arsenault 766cb615a3 AMDGPU: Relax restriction on folding immediates into physregs
I never completed the work on the patches referenced by
f8bf7d7f42, but this was intended to
avoid folding immediate writes into m0 which the coalescer doesn't
understand very well. Relax this to allow simple SGPR immediates to
fold directly into VGPR copies. This pattern shows up routinely in
current GlobalISel code since nothing is smart enough to emit VGPR
constants yet.
2020-07-29 14:01:53 -04:00
Heejin Ahn 276f9e8cfa [WebAssembly] Fix getBottom for loops
When it was first created, CFGSort only made sure BBs in each
`MachineLoop` are sorted together. After we added exception support,
CFGSort now also sorts BBs in each `WebAssemblyException`, which
represents a `catch` block, together, and the
`Region` class was introduced as a thin wrapper for both
`MachineLoop` and `WebAssemblyException`.

But how we compute those loops and exceptions is different.
`MachineLoopInfo` is constructed using the standard loop computation
algorithm in LLVM; the definition of loop is "a set of BBs that are
dominated by a loop header and have a path back to the loop header". So
even if some BBs are semantically contained by a loop in the original
program, or in other words dominated by a loop header, if they don't
have a path back to the loop header, they are not considered a part of
the loop. For example, if a BB is dominated by a loop header but
contains `call abort()` or `rethrow`, it wouldn't have a path back to
the header, so it is not included in the loop.

But `WebAssemblyException` is a wasm-specific data structure, and its
algorithm is simple: a `WebAssemblyException` consists of an EH pad and
all BBs dominated by the EH pad. So this scenario is possible: (This is
also the situation in the newly added test in cfg-stackify-eh.ll)

```
Loop L: header, A, ehpad, latch
Exception E: ehpad, latch, B
```
(B contains `abort()`, so it does not have a path back to the loop
header, so it is not included in L.)

And it is sorted in this order:
```
header
A
ehpad
latch
B
```

And when CFGStackify places `end_loop` or `end_try` markers, it
previously used `WebAssembly::getBottom()`, which returns the latest BB
in the sorted order, and placed the marker there. So in this case the
marker placements will be like this:
```
loop
  header
  try
    A
  catch
    ehpad
    latch
end_loop         <-- misplaced!
    B
  end_try
```
in which the nesting between the loop and the exception is not correct.
The `end_loop` marker has to be placed after `B`, and also after `end_try`.
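
With a correct `getBottom()`, the expected nesting would instead look like this (sketch):

```
loop
  header
  try
    A
  catch
    ehpad
    latch
    B
  end_try
end_loop
```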

Maybe the fundamental way to solve this problem is to come up with our
own algorithm for computing the loop region too, in which we include all
BBs dominated by a loop header in the loop. But that takes a lot more
effort. The only thing we actually need to fix is `getBottom()`. If we
make it return the right BB, which for a loop means the latest BB of the
loop itself and all exceptions contained in it, we are good.
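
A hedged sketch of that computation (simplified; `WEI` and the accessor names here are assumptions, not the exact in-tree code):

```
// Sketch: the bottom of a loop must also cover the bottoms of all
// exceptions rooted at EH pads inside the loop, because those exceptions
// can contain BBs that are not part of the MachineLoop itself.
MachineBasicBlock *SortRegionInfo::getBottom(const MachineLoop *ML) {
  MachineBasicBlock *Bottom = ML->getHeader();
  for (MachineBasicBlock *MBB : ML->blocks()) {
    if (MBB->getNumber() > Bottom->getNumber())
      Bottom = MBB;
    if (MBB->isEHPad()) { // hypothetical: extend to the exception's bottom
      if (const auto *WE = WEI.getExceptionFor(MBB)) {
        MachineBasicBlock *ExBottom = getBottom(WE);
        if (ExBottom->getNumber() > Bottom->getNumber())
          Bottom = ExBottom;
      }
    }
  }
  return Bottom;
}
```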

This renames `Region` and `RegionInfo` to `SortRegion` and
`SortRegionInfo` and extracts them into their own file. It also adds
`getBottom` to the `SortRegionInfo` class, from which it can access
`WebAssemblyExceptionInfo`, so that it can compute a correct bottom
block for loops.

Reviewed By: dschuff

Differential Revision: https://reviews.llvm.org/D84724
2020-07-29 10:36:32 -07:00
Craig Topper c4823b24a4 [X86] Add custom lowering for llvm.roundeven with sse4.1.
We can use the roundss/sd/ps/pd instructions like we do for
ceil/floor/trunc/rint/nearbyint.
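
For reference, the nearest-even rounding these instructions provide, exercised from intrinsics (an illustration of the target instruction, not the lowering code itself):

```
#include <smmintrin.h> // SSE4.1

// roundps with the nearest-even rounding control: the same instruction
// family already used when lowering ceil/floor/trunc/rint/nearbyint.
__m128 roundeven_ps(__m128 x) {
  return _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
```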

Differential Revision: https://reviews.llvm.org/D84592
2020-07-29 10:23:08 -07:00
Simon Pilgrim fdc902774e [DAG][AMDGPU][X86] Add SimplifyMultipleUseDemandedBits handling for SIGN/ZERO_EXTEND + SIGN/ZERO_EXTEND_VECTOR_INREG
Peek through multiple use ops like we already do for ANY_EXTEND/ANY_EXTEND_VECTOR_INREG

Differential Revision: https://reviews.llvm.org/D84863
2020-07-29 18:10:59 +01:00
Kang Zhang 802c043078 [PowerPC] Set v1i128 to expand for SETCC to avoid crash
Summary:
PPC only supports instruction selection for v16i8, v8i16, v4i32,
v2i64, v4f32 and v2f64 for ISD::SETCC; the v1i128 type is not
supported, so ISD::SETCC on v1i128 will crash.

This patch sets v1i128 to expand to avoid the crash.
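
The shape of the fix, as a hedged one-line sketch using the usual setOperationAction idiom (assumed placement in PPCISelLowering.cpp's target setup):

```
// Sketch: expand SETCC on v1i128 instead of attempting (unsupported)
// instruction selection for it.
setOperationAction(ISD::SETCC, MVT::v1i128, Expand);
```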

Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D84238
2020-07-29 16:39:27 +00:00
Philip Reames 31342eb63e [Statepoint] When using the tied def lowering, unconditionally use vregs [almost NFC]
This builds on 3da1a96 on the path towards supporting invokes and cross block relocations. The actual change attempts to be NFC, but does fail in one corner-case explained below.

The change itself is fairly mechanical. Rather than remember SDValues - which are inherently block local - immediately produce a virtual register copy and remember that.

Once this lands, we'll update the FunctionLoweringInfo::StatepointSpillMap map to allow register based lowerings, delete VirtRegs from StatepointLowering, and drop the restriction against cross block relocations. I deliberately separate the semantic part into its own change for ease of understanding and fault isolation.

The corner-case which isn't quite NFC is that the old implementation implicitly CSEd gc.relocates of the same SDValue regardless of type. The new implementation still only relocates once, but it produces distinct vregs for the bitcast and its source, whereas SelectionDAG's generic CSE was able to remove the bitcast in the old implementation. Note that the final assembly doesn't change (at least in the test), as our MI level optimizations catch the duplication.

I assert that this is an uninteresting corner-case. It's functionally correct, and if we find a case where this influences performance, we should really be canonicalizing types to i8* at the IR level.

Differential Revision: https://reviews.llvm.org/D84692
2020-07-29 09:23:52 -07:00
Kang Zhang a4ade9ed21 [MachineVerifier] Handle the PHI node for verifyLiveVariables()
Summary:
When verifying LiveVariables, the MachineVerifier pass
calculates the LiveVariables and compares the result with the
result the livevars pass gave. If they are different, verifyLiveVariables()
will report an error.

But when we calculate the LiveVariables in MachineVerifier, we don't
consider PHI nodes, while livevars does.

This patch fixes the above bug.

Reviewed By: bjope

Differential Revision: https://reviews.llvm.org/D80274
2020-07-29 15:43:47 +00:00
Matt Arsenault d42c7b2211 AMDGPU: Account for the size of LDS globals used through constant expressions.

Also "fix" the longstanding bug where the computed size depends on the
order of the visitation. We could try to predict the allocation order
used by legalization, but it would never be 100% perfect. Until we
start fixing the addresses somehow (or have a more reliable allocation
scheme later), just try to compute the size based on the worst case
padding.
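
A hedged sketch of a worst-case estimate of that kind (illustrative only; not the exact in-tree computation):

```
#include <algorithm>
#include <cstdint>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
using namespace llvm;

// Hypothetical helper: upper-bound the LDS footprint regardless of the
// order the globals are allocated in, by assuming each global may be
// preceded by up to (align - 1) bytes of padding.
static uint64_t worstCaseLDSSize(ArrayRef<const GlobalVariable *> Globals,
                                 const DataLayout &DL) {
  uint64_t Size = 0;
  for (const GlobalVariable *GV : Globals) {
    uint64_t Align = std::max<uint64_t>(GV->getAlignment(), 1);
    Size += DL.getTypeAllocSize(GV->getValueType()) + (Align - 1);
  }
  return Size;
}
```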
2020-07-29 11:40:42 -04:00
Simon Wallis 6a05c6bfc8 [MachineCopyPropagation] BackwardPropagatableCopy: add check for hasOverlappingMultipleDef
In MachineCopyPropagation::BackwardPropagatableCopy(),
a check is added for multiple destination registers.

The copy propagation is avoided if the copied destination register
is the same register as another destination on the same instruction.

A new test is added. This used to fail on ARM like this:
error: unpredictable instruction, RdHi and RdLo must be different
        umull   r9, r9, lr, r0

Reviewed By: lkail

Differential Revision: https://reviews.llvm.org/D82638
2020-07-29 16:21:01 +01:00
David Sherwood 2078771759 [SVE][CodeGen] Add simple integer add tests for SVE tuple types
I have added tests to:

  CodeGen/AArch64/sve-intrinsics-int-arith.ll

for doing simple integer add operations on tuple types. Since these
tests introduced new warnings due to incorrect use of
getVectorNumElements() I have also fixed up these warnings in the
same patch. These fixes are:

1. In narrowExtractedVectorBinOp I have changed the code to bail out
early for scalable vector types, since we've not yet hit a case that
proves the optimisations are profitable for scalable vectors.
2. In DAGTypeLegalizer::WidenVecRes_CONCAT_VECTORS I have replaced
calls to getVectorNumElements with getVectorMinNumElements in cases
that work with scalable vectors. For the other cases I have added
asserts that the vector is not scalable because we should not be
using shuffle vectors and build vectors in such cases.

Differential revision: https://reviews.llvm.org/D84016
2020-07-29 13:32:10 +01:00
Sjoerd Meijer 85342c27a3 [ARM] Optimize immediate selection
Optimize the selection of some specific immediates by materializing them with sub/mvn
instructions as opposed to loading them from the constant pool.

Patch by Ben Shi, powerman1st@163.com.

Differential Revision: https://reviews.llvm.org/D83745
2020-07-29 13:29:17 +01:00
Matt Arsenault c230965ccf AMDGPU: Make saturating add/sub legal for DAG path 2020-07-29 08:27:31 -04:00
Matt Arsenault cdd45d5f9c AMDGPU/GlobalISel: Select llvm.amdgcn.global.atomic.csub
Remove the custom node boilerplate. Not sure why this tried to handle
the LDS atomic stuff.
2020-07-29 08:27:31 -04:00
David Sherwood f43b5c7a76 [SVE] Add checks for no warnings in CodeGen/AArch64/sve-sext-zext.ll
Previous patches fixed up all the warnings in this test:

  llvm/test/CodeGen/AArch64/sve-sext-zext.ll

and this change simply checks that no new warnings are added in future.

Differential revision: https://reviews.llvm.org/D83205
2020-07-29 13:06:39 +01:00
Simon Pilgrim 0c005be6eb [X86][SSE] getV4X86ShuffleImm8 - canonicalize broadcast masks
If the mask input to getV4X86ShuffleImm8 only refers to a single source element (+ undefs) then canonicalize to a full broadcast.
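
As a standalone illustration of the canonicalization (hypothetical helper, not the in-tree one; -1 denotes undef):

```
#include <array>
#include <cstdint>

// Pack a 4-lane mask into the 2-bit-per-lane imm8, after canonicalizing
// a mask with one defined element (rest undef) to a full broadcast.
uint8_t broadcastShuffleImm8(std::array<int, 4> Mask) {
  int Elt = 0;
  for (int M : Mask)
    if (M >= 0)
      Elt = M; // the single defined source element
  uint8_t Imm = 0;
  for (int I = 3; I >= 0; --I)
    Imm = uint8_t((Imm << 2) | (Elt & 0x3)); // broadcast into every lane
  return Imm; // e.g. {2, -1, -1, -1} -> 0b10101010
}
```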

getV4X86ShuffleImm8 defaults to inline values for undefs, which can be useful for shuffle widening/narrowing but does leave SimplifyDemanded* calls thinking the shuffle depends on unnecessary elements.

I'm still investigating what we should do more generally to avoid these undemanded elements, but the broadcast case was a simpler win.
2020-07-29 11:32:44 +01:00
Ikhlas Ajbar d50d4c3d44 [Hexagon] Correct the order of operands when lowering funnel shift-left
This patch corrects the order of operands in the pattern that lowers fshl
in Hexagon.
2020-07-28 21:22:41 -05:00
Matt Arsenault 44211f20a8 AMDGPU: Optimize copies to exec with other insts after exec def
It's possible to have terminator instructions after a write to exec,
so skip over them to find it.
2020-07-28 21:34:50 -04:00
Matt Arsenault b6ebc77326 AMDGPU/GlobalISel: Fix selecting llvm.amdgcn.s.getreg
This introduces the same bug llvm.amdgcn.s.setreg has, where if the
user specifies an immediate outside of the valid 16-bit range, it will
select into a verifier error.
2020-07-28 21:34:50 -04:00
Thomas Lively 11bb7eef41 [WebAssembly] Remove intrinsics for SIMD widening ops
Instead, pattern match extends of extract_subvectors to generate
widening operations. Since extract_subvector is not a legal node, this
is implemented via a custom combine that recognizes extract_subvector
nodes before they are legalized. The combine produces custom ISD nodes
that are later pattern matched directly, just like the intrinsic was.
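
A hedged sketch of the combine's shape (simplified; the custom node names here are illustrative, not necessarily the ones the patch introduces):

```
// Sketch: recognize (sext/zext (extract_subvector x, Idx)) before
// legalization and emit a custom widening node that is later pattern
// matched to the SIMD widening instruction.
static SDValue performExtendCombine(SDNode *N, SelectionDAG &DAG) {
  SDValue Src = N->getOperand(0);
  if (Src.getOpcode() != ISD::EXTRACT_SUBVECTOR)
    return SDValue();
  bool IsLow = Src.getConstantOperandVal(1) == 0; // low vs. high half
  bool IsSigned = N->getOpcode() == ISD::SIGN_EXTEND;
  unsigned Op = IsSigned ? (IsLow ? WebAssemblyISD::WIDEN_LOW_S
                                  : WebAssemblyISD::WIDEN_HIGH_S)
                         : (IsLow ? WebAssemblyISD::WIDEN_LOW_U
                                  : WebAssemblyISD::WIDEN_HIGH_U);
  return DAG.getNode(Op, SDLoc(N), N->getValueType(0), Src.getOperand(0));
}
```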

Also removes the clang builtins for these operations since the
instructions can now be generated from portable code sequences.

Differential Revision: https://reviews.llvm.org/D84556
2020-07-28 18:25:55 -07:00