Commit Graph

Francesco Petrogalli 4dde9e9b02 [llvm][CodeGen] IR intrinsics for SVE2 contiguous conflict detection instructions.
Summary:
The IR intrinsics are mapped to the following SVE2 instructions:

* WHILERW <Pd>.<T>, <Xn>, <Xm>
* WHILEWR <Pd>.<T>, <Xn>, <Xm>

The intrinsics introduced in this patch are the IR counterpart of the
SVE ACLE functions `svwhilerw` and `svwhilewr` (all data type
variants).

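A minimal IR sketch of how these intrinsics are expected to be used (the
exact type-mangling suffix in the intrinsic name is an assumption based
on the usual SVE naming conventions):

    define <vscale x 16 x i1> @while_rw(i8* %a, i8* %b) {
      ; read-after-write conflict check between the two pointers;
      ; expected to lower to: whilerw p0.b, x0, x1
      %p = call <vscale x 16 x i1> @llvm.aarch64.sve.whilerw.b.nxv16i1(i8* %a, i8* %b)
      ret <vscale x 16 x i1> %p
    }

    declare <vscale x 16 x i1> @llvm.aarch64.sve.whilerw.b.nxv16i1(i8*, i8*)
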
Patch by Maciej Gąbka <maciej.gabka@arm.com>.

Reviewers: kmclaughlin, rengolin

Reviewed By: kmclaughlin

Subscribers: tschuett, kristof.beyls, hiraditya, danielkiss, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75862
2020-03-11 18:28:02 +00:00
Stanislav Mekhanoshin 9801e5469b [AMDGPU] Disable nested endcf collapse
The assumption is that conditional regions are perfectly nested
and a mask restored at the exit from the inner block will be
completely covered by a mask restored in the outer block.

It turns out with our current structurizer this is not always
the case.

Disable the optimization for now, but I want to keep it around
for a while to either try after further structurizer changes or
to move it into control flow lowering where we have more info
and reuse the test.

Differential Revision: https://reviews.llvm.org/D75958
2020-03-11 11:24:20 -07:00
Jay Foad a46dba24fa [AMDGPU] Extend macro fusion for ADDC and SUBB to SUBBREV
Summary:
There's a lot of test case churn but the overall effect is to increase
the number of back-to-back v_sub,v_subbrev pairs, which can execute with
no delay even on gfx10.

Reviewers: arsenm, rampitec, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75999
2020-03-11 17:59:21 +00:00
Andrzej Warzynski a9f1583228 [AArch64][SVE] Add the @llvm.aarch64.sve.sel intrinsic
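A minimal IR sketch of the new intrinsic (the nxv4i32 form is one of
several data type variants; the exact mangling is assumed):

    define <vscale x 4 x i32> @sel(<vscale x 4 x i1> %p, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
      ; picks lanes of %a where %p is true and lanes of %b elsewhere,
      ; mapping to the SVE SEL instruction
      %r = call <vscale x 4 x i32> @llvm.aarch64.sve.sel.nxv4i32(<vscale x 4 x i1> %p, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b)
      ret <vscale x 4 x i32> %r
    }

    declare <vscale x 4 x i32> @llvm.aarch64.sve.sel.nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>, <vscale x 4 x i32>)
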
Reviewers: sdesmalen, efriedma

Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75928
2020-03-11 17:05:21 +00:00
Philip Reames e671641844 [GC] Remove buggy untested optimization from statepoint lowering
A downstream test case (see included reduced test) revealed that we have a bug in how we handle duplicate relocations. If we have the same SDValue relocated twice, and that value happens to be a constant (such as null), we only export one of the two llvm::Values. Exporting on a per llvm::Value basis is required to allow lowering of gc.relocates in following basic blocks (e.g. invokes). Without it, we end up with a use of an undefined vreg and bad things happen.

Rather than fixing the optimization - which appears to be hard - I propose we simply remove it. There are no tests in tree that change with this code removed. If we find out later that this did matter for something, we can reimplement a variation of this in CodeGenPrepare to catch the easy cases without complicating the lowering code.

Thanks to Denis and Serguei who did all the hard work of figuring out what went wrong here. The patch is by far the easy part. :)

Differential Revision: https://reviews.llvm.org/D75964
2020-03-11 10:03:24 -07:00
David Green 72bf26feb3 [ARM] Extra VFMA tests. NFC 2020-03-11 15:14:07 +00:00
Matt Arsenault a2202f6a3f AMDGPU/GlobalISel: Manually RegBankSelect copies
This was failing on any pre-assigned copy to the VCC bank.

This is something of a workaround for the default implementation in
getInstrMappingImpl, and how it treats copy-like operations in
general.

Copy-like operations are considered to only have one result register
bank, rather than separate banks for each source like a normal
instruction. To avoid potentially mishandling reg_sequence with
impossible operand combinations, the generic implementation errors on
impossible costs. If the bank was already assigned, it is treated
as if it were an unsatisfiable REG_SEQUENCE mapping. We really don't
get any value from any of what getInstrMappingImpl tries to do for
copies, so just directly emit the simple mapping we really want.
2020-03-11 11:12:12 -04:00
Sam Parker 51cad66e97 [NFC][ARM] Add test
Precommit test for LowOverheadLoops.
2020-03-11 11:51:52 +00:00
Simon Pilgrim b3b4727a3e [X86] Replace (most) X86ISD::SHLD/SHRD usage with ISD::FSHL/FSHR generic opcodes (PR39467)
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly; we just need to account for the operand commutation for SHRD.

The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs funnel shift modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which match the generic implementation in all other respects.

Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this but it allows the DAG combines to continue to operate after legalization in a lot more cases.

The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough for the branch fall through code to mess up.

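For reference, the generic funnel-shift form these nodes now lower from
(a sketch, not taken from the patch itself):

    declare i32 @llvm.fshl.i32(i32, i32, i32)

    define i32 @shld(i32 %hi, i32 %lo, i32 %amt) {
      ; conceptually concatenates %hi:%lo, shifts left by %amt modulo 32,
      ; and returns the high half; on x86 this can select SHLD
      %r = call i32 @llvm.fshl.i32(i32 %hi, i32 %lo, i32 %amt)
      ret i32 %r
    }
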
Differential Revision: https://reviews.llvm.org/D75748
2020-03-11 11:17:49 +00:00
Victor Campos 8a12553223 [ARM] Improve codegen of volatile load/store of i64
Summary:
Instead of generating two i32 instructions for each load or store of a volatile
i64 value (two LDRs or STRs), now emit LDRD/STRD.

These improvements cover architectures implementing ARMv5TE or Thumb-2.

The code generation explicitly deviates from using the register-offset
variant of LDRD/STRD. In this variant, the register allocated to the
register-offset cannot be reused in any of the remaining operands. Such
a restriction seems non-trivial to implement in LLVM, so it is
left as a to-do.

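A small IR example of the affected pattern (the expected assembly is
illustrative):

    define i64 @load_volatile_i64(i64* %p) {
      ; previously lowered to two 32-bit LDRs; with this patch, on
      ; ARMv5TE or Thumb-2 targets, expected to become:
      ;   ldrd r0, r1, [r0]
      %v = load volatile i64, i64* %p, align 8
      ret i64 %v
    }
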
Reviewers: dmgreen, efriedma, john.brawn, nickdesaulniers

Reviewed By: efriedma, nickdesaulniers

Subscribers: danielkiss, alanphipps, hans, nathanchance, nickdesaulniers, vvereschaka, kristof.beyls, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70072
2020-03-11 10:19:27 +00:00
QingShan Zhang 9304decdee [NFC][Test] Add a PowerPC test to verify the behavior of a*b +/- c*d 2020-03-11 09:35:40 +00:00
Sebastian Neubauer 2f857eadf5 [AMDGPU] Use script to generate atomic optimizations test
This is a preparation for introducing an llvm.amdgcn.ballot intrinsic in
D65088.
2020-03-11 09:59:36 +01:00
QingShan Zhang 5edf900da0 [NFC][Test] Format the test PowerPC/recipest.ll with update_llc_test_checks.py 2020-03-11 08:49:53 +00:00
Matt Arsenault c0ad75e758 GlobalISel: Don't try to narrow extending loads/trunc store
If the loaded memory size was smaller than the result size, this would
produce out of bounds memory accesses. I'm wondering if we need a
distinct narrow memory legalize action type, since a case I care about
is decomposing a 4-byte unaligned access into 4 extending loads, which
would leave the original result register type. I'm currently awkwardly
using narrowScalar to handle unaligned accesses that need to be split.
2020-03-10 23:34:10 -04:00
Matt Arsenault 206d46a192 AMDGPU/GlobalISel: Add some tests that used to infinite loop 2020-03-10 22:12:56 -04:00
Carl Ritson d07f9e7309 [AMDGPU] Allow struct.buffer.*.format intrinsics to accept i32
Summary:
In the same manner as struct.buffer.load / struct.buffer.store,
allow struct.buffer.load.format / struct.buffer.store.format to
return / accept any type.  This simplifies front-end code gen.

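A sketch of the newly permitted flexibility (the v4f32 form already
existed; the v4i32 return type is the kind this patch allows):

    declare <4 x float> @llvm.amdgcn.struct.buffer.load.format.v4f32(<4 x i32>, i32, i32, i32, i32)
    declare <4 x i32> @llvm.amdgcn.struct.buffer.load.format.v4i32(<4 x i32>, i32, i32, i32, i32)

    define <4 x i32> @load_format_i32(<4 x i32> %rsrc, i32 %idx) {
      ; same buffer operation, returning integers without a bitcast
      %v = call <4 x i32> @llvm.amdgcn.struct.buffer.load.format.v4i32(<4 x i32> %rsrc, i32 %idx, i32 0, i32 0, i32 0)
      ret <4 x i32> %v
    }
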
Reviewers: tpr, arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75789
2020-03-11 08:20:32 +09:00
Matt Arsenault edd0dfca0d AMDGPU/GlobalISel: Refine G_TRUNC legality rules
Scalarize most truncates. Avoid touching cases that could end up in
unresolvable infinite loops.
2020-03-10 15:32:22 -07:00
Matt Arsenault ce8a1f7294 GlobalISel: Implement fewerElementsVector for G_TRUNC
Extend fewerElementsVectorBasic to handle operands with different
element types.
2020-03-10 15:17:20 -07:00
Matt Arsenault 200b20639a AMDGPU: Use V_MAC_F32 for fmad.ftz
This avoids regressions in a future patch. I'm confused by the gfx9
use of legacy_mad. Was this a pointless instruction rename, or does it
use fmul_legacy handling? Why is regular mac available in that case?
2020-03-10 14:41:06 -07:00
Matt Arsenault c4de8935a5 ARM: Fixup some tests using denormal-fp-math attribute
Don't use the deprecated, single mode form in tests. Also make sure to
parse the attribute, in case of the deprecated form.
2020-03-10 14:02:06 -04:00
Kazushi (Jam) Marukawa 3dabad1af3 [VE] Target-specific bit size for sjljehprepare
Summary:
This patch extends the TargetMachine to let targets specify the integer size
used by the sjljehprepare pass. This is 64bit for the VE target and otherwise
defaults to 32bit for all targets, which was hard-wired before.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D71337
2020-03-10 17:51:16 +01:00
Simon Pilgrim c8ede5e485 [X86][SSE] getFauxShuffleMask - add support for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT) shuffle pattern
We already do this for PINSRB/PINSRW and SCALAR_TO_VECTOR.
2020-03-10 15:42:37 +00:00
Simon Pilgrim e6a7e3b5e3 [X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles.
This causes one minor test change but is mainly necessary for an upcoming patch.
2020-03-10 15:42:36 +00:00
Simon Pilgrim 417fe39be5 [X86][SSE] Add some extract+insert shuffle tests
Shows failure to avoid xmm<->gpr transfers by using insertps/blendps
2020-03-10 15:42:36 +00:00
Matt Arsenault 67cfbec746 AMDGPU/GlobalISel: Insert readfirstlane on SGPR returns
In case the source value ends up in a VGPR, insert a readfirstlane to
avoid producing an illegal copy later. If it turns out to be
unnecessary, it can be folded out.
2020-03-10 11:18:48 -04:00
Jonas Paulsson 62ff9960d3 [SystemZ] Improve foldMemoryOperandImpl().
Swap the compare operands if the LHS is spilled, while updating the CCMasks of
the CC users. This is relatively straightforward since the live-in lists for
the CC register can be assumed to be correct during register allocation
(thanks to 659efa2).

Also fold a spilled operand of an LOCR/SELR into an LOC(G).

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D67437
2020-03-10 15:54:47 +01:00
Simon Pilgrim e71fb46a8f [TargetLowering] SimplifyDemandedVectorElts - add DemandedElts mask to ISD::BITCAST SimplifyDemandedBits call.
This fixes most of the regressions introduced in the rG4bc6f6332028 bugfix. The vector-trunc.ll issue should be fixed by D66004.
2020-03-10 13:39:10 +00:00
Simon Pilgrim b9b96adcf5 [X86][SSE] Add SSE41 coverage for fmaxnum/fminnum tests 2020-03-10 11:18:27 +00:00
alex-t 39e1a90784 [AMDGPU] SI_INDIRECT_DST_V* pseudos expansion should place EXEC restore to separate basic block
Summary:
When SI_INDIRECT_DST_V* pseudos have their index in a VGPR, they get expanded into a self-looping basic block that modifies EXEC in a loop.

To keep EXEC consistent, it is saved before and restored after the pseudo expansion.

%95:vreg_512 = SI_INDIRECT_DST_V16 %93:vreg_512(tied-def 0), %94:sreg_32, 0, killed %1500:vgpr_32

results in

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_mov_b64 exec, s[6:7]

The bug appeared when this expansion occurred in the ELSE block of the control flow.

Originally

  %110:vreg_512 = SI_INDIRECT_DST_V16 %103:vreg_512(tied-def 0), %85:vgpr_32, 0, %107:vgpr_32,
   %112:sreg_64 = SI_ELSE %108:sreg_64, %bb.19, 0, implicit-def dead $exec, implicit-def dead $scc, implicit $exec

expanded to

         ******************   <== here exec has "THEN" context

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_or_saveexec_b64 s[4:5], s[4:5]   <-- exec mask is restored for "ELSE" but immediately overwritten.
    s_mov_b64 exec, s[6:7]

The rest of the "ELSE" block is executed not by the workitems which constitute the "else mask" but by those which constitute "then mask"

SILowerControlFlow::emitElse always considers the basic block begin() as an insertion point for s_or_saveexec.

Proposed fix: the SI_INDIRECT_DST_V* expansion should split the remainder block to create a landing pad for the EXEC restoration.

Reviewers: rampitec, vpykhtin, nhaehnle

Reviewed By: vpykhtin

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75472
2020-03-10 14:04:22 +03:00
Kerry McLaughlin 0bba37a320 [AArch64][SVE] Add SVE intrinsics for address calculations
Summary: Adds the @llvm.aarch64.sve.adr[b|h|w|d] intrinsics

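A minimal IR sketch for one of the variants (the exact mangling is
assumed from the usual SVE conventions):

    define <vscale x 2 x i64> @adr_bytes(<vscale x 2 x i64> %base, <vscale x 2 x i64> %offs) {
      ; per-lane address calculation; expected to lower to something
      ; like: adr z0.d, [z0.d, z1.d]
      %r = call <vscale x 2 x i64> @llvm.aarch64.sve.adrb.nxv2i64(<vscale x 2 x i64> %base, <vscale x 2 x i64> %offs)
      ret <vscale x 2 x i64> %r
    }

    declare <vscale x 2 x i64> @llvm.aarch64.sve.adrb.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i64>)
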
Reviewers: sdesmalen, andwar, efriedma, dancgr, cameron.mcinally, rengolin

Reviewed By: sdesmalen

Subscribers: tschuett, kristof.beyls, hiraditya, rkruppe, psnobl, danielkiss, cfe-commits, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75858
2020-03-10 10:53:37 +00:00
James Greenhalgh f0de8d0940 [Arm] Do not lower vmax/vmin to Neon instructions
On some Arm cores there is a performance penalty when forwarding from an
S register to a D register.  Calculating VMAX in a D register creates
false forwarding hazards, so don't do that unless we're on a core which
specifically asks for it.

Patch by James Greenhalgh

Differential Revision: https://reviews.llvm.org/D75248
2020-03-10 10:48:48 +00:00
Simon Pilgrim 18c19441d1 [X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.

At some point soon we're going to have to make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging), which means we miss out on a lot of optimizations.
2020-03-10 10:44:28 +00:00
Clement Courbet 30477197b3 [ExpandMemCmp][NFC] Add more tests. 2020-03-10 11:34:19 +01:00
Sam Parker ff9ac33e1e [ARM][MVE] Validate tail predication values
Iterate through the loop and check that the observable values
produced are the same whether tail predication happens or not.

We want to find out if the tail-predicated version of this loop will
produce the same values as the loop in its original form. For this to
be true, the newly inserted implicit predication must not change the
(observable) results.

We're doing this because many instructions in the loop will not be
predicated and so the conversion from VPT predication to tail
predication can result in different values being produced, because of
falsely predicated lanes not being updated in the converted form.

A masked load, whether through VPT or tail predication, will write
zeros to any of the falsely predicated bytes. So, from the loads, we
know that the false lanes are zeroed, and here we're trying to track
that those false lanes remain zero or, where they change, that the
differences are masked away by their user(s).

All MVE loads and stores have to be predicated, so we know that any
load operands, or stored results are equivalent already. Other
explicitly predicated instructions will perform the same operation in
the original loop and the tail-predicated form too. Because of this,
we can insert loads, stores and other predicated instructions into
our KnownFalseZeros set and build from there.

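The zeroing property above corresponds to a masked load with a zero
passthru; a sketch in IR (the vctp intrinsic computes the tail
predicate from the remaining element count):

    declare <4 x i1> @llvm.arm.mve.vctp32(i32)
    declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)

    define <4 x i32> @tail_load(<4 x i32>* %p, i32 %remaining) {
      %mask = call <4 x i1> @llvm.arm.mve.vctp32(i32 %remaining)
      ; lanes where %mask is false read as zero, matching the zeroing
      ; behaviour of a predicated MVE load
      %v = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %p, i32 4, <4 x i1> %mask, <4 x i32> zeroinitializer)
      ret <4 x i32> %v
    }
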
Differential Revision: https://reviews.llvm.org/D75452
2020-03-10 09:59:01 +00:00
Djordje Todorovic 5aa5c943f7 Reland "[DebugInfo] Enable the debug entry values feature by default"
Differential Revision: https://reviews.llvm.org/D73534
2020-03-10 09:15:06 +01:00
Puyan Lotfi 4b8af31f63 [llvm][MIRVRegNamer] Avoid collisions across constant pool indices.
When hashing on MachineOperand::MO_ConstantPoolIndex, MIR-Canon and the
MIRVRegNamer will no longer produce hash collisions.

Differential Revision: https://reviews.llvm.org/D74449
2020-03-10 01:13:20 -04:00
Matt Arsenault 627bb31a28 AMDGPU/GlobalISel: Avoid illegal vector exts for add/sub/mul
When expanding scalar packed operations, we should not introduce the
illegal vector casts that LegalizerHelper introduces. We're not in a
legalizer context, and there's no RegBankSelect apply or legalize
worklist.
2020-03-09 23:42:17 -04:00
Matt Arsenault ed72bcae34 AMDGPU/GlobalISel: Fix mishandling SGPR v2s16 add/sub/mul
We weren't considering the packed case correctly, and this was passing
through to the selector. The selector only checked the size, so this
would incorrectly compile to a single 32-bit scalar add.

As usual, the LegalizerHelper is somewhat awkward to use from
applyMappingImpl. I think this is the first place we've needed
multi-step legalization here though.
2020-03-09 22:51:54 -04:00
Matt Arsenault eb41627799 AMDGPU/GlobalISel: Improve handling of illegal return types
Most importantly, this fixes ret i8. Also make sure to handle
signext/zeroext for odd types > i32. Some of the corresponding
argument passing fixes also need to be handled.
2020-03-09 13:11:30 -07:00
Matt Arsenault 156a1b59df AMDGPU: Make signext/zeroext behave more sensibly over > i32
Interpret these as extending to the next multiple of 32 bits. Previously this
had no effect with i48, for example, which is really split into {i32, i16};
the high part should be extended.
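
A sketch of what the new interpretation means for an odd-sized value
(hypothetical function, assuming this convention):

    define i48 @pass_through(i48 signext %x) {
      ; %x is really split into {i32, i16}; with signext, the i16 high
      ; part is now sign-extended to fill the next 32-bit register
      ret i48 %x
    }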
2020-03-09 12:56:10 -07:00
Matt Arsenault 209094eeb6 AMDGPU/GlobalISel: Start matching s_lshlN_add_u32 instructions
Use a hack to only enable this for GlobalISel.

Technically this also works with SelectionDAG, but the divergence
selection isn't reliable enough and a few cases fail, but I have no
desire to spend time writing the manual expansion code for it. The DAG
actually does a better job since it catches using v_add_lshl_u32 in
the mixed SGPR/VGPR cases.
2020-03-09 12:36:51 -07:00
Cameron McInally 2ab8065df6 [AArch64][SVE] Add missing fp16 DestructiveInstType tests
These tests should have been added with a5b22b768f in D73711.

Differential Revision: https://reviews.llvm.org/D75767
2020-03-09 14:09:23 -05:00
Simon Pilgrim 4b130b883d [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - reduce vector width of X86ISD::BLENDI
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.

This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is I don't consider that a showstopper, just an annoyance (and there's plenty of other poor codegen cases in that file).
2020-03-09 18:29:28 +00:00
Craig Topper 3dcc0db15e [X86] Teach combineToExtendBoolVectorInReg to create opportunities for using broadcast load instructions.
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.

Previously we any_extended in the scalar domain before inserting.

This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.

This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.

Differential Revision: https://reviews.llvm.org/D75835
2020-03-09 11:26:12 -07:00
Krzysztof Parzyszek 44205891ed [Hexagon] Fix match pattern in a testcase 2020-03-09 09:09:58 -05:00
Djordje Todorovic c15c68abdc [CallSiteInfo] Enable the call site info only for -g + optimizations
Emit call site info only when '-g' is combined with an optimization level above 'O0'.

Differential Revision: https://reviews.llvm.org/D75175
2020-03-09 12:12:44 +01:00
KAWASHIMA Takahiro c8cd1a994d [AArch64] Add support for Fujitsu A64FX
A64FX is an Armv8.2-A CPU used in FUJITSU Supercomputer
PRIMEHPC FX1000, PRIMEHPC FX700, and supercomputer Fugaku.

https://www.fujitsu.com/global/products/computing/servers/supercomputer/specifications/

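With this change the CPU can be selected with -mcpu=a64fx, or per
function via the target-cpu attribute (a minimal sketch):

    define void @f() #0 {
      ret void
    }
    attributes #0 = { "target-cpu"="a64fx" }
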
Differential Revision: https://reviews.llvm.org/D75594
2020-03-09 19:15:09 +09:00
Clement Courbet 6518b72f93 [ExpandMemCmp] Properly constant-fold all compares.
Summary:
This gets rid of duplicated code and diverging behaviour w.r.t.
constants.
Fixes PR45086.

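A sketch of the kind of fully constant compare that should now fold
(hypothetical globals, not the exact PR45086 reproducer):

    @a = constant [4 x i8] c"abcd"
    @b = constant [4 x i8] c"abce"

    declare i32 @memcmp(i8*, i8*, i64)

    define i1 @cmp() {
      ; both operands are constant, so the expanded compare should
      ; constant-fold rather than emit loads
      %c = call i32 @memcmp(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @a, i64 0, i64 0),
                            i8* getelementptr inbounds ([4 x i8], [4 x i8]* @b, i64 0, i64 0), i64 4)
      %r = icmp eq i32 %c, 0
      ret i1 %r
    }
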
Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75519
2020-03-09 10:40:52 +01:00
Clement Courbet f7e6f5f8e3 [ExpandMemCmp] Properly constant-fold all compares.
Summary:
This gets rid of duplicated code and diverging behaviour w.r.t.
constants.
Fixes PR45086.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75519
2020-03-09 09:10:34 +01:00
Craig Topper 07d68c24aa [X86] Remove isel patterns that matched vXi16 X86VBroadcast with i8->i16 aextload input.
This was selecting VBROADCASTW, which turned the 8-bit load into
a 16-bit load if it happened to be 2-byte aligned.

I have a plan to fix the regression with a follow up patch
which I'll post shortly.
2020-03-08 19:16:24 -07:00