Commit Graph

3464 Commits

Author SHA1 Message Date
Sebastian Neubauer 6e29846b29 [AMDGPU] Fix whole wavefront mode
We cannot move wwm over exec copies because the exec register needs an exact exec mask.

Differential Revision: https://reviews.llvm.org/D76232
2020-03-17 17:23:23 +01:00
Matt Arsenault 039c917b43 AMDGPU/GlobalISel: Fix asserting on gather4 intrinsics 2020-03-17 11:07:30 -04:00
alex-t 48a9cf9043 [AMDGPU] Enable SEXT divergence driven selection.
Summary: This change enable the divergence driven selection for the SEXT DAG opcode.

Reviewers: vpykhtin, rampitec

Reviewed By: vpykhtin

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Differential Revision: https://reviews.llvm.org/D76230
2020-03-17 17:30:11 +03:00
Matt Arsenault d0fe13ecf9 AMDGPU/GlobalISel: Fully handle 0 dmask case during legalize
For normal loads, fully eliminate the load. For the TFE case, adjust
the dmask value in the instruction so the selector doesn't need to
handle it. For the TFE special case, I guess it would be possible to
replace the loaded data register with undef, but as-is this will start
treating it as a well defined value.
2020-03-17 10:15:30 -04:00
Matt Arsenault d9a012ed8a AMDGPU/GlobalISel: Adjust image load register type based on dmask
Trim elements that won't be written. The equivalent still needs to be
done for writes. Also start widening 3 elements to 4
elements. Selection will get the count from the dmask.
2020-03-17 10:09:18 -04:00
Matt Arsenault 83ffbf2618 AMDGPU/GlobalISel: Legalize non-a16 non-NSA images 2020-03-17 10:02:09 -04:00
Matt Arsenault 2aba9b6cf8 AMDGPU/GlobalISel: Legalize a16 images
Pack the address registers in the legalizer. Avoid introducing a huge
family of new intermediate operations by filling dead operands with
noreg.
2020-03-17 10:02:09 -04:00
Matt Arsenault b0bdb186f5 Utils: Always set alignment when expanding mem intrinsics
This was creating natural aligned loads and stores, which may not be
the case. The target could request a wider type load with less
alignment.
2020-03-16 14:34:29 -04:00
Matt Arsenault 80b627d69d AMDGPU/GlobalISel: Fix handling of G_ANYEXT with s1 source
We were letting G_ANYEXT with a vcc register bank through, which was
incorrect and would select to an invalid copy. Fix this up like G_ZEXT
and G_SEXT. Also drop old code to fixup the non-boolean case in
RegBankSelect. We now have to perform that expansion during selection,
so there's no benefit to doing it during RegBankSelect.
2020-03-16 12:59:54 -04:00
Matt Arsenault c460dc6eeb AMDGPU/GlobalISel: Fix some illegal scalar argument types
Fixes integers that don't evenly divide to i32 pieces. We should
probably extract some of the code in the legalizer to start handling
argument breakdowns. I'm dissatisfied with the argument lowering's
handling of vectors for example, and we should not be producing the
weird G_EXTRACTs we do now.
2020-03-16 12:51:23 -04:00
Matt Arsenault 84386b2d8a AMDGPU: Drop special case f64 fround lowering
The result is better if ftrunc is emitted and separately legalized
when unavailable.
2020-03-16 12:09:30 -04:00
Matt Arsenault 19a0350187 GlobalISel: Fix round lowering
I used the implementation for floor instead of round. It also turns
out the OpenCL builtin library wasn't using the round builtin, but
implemented the expanded form.
2020-03-16 11:37:30 -04:00
Dominik Montada 8ff2dcb18b [GlobalISel] add additional lowering support for G_INSERT
Summary: Add lowering support for inserting pointers or scalars into scalars, vectors or pointers

Reviewers: arsenm, dsanders

Reviewed By: arsenm

Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, volkan, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75994
2020-03-16 16:27:17 +01:00
Guillaume Chatelet 4efec6e1c0 Revert "Disable memcpy-inline-fails.ll for windows"
This reverts commit adc2e250a1.
2020-03-16 16:03:39 +01:00
Matt Arsenault 57d896e838 AMDGPU/GlobalISel: Make some large merges legal
We allow up to 1024-bit registers, so we should support merges all the
way to the maximum.
2020-03-16 10:49:10 -04:00
Fangrui Song 536ba6373f [Object] Change ELFObjectFile<ELFT>::getFileFormatName() to use BFD names
Follow-up for D74433

What the function returns are almost standard BFD names, except that "ELF" is
in uppercase instead of lowercase.

This patch changes "ELF" to "elf" and changes ARM/AArch64 to use their BFD names.
MIPS and PPC64 have endianness differences as well, but this patch does not intend to address them.

Advantages:

* llvm-objdump: the "file format " line matches GNU objdump on ARM/AArch64 objects
* "file format " line can be extracted and fed into llvm-objcopy -O literally.
  (https://github.com/ClangBuiltLinux/linux/issues/779 has such a use case)

Affected tools: llvm-readobj, llvm-objdump, llvm-dwarfdump, MCJIT (internal implementation detail, not exposed)

Reviewed By: jhenderson

Differential Revision: https://reviews.llvm.org/D76046
2020-03-16 07:42:04 -07:00
Simon Pilgrim 2b3b453a82 [TargetLowering] Only demand a funnelshift's modulo amount bits
ISD::FSHL/FSHR shift amount values are guaranteed to act as a modulo amount, so for power-of-2 bitwidths we only need the lowest bits.
2020-03-16 13:52:17 +00:00
Dominik Montada c0241f150d [GlobalISel] combine G_TRUNC with G_MERGE_VALUES
Summary:
Truncating the result of a merge means that most likely we could have done without merge in the first place and just used the input merge inputs directly. This can be done in three cases:

1. If the truncation result is smaller than the merge source, we can use the source in the trunc directly
2. If the sizes are the same, we can replace the register or use a copy
3. If the truncation size is a multiple of the merge source size, we can build a smaller merge

This gets rid of most of the larger, hard-to-legalize merges.

Reviewers: qcolombet, aditya_nandakumar, aemerson, paquette, arsenm, Petar.Avramovic

Reviewed By: arsenm

Subscribers: sdardis, jvesely, wdng, nhaehnle, rovka, jrtc27, atanasyan, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75915
2020-03-16 14:42:01 +01:00
Guillaume Chatelet adc2e250a1 Disable memcpy-inline-fails.ll for windows 2020-03-16 14:10:08 +01:00
Fangrui Song ecd6d7254e [test] llvm/test/: change llvm-objdump single-dash long options to double-dash options
As announced here: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131786.html

Grouped option syntax (POSIX Utility Conventions) does not play well with -long-option
A subsequent change will reject -long-option.
2020-03-15 17:46:23 -07:00
Matt Arsenault ce33926342 AMDGPU/GlobalISel: Remove -global-isel-abort=0 from some tests 2020-03-15 17:22:34 -04:00
Matt Arsenault fe6037172b AMDGPU/GlobalISel: Add more tests for G_SADDE/G_SSUBE
These don't work, but add baseline tests.
2020-03-15 16:54:40 -04:00
Matt Arsenault 79cda46e49 AMDGPU/GlobalISel: Add baseline test for mul 2020-03-15 14:53:51 -04:00
Matt Arsenault de5b2cfdd4 AMDGPU/GlobalISel: Add baseline test for mul 2020-03-15 14:50:36 -04:00
Simon Pilgrim 5641804298 [DAG] MatchRotate - Add funnel shift by variable support
Followup to D75114, this patch reuses the existing MatchRotate ROTL/ROTR rotation pattern code to also recognize the more general FSHL/FSHR funnel shift patterns when we have variable shift amounts, matched with MatchFunnelPosNeg which acts in an (almost) equivalent manner to MatchRotatePosNeg.
2020-03-15 11:50:45 +00:00
Stanislav Mekhanoshin c262b69dcc [AMDGPU] Fix endcf collapse
Only collapse inner endcf if the outer one belongs to SI_IF.
If it does belong to SI_ELSE then mask being restored in fact
a partial inverse of what we need.

Differential Revision: https://reviews.llvm.org/D76154
2020-03-13 13:50:21 -07:00
Matt Arsenault 015b640be4 AMDGPU: Add flag to used fixed function ABI
Pass all arguments to every function, rather than only passing the
minimum set of inputs needed for the call graph.
2020-03-13 13:27:05 -07:00
Matt Arsenault bb8622094d AMDGPU: Don't handle kernarg.segment.ptr in functions
Just lower this to null. Pass implicitarg.ptr in its place in the
argument list.
2020-03-13 12:51:12 -07:00
Stanislav Mekhanoshin 32e90cbcd1 [AMDGPU] Disable endcf collapse
There are some functional regressions and I suspect our
scopes are not as perfectly enclosed as I expected.
Disable it for now.

Differential Revision: https://reviews.llvm.org/D76148
2020-03-13 12:33:22 -07:00
Matt Arsenault ccc6e780c8 AMDGPU: Directly annotate functions if they have calls
Currently we infer whether the flat-scratch-init kernel input should
be enabled based on calls. Move this handling, so we can decide if the
full set of ABI inputs is needed in kernels. Ideally we would have an
analysis of some sort, rather than the function attributes.
2020-03-12 19:10:59 -04:00
Stanislav Mekhanoshin a73528649c [AMDGPU] Simplify exec copies
The patch removes late endcf handling and only leaves the
related portion with redundant exec mask copy elimination.

Differential Revision: https://reviews.llvm.org/D76095
2020-03-12 14:54:19 -07:00
Simon Pilgrim e91feeed21 [AMDGPU] Add ISD::FSHR -> ALIGNBIT support
This patch allows ISD::FSHR(i32) patterns to lower to ALIGNBIT instructions.

This improves test coverage of ISD::FSHR matching - x86 has both FSHL/FSHR instructions and we prefer FSHL by default.

Differential Revision: https://reviews.llvm.org/D76070
2020-03-12 20:16:57 +00:00
Stanislav Mekhanoshin 360aff0493 [AMDGPU] Simplify nested SI_END_CF
This is to replace the optimization from the SIOptimizeExecMaskingPreRA.
We have less opportunities in the control flow lowering because many
VGPR copies are still in place and will be removed later, but we know
for sure an instruction is SI_END_CF and not just an arbitrary S_OR_B64
with EXEC.

The subsequent change needs to convert s_and_saveexec into s_and and
address new TODO lines in tests, then code block guarded by the
-amdgpu-remove-redundant-endcf option in the pre-RA exec mask optimizer
will be removed.

Differential Revision: https://reviews.llvm.org/D76033
2020-03-12 11:25:07 -07:00
Simon Pilgrim fa8ce7c0fa [AMDGPU] Add some funnel shift intrinsic test coverage 2020-03-12 12:53:57 +00:00
Stanislav Mekhanoshin d4757a6cf1 [AMDGPU] pre-commit collapse-endcf.mir. NFC.
Pre commit test before D76033.
2020-03-11 16:15:29 -07:00
Matt Arsenault 1e0c540360 AMDGPU: Don't hard error on LDS globals in functions
Instead, emit a trap and a warning. We force inlining of this
situation, so any function where this happens should be dead as
indirect or external calls are not yet supported. This should avoid
erroring on dead code.
2020-03-11 15:34:11 -04:00
Stanislav Mekhanoshin 9801e5469b [AMDGPU] Disable nested endcf collapse
The assumption is that conditional regions are perfectly nested
and a mask restored at the exit from the inner block will be
completely covered by a mask restored in the outer.

It turns out with our current structurizer this is not always
the case.

Disable the optimization for now, but I want to keep it around
for a while to either try after further structurizer changes or
to move it into control flow lowering where we have more info
and reuse the test.

Differential Revision: https://reviews.llvm.org/D75958
2020-03-11 11:24:20 -07:00
Jay Foad a46dba24fa [AMDGPU] Extend macro fusion for ADDC and SUBB to SUBBREV
Summary:
There's a lot of test case churn but the overall effect is to increase
the number of back-to-back v_sub,v_subbrev pairs, which can execute with
no delay even on gfx10.

Reviewers: arsenm, rampitec, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75999
2020-03-11 17:59:21 +00:00
Matt Arsenault a2202f6a3f AMDGPU/GlobalISel: Manually RegBankSelect copies
This was failng on any pre-assigned copy to the VCC bank.

This is something of a workaround for the default implementation in
getInstrMappingImpl, and how it treats copy-like operations in
general.

Copy-like operations are considered to only have one result register
bank, rather than separate banks for each source like a normal
instruction. To avoid potentially mishandling reg_sequence with
impossible operand combinations, the generic implementation errors on
impossible costs. If the bank was already assigned, is treated it
as-if it were an unsatisfiable REG_SEQUENCE mapping. We really don't
get any value from any of what getInstrMappingImpl tries to do for
copies, so just directly emit the simple mapping we really want.
2020-03-11 11:12:12 -04:00
Sebastian Neubauer 2f857eadf5 [AMDGPU] Use script to generate atomic optimizations test
This is a preparation for introducing a llvm.amdgcn.ballot intrinsic in
D65088.
2020-03-11 09:59:36 +01:00
Matt Arsenault c0ad75e758 GlobalISel: Don't try to narrow extending loads/trunc store
If the loaded memory size was smaller than the result size, this would
produce out of bounds memory accesses. I'm wondering if we need a
distinct narrow memory legalize action type, since a case I care about
is decomposing a 4-byte unaligned access into 4 extending loads, which
would leave the original result register type. I'm currently awkwardly
using narrowScalar to handle unaligned accesses that need to be split.
2020-03-10 23:34:10 -04:00
Matt Arsenault 206d46a192 AMDGPU/GlobalISel: Add some tests that used to infinite loop 2020-03-10 22:12:56 -04:00
Carl Ritson d07f9e7309 [AMDGPU] Allow struct.buffer.*.format intrinsics to accept i32
Summary:
In the same manner as struct.buffer.load / struct.buffer.store,
allow struct.buffer.load.format / struct.buffer.store.format to
return / accept any type.  This simplifies front-end code gen.

Reviewers: tpr, arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75789
2020-03-11 08:20:32 +09:00
Matt Arsenault edd0dfca0d AMDGPU/GlobalISel: Refine G_TRUNC legality rules
Scalarize most truncates. Avoid touching cases that could end up in
unresolvable infinite loops.
2020-03-10 15:32:22 -07:00
Matt Arsenault ce8a1f7294 GlobalISel: Implement fewerElementsVector for G_TRUNC
Extend fewerElementsVectorBasic to handle operands with different
element types.
2020-03-10 15:17:20 -07:00
Matt Arsenault 200b20639a AMDGPU: Use V_MAC_F32 for fmad.ftz
This avoids regressions in a future patch. I'm confused by the use of
the gfx9 usage legacy_mad. Was this a pointless instruction rename, or
uses fmul_legacy handling? Why is regular mac avilable in that case?
2020-03-10 14:41:06 -07:00
Matt Arsenault 67cfbec746 AMDGPU/GlobalISel: Insert readfirstlane on SGPR returns
In case the source value ends up in a VGPR, insert a readfirstlane to
avoid producing an illegal copy later. If it turns out to be
unnecessary, it can be folded out.
2020-03-10 11:18:48 -04:00
alex-t 39e1a90784 [AMDGPU] SI_INDIRECT_DST_V* pseudos expansion should place EXEC restore to separate basic block
Summary:
When SI_INDIRECT_DST_V* pseudos has indexes in VGPR, they get expanded into the self-looped basic block that modifies EXEC in a loop.

To keep EXEC consistent it is stored before and then re-stored after the pseudo expansion result.

%95:vreg_512 = SI_INDIRECT_DST_V16 %93:vreg_512(tied-def 0), %94:sreg_32, 0, killed %1500:vgpr_32

results to

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_mov_b64 exec, s[6:7]

The bug appeared in case this expansion occurs in the ELSE block of the CF.

Originally

  %110:vreg_512 = SI_INDIRECT_DST_V16 %103:vreg_512(tied-def 0), %85:vgpr_32, 0, %107:vgpr_32,
   %112:sreg_64 = SI_ELSE %108:sreg_64, %bb.19, 0, implicit-def dead $exec, implicit-def dead $scc, implicit $exec

expanded to

         ******************   <== here exec has "THEN" context

    s_mov_b64 s[6:7], exec
BB0_16:
    v_readfirstlane_b32 s8, v28
    v_cmp_eq_u32_e32 vcc, s8, v28
    s_and_saveexec_b64 vcc, vcc
    s_set_gpr_idx_on s8, gpr_idx(DST)
    v_mov_b32_e32 v6, v25
    s_set_gpr_idx_off
    s_xor_b64 exec, exec, vcc
    s_cbranch_execnz BB0_16
; %bb.17:
    s_or_saveexec_b64 s[4:5], s[4:5]   <-- exec mask is restored for "ELSE" but immediately overwritten.
    s_mov_b64 exec, s[6:7]

The rest of the "ELSE" block is executed not by the workitems which constitute the "else mask" but by those which constitute "then mask"

SILowerControlFlow::emitElse always considers the basic block begin() as an insertion point for s_or_saveexec.

Proposed fix:  The SI_INDIRECT_DST_V* procedure should split the reminder block to create landing pad for the EXEC restoration.

Reviewers: rampitec, vpykhtin, nhaehnle

Reviewed By: vpykhtin

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75472
2020-03-10 14:04:22 +03:00
Matt Arsenault 627bb31a28 AMDGPU/GlobalISel: Avoid illegal vector exts for add/sub/mul
When expanding scalar packed operations, we should not introduce
illegal vector casts LegalizerHelper introduces. We're not in a
legalizer context, and there's no RegBankSelect apply or legalize
worklist.
2020-03-09 23:42:17 -04:00
Matt Arsenault ed72bcae34 AMDGPU/GlobalISel: Fix mishandling SGPR v2s16 add/sub/mul
We weren't considering the packed case correctly, and this was passing
through to the selector. The selector only checked the size, so this
would incorrectly compile to a single 32-bit scalar add.

As usual, the LegalizerHelper is somewhat awkward to use from
applyMappingImpl. I think this is the first place we've needed
multi-step legalization here though.
2020-03-09 22:51:54 -04:00
Matt Arsenault eb41627799 AMDGPU/GlobalISel: Improve handling of illegal return types
Most importantly, this fixes ret i8. Also make sure to handle
signext/zeroext for odd types > i32. Some of the corresponding
argument passing fixes also need to be handled.
2020-03-09 13:11:30 -07:00
Matt Arsenault 156a1b59df AMDGPU: Make signext/zeroext behave more sensibly over > i32
Interpret these as extending to the next multiple of 32-bits. This had
no effect with i48 for example, which is really split into {i32, i16},
which should extend the high part.
2020-03-09 12:56:10 -07:00
Matt Arsenault 209094eeb6 AMDGPU/GlobalISel: Start matching s_lshlN_add_u32 instructions
Use a hack to only enable this for GlobalISel.

Technically this also works with SelectionDAG, but the divergence
selection isn't reliable enough and a few cases fail, but I have no
desire to spend time writing the manual expansion code for it. The DAG
actually does a better job since it catches using v_add_lshl_u32 in
the mixed SGPR/VGPR cases.
2020-03-09 12:36:51 -07:00
Amara Emerson c1a97e992d Revert "Revert "[GlobalISel][Localizer] Enable intra-block localization of already-local uses.""
This reverts commit 5583c2f2fb.

The lldb bot failure was a test that was fragile and sensitive to irrelevant
changes in instruction ordering. Re-committing this as the test should have
been skipped for AArch64 now.

Differential Revision: https://reviews.llvm.org/D75555
2020-03-06 21:35:08 -08:00
Fangrui Song 71e2ca6e32 [llvm-objdump] -d: print `00000000 <foo>:` instead of `00000000 foo:`
The new behavior matches GNU objdump. A pair of angle brackets makes tests slightly easier.

`.foo:` is not unique and thus cannot be used in a `CHECK-LABEL:` directive.
Without `-LABEL`, the CHECK line can match the `Disassembly of section`
line and causes the next `CHECK-NEXT:` to fail.

```
Disassembly of section .foo:

0000000000001634 .foo:
```

Bdragon: <> has metalinguistic connotation. it just "feels right"

Reviewed By: rupprecht

Differential Revision: https://reviews.llvm.org/D75713
2020-03-05 18:05:28 -08:00
Jordan Rupprecht c140810ea1 [llvm-readobj] Include section name of notes.
This changes the output of `llvm-readelf -n` from:

```
Displaying notes found at file offset 0x<...> with length 0x<...>:
```

to:

```
Displaying notes found in: .note.foo
```

And similarly, adds a `Name:` field to the `llvm-readobj -n` output for notes.

This change not only increases GNU compatibility, it also makes it much easier to read notes. Note that we still fall back to printing the file offset/length in cases where we don't have a section name, such as when printing notes in program headers or printing notes in a partially stripped file (GNU readelf does the same).

Fixes llvm.org/PR41339.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D75647
2020-03-05 09:53:14 -08:00
Rodrigo Dominguez 4313543de1 AMDGPU: Add/Fix tests for image atomic intrinsic.
Summary:
Add tests for 64-bit image atomic swap and cmpswap.
Fix tests for 32-bit image atomic add.

Change-Id: Ibb7619749c1ad504b24aa1c5f3185417a3013f3c

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, jfb, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75295
2020-03-05 12:18:15 -05:00
David Stuttard a74b33f612 AMDGPU: Fix SMRD test in trivially disjoint mem access code
Summary:
This seems like an obvious error - cut and paste issue?
The change does make a change to one of the lit tests - it stops s_buffer_load
re-ordering past an MUBUF instruction (which is not surprising).

Change-Id: I80be99de5b62af4f42e91af2591b76a52ac9efa6

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75686
2020-03-05 17:14:01 +00:00
Sameer Sahasrabuddhe 42febbab91 StructurizeCFG: simplify phi nodes when possible
After structurization, some phi nodes can have a single incoming edge
and can be simplified away. This change runs a simplify query on all
phis that are either modified or added by the structurizer. This also
moves some phis closer to their use as a side benefit.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D75500
2020-03-05 10:33:15 +05:30
hsmahesha 3fda1fde8f AMDGPU/GlobalISel: Support llvm.trap and llvm.debugtrap intrinsics
Summary: Lower trap and debugtrap intrinsics to AMDGPU machine instruction(s).

Reviewers: arsenm, nhaehnle, kerbowa, cdevadas, t-tye, kzhuravl

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, yaxunl, rovka, dstuttard, tpr, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74688
2020-03-05 08:16:57 +05:30
Muhammad Omair Javaid 5583c2f2fb Revert "[GlobalISel][Localizer] Enable intra-block localization of already-local uses."
This reverts commit e91e1df6ab.
2020-03-05 03:12:28 +05:00
Matt Arsenault 9e1d2afc13 AMDGPU/GlobalISel: Don't use vector G_EXTRACT in arg lowering
Create a wider source vector, and unmerge with dead defs like the
legalizer. The legalization handling for G_EXTRACT is incomplete, and
it's preferrable to keep everything in 32-bit pieces.

We should probably start moving these functions into utils, since we
have a growing number of places that do almost the same thing.
2020-03-04 16:49:01 -05:00
Matt Arsenault f70e7dc17d AMDGPU/GlobalISel: Switch target in argument test
Since this is still largely relying on the DAG argument type lowering
code, this has inherited the problem where i16 vectors have a
different ABI on targets with and without legal i16. Switch to using a
target with legal i16, so the i16 vector argument tests are more
useful.
2020-03-04 16:40:06 -05:00
Matt Arsenault fb0c35fa34 GlobalISel: Set alignment on function argument stack load/store 2020-03-04 16:38:46 -05:00
Simon Pilgrim e2f0093800 [AMDGPU] performCvtF32UByteNCombine - revisit node after src operand simplification.
If SimplifyDemandedBits succeeds in simplifying the byte src, add the CVT_F32_UBYTE node back to the worklist as we might be able to simplify further.

Yet another step towards removing SelectionDAG::GetDemandedBits.
2020-03-04 11:25:50 +00:00
Amara Emerson e91e1df6ab [GlobalISel][Localizer] Enable intra-block localization of already-local uses.
This changes the localizer to attempt intra-block localizer of instructions
that have local uses. This is useful because sometimes the entry block itself
has many uses of constant-like instructions, which would benefit from shortening
live ranges. Previously if an inst had no non-local uses, we wouldn't add it to
the list of instructions to attempt further intra-block localization.

This gives a 0.7% geomean code size improvement on CTMark.

Differential Revision: https://reviews.llvm.org/D75555
2020-03-03 18:14:57 -08:00
Matt Arsenault 88aced1e45 AMDGPU: Fix computation for getOccupancyWithLocalMemSize
The computation here didn't really make sense to me, and reported
wildy different results depending on the flat work group size
attribute.

I think this should really report a range derived from the possible
work group size bounds, and only allow an occupancy that is a multiple
of the group size.
2020-03-03 17:15:57 -05:00
Sameer Sahasrabuddhe 534d8866a1 [AMDGPU] add generated checks for some LIT tests
This is in prepration for further changes that affect these tests.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D75403
2020-03-03 11:47:05 +05:30
Stanislav Mekhanoshin 1bacdcf48d Extend LaneBitmask to 64 bit
This is needed for D74873, AMDGPU going to have 16 bit subregs
and the largest tuple is 32 VGPRs, which results in 64 lanes.

Differential Revision: https://reviews.llvm.org/D75378
2020-03-02 12:10:52 -08:00
Jay Foad 43830790d7 [AMDGPU] Remove dubious logic in bidirectional list scheduler
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.

I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.

To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.

Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.

Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB

Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68338
2020-02-28 21:35:34 +00:00
Jay Foad 4917a9a965 [AMDGPU] Precommit some scheduler related test updates
Summary:
The point of this is to make some tests with manual checks robust
against scheduler tweaks, so that only autogenerated test updates will
be required when pushing D68338 "[AMDGPU] Remove dubious logic in
bidirectional list scheduler".

Reviewers: arsenm, rampitec, vpykhtin

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75302
2020-02-28 11:20:58 +00:00
Stanislav Mekhanoshin 6b813f2762 [AMDGPU] Enable runtime unroll for LDS
We want to do unroll for LDS even for runtime trip count
to combine LDS operations.

Differential Revision: https://reviews.llvm.org/D75293
2020-02-27 12:59:35 -08:00
Sameer Sahasrabuddhe 0c8a218798 [AMDGPU] improve fragile test for divergent branches
Summary:
The affected LIT test intends to test the correct use of divergence
analysis to detect a divergent branch with a uniform predicate. The
passes involved are LLVM IR passes, but the test runs llc and tries to
match against generated ISA, which makes it hard to demonstrate that
the intended behavior was really tested. Replaced this with a test
that invokes opt on the required passes and then checks for the
appropriate changes in the LLVM IR.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D75267
2020-02-27 23:31:03 +05:30
Matt Arsenault 79493e721a AMDGPU/GlobalISel: Add missing test for G_UMULH 2020-02-26 22:30:13 -05:00
Matt Arsenault 6fc0d00823 GlobalISel: Fix lowering for G_UADDE/G_USUBE
The type parameter passed into lower is invalid and should be removed
from the function.
2020-02-26 19:10:52 -08:00
Matt Arsenault 6dcf43102c AMDGPU/GlobalISel: Add missing G_[US]ADDE/G_[US]SUBE tests
The s64 case currently crashes, so leave that for later.
2020-02-26 19:10:34 -08:00
Jay Foad 09a6b26753 AMDGPU: Fix some more incorrect check lines 2020-02-26 14:37:22 +00:00
Nicolai Hähnle 0f1df48925 AMDGPU/SIInsertSkips: Fix the determination of whether early-exit-after-kill is possible
Summary:
The old code made some incorrect assumptions about the order in which
basic blocks are laid out in a function. This could lead to incorrect
early-exits, especially when kills occurred inside of loops.

The new approach is to check whether the point where the conditional
kill occurs dominates all reachable code. If that is the case, there
cannot be any other threads in the wave that are waiting to rejoin
at a later point in the CFG, i.e. if exec=0 at that point, then all
threads really are dead and we can exit the wave.

Make some other minor cleanups to the pass while we're at it.

v2: preserve the dominator tree

Reviewers: arsenm, cdevadas, foad, critson

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74908

Change-Id: Ia0d2b113ac944ad642d1c622b6da1b20aa1aabcc
2020-02-26 15:30:42 +01:00
Jay Foad 80d7e473e0 AMDGPU: Fix some incorrect FUNC-LABEL checks 2020-02-26 09:43:13 +00:00
Jay Foad c66db21165 AMDGPU/GlobalISel: Un-XFAIL a test
This was missed in 12fe9b26ec
2020-02-25 16:46:46 +00:00
Matt Arsenault 86e13ec194 AMDGPU/GlobalISel: Use packed for G_ADD/G_SUB/G_MUL v2s16 2020-02-25 11:20:35 -05:00
Jay Foad 33cbd5ee08 AMDGPU/GlobalISel: Legalize s64 min/max by lowering
Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75108
2020-02-25 16:00:43 +00:00
Jay Foad ab96ec41ea [AMDGPU] Precommit some test updates for D68338 "Remove dubious logic in bidirectional list scheduler" 2020-02-25 14:51:42 +00:00
Jay Foad dc78190811 AMDGPU/GlobalISel: add legalize tests for s64 max/min 2020-02-25 09:49:19 +00:00
Matt Arsenault fee41517fe AMDGPU/GlobalISel: Introduce post-legalize combiner
The current set of custom combines are only really useful after
legalization, so move them there. There is a lot of overlap in the
boilerplate here, but I think we do want a pretty different set of
combines before and after legalize. I think we will want a lot of
overlap between the post-legalize and a post-regbankselect combiner.
2020-02-24 22:12:12 -05:00
Matt Arsenault 0b46b078b6 AMDGPU/GlobalISel: Fix incorrect VOP3P fneg folding
We use some s32 values in VOP3P operands, and won't see any
intervening casts from a 32-bit fneg. Make sure it's really a packed
fneg before folding.
2020-02-24 21:20:35 -05:00
Matt Arsenault 11e3dde625 GlobalISel: Reimplement fewerElementsVectorBasic
Changes the handling of odd breakdowns, and avoids using
G_EXTRACT/G_INSERT. Pad with undef to a wider size, and unmerge. Also
avoid introducing instructions for the fully undef components.
2020-02-24 21:19:47 -05:00
Jay Foad 0ed4744bb5 AMDGPU/GlobalISel: Lower 64-bit uaddo/usubo
Summary: Add more test cases for signed and unsigned add/sub with overflow.

Reviewers: arsenm, rampitec, kerbowa

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75051
2020-02-24 23:08:14 +00:00
Mark Searles d3e170c438 Revert "[AMDGPU] Don’t marke the .note section as ALLOC"
This reverts commit 977cd661cf.

It breaks OpenCL testing. OpenCL Runtime is using PT_LOAD information
to calculate memory for global variables. This commit should be relanded once
the OpenCL runtime stops relying on PT_LOAD information for calculating global
variable memory size.

Differential Revision: https://reviews.llvm.org/D74995
2020-02-21 16:08:30 -08:00
Jay Foad b72f1448ce AMDGPU/GlobalISel: Better code for one case of G_SHUFFLE_VECTOR on v2i16
Reviewers: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74987
2020-02-21 21:16:39 +00:00
Matt Arsenault 00955a62e4 AMDGPU/GlobalISel: Fix SALU mapping for v2s16 min/max
The legalizer helper functions are unusably awkward to perform the 3-5
part legalization. This needs to be widened, scalarized, lowered, and
we should avoid creating vector extends and truncates. Manually do all
of this and expand.
2020-02-21 14:02:16 -05:00
Matt Arsenault db06870dbd AMDGPU: Move dot intrinsic patterns to instruction def
I tried to use some of the new tablegen features to avoid creating
different operand list permutations, but I still don't see a way to
programmatically build a source pattern dag.

Also add GlobalISel tests, which now all import successfully.

Some of the fneg fold tests are incorrect, which need to be fixed in a
future commit
2020-02-21 13:35:40 -05:00
Matt Arsenault 4c1c9422a3 AMDGPU/GlobalISel: Select llvm.amdgcn.fdot2
I'm slighly worried about the generated checks, since they won't catch
incorrect modifiers being added at the end of the line.
2020-02-21 13:35:40 -05:00
Matt Arsenault dfce5fd50a AMDGPU/GlobalISel: Select VOP3P instructions
This only handles the basic cases. More work is needed to make better
use of op_sel.
2020-02-21 13:35:40 -05:00
Matt Arsenault 72eef820d5 AMDGPU/GlobalISel: Select G_SHUFFLE_VECTOR
G_SHUFFLE_VECTOR is legal since it theoretically may help match op_sel
for VOP3P instructions. Expand it in some other way in case it doesn't
fold into the use instructions.
2020-02-21 13:35:40 -05:00
Jay Foad cab39e4b8c GlobalISel: Fix narrowing of (G_ASHR i64:x, 32)
Reviewers: arsenm

Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, volkan, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74950
2020-02-21 16:51:03 +00:00
Matt Arsenault 6a479220b5 AMDGPU/GlobalISel: Commit test changes I forgot to squash
These should have been in ac7abe0ba9
2020-02-21 11:43:39 -05:00
Matt Arsenault 043ed2e22a AMDGPU/GlobalISel: Fix xnor matching
We should try the generated matchers before the manual selection. This
means the patterns are now handling the common cases, but the manual
selection code is not yet dead. It's still handling the non-s32/s64
cases (like v2s16 and v2s32). Currently tablegen doesn't have a nice
way to have a single pattern that covers multiple types.
2020-02-21 11:42:49 -05:00
Matt Arsenault 89dc8fe622 AMDGPU/GlobalISel: Precommit xnor matching test 2020-02-21 11:09:59 -05:00
Matt Arsenault ac7abe0ba9 AMDGPU/GlobalISel: Manually select G_BUILD_VECTOR_TRUNC
We have patterns for s_pack* selection, but they assume the inputs are
a build_vector with 16-bit inputs, not a truncating build
vector. Since there's still outstanding work for how to handle
mismatched result and source element vector operations, and since I'm
trying a different packed vector strategy than SelectionDAG, just
manually select this for now.
2020-02-21 10:34:11 -05:00
Matt Arsenault 79ff188add AMDGPU/GlobalISel: Legalize G_FPOW
There are few differences from the DAG handling. First, the DAG
handling uses a primitive selection pattern instead of custom
legalizing it. Because of this, this makes use of source modifiers
while the DAG does not.

Also instead of promoting f16, try to use the f16 log/exp. There's no
f16 fmul_legacy, so widen just for the multiply, although I'm not sure
that's the best solution.
2020-02-21 10:31:13 -05:00
Matt Arsenault fab4cdea39 AMDGPU/GlobalISel: Select llvm.amdgcn.fmul.legacy 2020-02-21 10:30:26 -05:00
Matt Arsenault b64aa8c715 AMDGPU/GlobalISel: Fix constant bus violation with source modifiers
This looked through copies to find the source modifiers, which may
have been SGPR->VGPR copies added to avoid potential constant bus
violations. Re-insert a copy to a VGPR if this happens.
2020-02-21 10:30:23 -05:00
Nicolai Hähnle 32e4e71966 test/CodeGen/AMDGPU: Add a test case that shows a miscompilation
Related to https://reviews.llvm.org/D74908

Change-Id: I6ebf3b5c7a32493016994f30d6796c41e95aecde
2020-02-21 13:38:24 +01:00
Simon Pilgrim fc2b4a02b1 [DAGCombine] visitEXTRACT_VECTOR_ELT - add SimplifyDemandedBits multi use support
Similar to what we already do with SimplifyDemandedVectorElts, call SimplifyDemandedBits across all the extracted elements of the source vector, treating it as single use.

There's a minor regression in store-weird-sizes.ll which will be addressed in an upcoming SimplifyDemandedBits patch.
2020-02-20 15:49:38 +00:00
Matt Arsenault 083717cf49 AMDGPU: Fix v2i64<->v4f32 bitcast
I'm not sure how to test the v2i64->v4f32 case since I can't think of
any v2i64 cases that won't legalize to v4i32.
2020-02-20 09:49:09 -05:00
Sebastian Neubauer 977cd661cf [AMDGPU] Don’t marke the .note section as ALLOC
Marking a section as ALLOC tells the ELF loader to load the section into memory.
As we do not want to load the notes into VRAM, the flag should not be there.

Differential Revision: https://reviews.llvm.org/D74600
2020-02-20 15:14:48 +01:00
Simon Pilgrim 6085593c12 [AMDGPU] simplifyI24 - replace GetDemandedBits with SimplifyMultipleUseDemandedBits
GetDemandedBits mostly just calls SimplifyMultipleUseDemandedBits now, but it does a very blunt constant simplification that SimplifyMultipleUseDemandedBits avoids.

If we need to demand bits from constants we should handle this through ShrinkDemandedConstant/targetShrinkDemandedConstant.

@arsenm confirmed that the sign extended immediates are better for code size.

Differential Revision: https://reviews.llvm.org/D74857
2020-02-20 12:03:08 +00:00
dfukalov dbfc682e2b SpeculativeExecution: fixed ingoring free execution
Summary:
After updating cost model in AMDGPU target (47a5c36b37) the pass started to
ignore some BBs since they got all instructions estimated as free.

Reviewers: arsenm, chandlerc, nhaehnle

Reviewed By: nhaehnle

Subscribers: jvesely, wdng, nhaehnle, tpr, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74825
2020-02-20 14:45:02 +03:00
Matt Arsenault 4bb0c8f91c AMDGPU: Enable integer division bypass
We probably want this, and I've meant to turn this on for a long
time. SC actually emits a special case to early-out for a 1
denominator, which perhaps should also be considered.
2020-02-19 17:50:19 -05:00
Matt Arsenault 0b6ead018a AMDGPU/GlobalISel: Cleanup min/max RegBankSelect tests
Use common check prefix, although update_mir_test_checks makes this
unnecessarily annoying. Also make sure to have uses in case that ever
ends up mattering.
2020-02-19 17:32:25 -05:00
Simon Pilgrim 025ff5a4ea [AMDGPU] Regenerate immediate constant tests 2020-02-19 18:58:44 +00:00
Matt Arsenault ff4639f060 AMDGPU/GlobalISel: Select MUBUF path for global atomic cmpxchg
I'm not sure why this isn't a pattern, but the DAG manually selects
this.
2020-02-19 06:19:22 -08:00
Simon Pilgrim 4af8db317d [AMDGPU] performCvtF32UByteNCombine - add SHL and SimplifyMultipleUseDemandedBits support
This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits.

Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively.

We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer.

Differential Revision: https://reviews.llvm.org/D74786
2020-02-19 11:45:57 +00:00
Matt Arsenault 37c452a289 AMDGPU/GlobalISel: Adjust branch target when lowering loop intrinsic
This needs to steal the branch target like the other control flow
intrinsics.
2020-02-18 06:35:40 -08:00
Matt Arsenault 5e8792453d AMDGPU/GlobalISel: Fix RegBankSelect for G_SHUFFLE_VECTOR 2020-02-17 15:11:25 -05:00
Matt Arsenault f742a28ae3 AMDGPU/GlobalISel: Custom lower 32-bit G_SDIV/G_SREM 2020-02-17 15:09:51 -05:00
Matt Arsenault e240b27d6d AMDGPU/GlobalISel: Allow arbitrary global values
Treat unknown address spaces as global
2020-02-17 11:32:28 -08:00
Matt Arsenault 54137bbaaf GlobalISel: Allow running localizer earlier
This required legal and regbankselected MIR for seemingly no
reason. For AMDGPU this wouldn't see legalized G_GLOBAL_VALUEs.
2020-02-17 11:24:06 -08:00
Matt Arsenault 96db12d507 AMDGPU/GlobalISel: Custom lower 32-bit G_UDIV/G_UREM
AMDGPUCodeGenPrepare expands this most of the time, but not always. We
will always at least need a fallback option here. This is the 3rd
implementation of the same expansion in the backend. Eventually I
would like to eliminate the IR expansion (and the DAG version
obviously).

Currently the new legalizer path produces a better result, since the
IR expansion results in extra operations which need to be combined
out. Notably, the IR expansion results in multiplies by 0.
2020-02-17 11:05:50 -08:00
Matt Arsenault 0e2eb357e0 GlobalISel: Extend narrowing to G_ASHR 2020-02-17 10:42:59 -08:00
Matt Arsenault 8550859535 GlobalISel: Extend shift narrowing to G_SHL 2020-02-17 09:13:37 -08:00
Matt Arsenault 78d455adf0 GlobalISel: Add combine to narrow G_LSHR
Produce an unmerge to a narrower type and introduce a narrower shift
if needed. I wasn't sure if there was a better way to parameterize the
target's preferred shift type for the GICombineRule, so manually call
the combine helper.
2020-02-17 08:04:52 -08:00
Matt Arsenault 86813e2768 AMDGPU/GlobalISel: Select llvm.amdgcn.s.buffer.load
Doesn't try to fail on the dlc bit pre-gfx10 like the DAG lowering
does.
2020-02-17 08:02:40 -08:00
Matt Arsenault 5fdc9851d0 AMDGPU/GlobalISel: Run the localizer pass
While looking at the output on real sized programs, there is a lot of
extra SGPR spilling compared to the DAG path. This seems to largely be
from all constants being SGPRs in the entry block.
2020-02-17 07:38:12 -08:00
Matt Arsenault e5805529bf AMDGPU/GlobalISel: Select v2s32->v2s16 G_TRUNC
It would be nice if there was a way to avoid the tied operand, but as
far as I can tell there isn't a way to use or with op_sel to achieve
this
2020-02-17 09:20:13 -05:00
Matt Arsenault 361f2a7818 AMDGPU/GlobalISel: Handle sbfe/ubfe intrinsic
Try to handle arbitrary scalar BFEs by packing the operands. The DAG
gives up on non-constant arguments. We're still missing any constant
folding, so we end up with pretty ugly code most of the time. Also
handle the 64-bit scalar case, which the DAG doesn't try to do.
2020-02-17 09:20:13 -05:00
Tim Renouf 1e926a9f9c [AMDGPU] Fix some tests that did not specify -mcpu
Summary:
This fixes some tests that did not specify -mcpu. Doing that disables
all subtarget features, which gives behavior that (a) does not
necessarily correspond to any actual target, and (b) can change as we
add new subtarget features.

Also added gfx1010 to memtime test.

Differential Revision: https://reviews.llvm.org/D74594

Change-Id: I8c0fe4fa03e9a93ef8bb722cd42d22e064526309
2020-02-17 14:02:32 +00:00
Matt Arsenault 295bbea3ed AMDGPU/GlobalISel: Fix non-power-of-2 G_SITOFP/G_UITOFP
This wouldn't work for s33-s63 sources.
2020-02-16 22:48:57 -05:00
Matt Arsenault 24c156194b AMDGPU/GlobalISel: Add some missing tests for non-power-of-2 cases 2020-02-16 22:48:42 -05:00
Matt Arsenault 8d8d46b57a AMDGPU/GlobalISel: Fix missing impdef of scc on boolean bit ops 2020-02-14 22:35:30 -05:00
Matt Arsenault 65dbdc329f AMDGPU: Don't preserve analyses with div64 IR expansion
The dominator tree needs to be updated, but that isn't handled now.
2020-02-14 20:06:02 -05:00
Matt Arsenault dc3e499dd4 AMDGPU/GlobalISel: Fix G_EXTRACT of 96-bit results
This would assert on an unhandled size in getRegSplitParts.
2020-02-14 15:57:40 -08:00
Matt Arsenault 630b47e518 AMDGPU: Use generated checks for memcpy expansion 2020-02-14 15:57:40 -08:00
Matt Arsenault 60fea2713d AMDGPU/GlobalISel: Improve 16-bit bswap
Match the new DAG behavior and use v_perm_b32 when available. Also
does better on SI/CI by expanding 16-bit swaps. Also fix
non-power-of-2 cases.
2020-02-14 15:57:39 -08:00
Matt Arsenault 9ec668606b AMDGPU: Add option to disable CGP division expansion
The division expansions in AMDGPUCodeGenPrepare can't be relied on for
correctness, since they punt to later optimization and possibly
legalization in some cases. We still need a way to be able to write
tests for the legalizer versions of the expansion. This is mostly for
GlobalISel, since the expected optimzations is expecting aren't
implemented.

The interaction with the flag to expand 64-bit division in the IR is
pretty confusing, but these flags have different purposes.
2020-02-14 11:37:07 -08:00
Matt Arsenault 34d9a16e54 AMDGPU: Add option to expand 64-bit integer division in IR
I didn't realize we were already expanding 24/32-bit division here
already. Use the available IntegerDivision utilities. This uses loops,
so produces significantly smaller code than the inline DAG expansion.

This now requires width reductions of 64-bit divisions before
introducing the expanded loops.

This helps work around missing legalization in GlobalISel for
division, which are the only remaining core instructions that didn't
work at all.

I think this is plausibly a better implementation than exists in the
DAG, although turning it on by default misses out on the constant
value optimizations and also needs benchmarking.
2020-02-14 11:16:08 -08:00
Matt Arsenault bfbfa18591 GlobalISel: Lower s64->s16 G_FPTRUNC
This is more or less directly ported from the AMDGPU custom lowering
for FP_TO_FP16. I made a few minor fixups (using G_UNMERGE_VALUES
instead of creating shift/trunc to extract the two halves, and zexting
an inverted compare instead of select_cc).

This also does not include the fast math expansion the DAG which
converts to f32 and then to f16. I think that belongs in a
pre-legalize combine instead.
2020-02-14 10:46:58 -08:00
Matt Arsenault 8c2c0b3637 AMDGPU: Improve i16/v2i16 bswap 2020-02-14 09:53:22 -08:00
Matt Arsenault e0fd2d6d62 AMDGPU: Add baseline tests for 16-bit bswap 2020-02-14 09:34:13 -08:00
Matt Arsenault a257bde420 AMDGPU/GlobalISel: Handle G_BSWAP 2020-02-14 09:09:44 -08:00
Matt Arsenault 5adbf7d57f AMDGPU/GlobalISel: Make G_TRUNC legal
This is required to be legal. I'm not sure how we were getting away
without defining any rules for it.
2020-02-13 15:25:52 -05:00
Matt Arsenault cfa60ff2c7 AMDGPU/GlobalISel: Add missing tests for cmpxchg selection 2020-02-13 10:26:55 -08:00
Yuanfang Chen 4ad7685258 Revert "Revert "Reland "[Support] make report_fatal_error `abort` instead of `exit`"""
This reverts commit 80a34ae311 with fixes.

Previously, since bots turning on EXPENSIVE_CHECKS are essentially turning on
MachineVerifierPass by default on X86 and the fact that
inline-asm-avx-v-constraint-32bit.ll and inline-asm-avx512vl-v-constraint-32bit.ll
are not expected to generate functioning machine code, this would go
down to `report_fatal_error` in MachineVerifierPass. Here passing
`-verify-machineinstrs=0` to make the intent explicit.
2020-02-13 10:16:06 -08:00
Yuanfang Chen 17122ec10a Revert "Revert "Revert "Reland "[Support] make report_fatal_error `abort` instead of `exit`""""
This reverts commit bb51d24330.
2020-02-13 10:08:05 -08:00
Yuanfang Chen bb51d24330 Revert "Revert "Reland "[Support] make report_fatal_error `abort` instead of `exit`"""
This reverts commit 80a34ae311 with fixes.

On bots llvm-clang-x86_64-expensive-checks-ubuntu and
llvm-clang-x86_64-expensive-checks-debian only,
llc returns 0 for these two tests unexpectedly. I tweaked the RUN line a little
bit in the hope that LIT is the culprit since this change is not in the
codepath these tests are testing.
llvm\test\CodeGen\X86\inline-asm-avx-v-constraint-32bit.ll
llvm\test\CodeGen\X86\inline-asm-avx512vl-v-constraint-32bit.ll
2020-02-13 10:02:53 -08:00
Matt Arsenault bfe3779459 AMDGPU: Use v_perm_b32 to implement bswap
Also greatly improve i64 lowering. LegalizeIntegerTypes does the
correct narrowing if i64 isn't legal. Just workaround this for
SelectionDAG by making i64 legal and splitting in the patterns.
2020-02-13 09:45:31 -08:00
Austin Kerbow 5db0b2521c [AMDGPU][GlobalISel] Handle 64byte EltSIze in getRegSplitParts
Reviewers: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, dstuttard, tpr, t-tye, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74518
2020-02-12 19:11:52 -08:00
Matt Arsenault d1b393d92c AMDGPU/GlobalISel: Select G_CTTZ_ZERO_UNDEF
Directly select this rather than going through the intermediate
instruction, which may provide some combine value in the future.
2020-02-12 16:19:46 -08:00
Matt Arsenault 045a8921d7 AMDGPU/GlobalISel: Select G_CTLZ_ZERO_UNDEF
Directly select this rather than going through the intermediate
instruction, which may provide some combine value in the future.
2020-02-12 16:19:45 -08:00
Matt Arsenault e174c278ca AMDGPU/GlobalISel: Fix mapping G_ICMP with constrained result
When SI_IF is inserted, it constrains the source register with a
register class, which was quite likely a G_ICMP. This was incorrectly
treating it as a scalar, and then applyMappingImpl would end up
producing invalid MIR since this was unexpected.

Also fix not using all VGPR sources for vcc outputs.
2020-02-12 16:19:45 -08:00
Matt Arsenault b99f45574c AMDGPU/GlobalISel: Split 96-bit load/store select tests out
These are only legal on CI+. The test would fail in a debug build, but
not a release due to the partial selection since the pre-selection
legality assert only happens in a debug build.
2020-02-12 09:58:37 -05:00
Matt Arsenault fa61e200e5 AMDGPU/GlobalISel: Widen non-power-of-2 load results
Load extra bits if suitably aligned. This allows using widened
3-vector loads on SI, and fixes legalization for <9 x s32> (which LSV
apparently forms frequently on lowered kernel argument lists).

Fix incorrectly treating these as legal on SI. This should emit a
64-bit store and a 32-bit store.

I think all of the load and store rules are just about complete, but
due for a rewrite.
2020-02-12 09:35:10 -05:00
Nicolai Hähnle 07a5b849f7 SelectionDAG: Fix bug in ClusterNeighboringLoads
Summary:
The method attempts to find loads that can be legally clustered by
looking for loads consuming the same chain glue token.

However, the old code looks at _all_ users of values produced by the
chain node -- including uses of the loaded/returned value of volatile
loads or atomics. This could lead to circular dependencies which then
failed during scheduling.

With this change, we filter out users by getResNo, i.e. by which
SDValue value they use, to ensure that we only look at users of the
chain glue token.

This appears to be a rather old bug, which is perhaps surprising.
However, the test case is actually quite fragile (i.e., it is hidden
by fairly small changes), and the test _must_ use volatile loads for
the bug to manifest.

Reviewers: arsenm, bogner, craig.topper, foad

Subscribers: MatzeB, jvesely, wdng, hiraditya, javed.absar, jfb, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74253
2020-02-12 09:12:55 +01:00
Yuanfang Chen 80a34ae311 Revert "Reland "[Support] make report_fatal_error `abort` instead of `exit`""
This reverts commit rGcd5b308b828e, rGcd5b308b828e, rG8cedf0e2994c.

There are issues to be investigated for polly bots and bots turning on
EXPENSIVE_CHECKS.
2020-02-11 20:41:53 -08:00
Matt Arsenault 6d4ebada79 AMDGPU: Use conditions directly in division expansion
This was creating a select on true/false values, and then comparing
that later. This produced more work for later combines, which can be
avoided by just using the boolean values. This was copied from the
original DAG expansion, which also has the same problem. This doesn't
have a observable change using SelectionDAG, but since GlobalISel is
missing these optimizations, the final code was noticeably longer.
2020-02-11 23:11:30 -05:00
Yuanfang Chen 8cedf0e299 Reland "[Support] make report_fatal_error `abort` instead of `exit`"
Summary:
Reland D67847 after D73742 is committed. Replace `sys::Process::Exit(1)`
with `abort` in `report_fatal_error`.

After this patch, for tools turning on `CrashRecoveryContext`,
crash handler installed by `CrashRecoveryContext` is called unless
they installed a non-returning handler using `llvm::install_fatal_error_handler`
like `cc1_main` currently does.

Reviewers: rnk, MaskRay, aganea, hans, espindola, jhenderson

Subscribers: jholewinski, qcolombet, dschuff, jyknight, emaste, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, zzheng, edward-jones, atanasyan, steven_wu, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, rupprecht, jocewei, jsji, Jim, dmgreen, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D74456
2020-02-11 18:20:40 -08:00
Austin Kerbow 3a312c3ee5 [AMDGPU][GlobalISel] Refactor selectDS1Addr1Offset/selectDS64Bit4ByteAligned
Differential Revision: https://reviews.llvm.org/D74261
2020-02-11 16:57:13 -08:00
Matt Arsenault b30e122333 AMDGPU: Don't expand more special div cases in IR
These have nicer expansions implemented in the DAG. Ideally we would
either directly implement all of these special expansions, or stop
expanding division in the IR.
2020-02-11 19:01:06 -05:00
Matt Arsenault 86f9117d47 AMDGPU: Don't report 2-byte alignment as fast
This is apparently worse than 1-byte alignment. This does not attempt
to decompose 2-byte aligned wide stores, but will stop trying to
produce them.

Also fix bug in LoadStoreVectorizer which was decreasing the alignment
and vectorizing stack accesses. It was assuming a stack object was an
alloca that could have its base alignment changed, which is not true
if the pointer is derived from a function argument.
2020-02-11 18:35:00 -05:00
Matt Arsenault d3a96fc082 AMDGPU: Add baseline tests for CGP div expansion
These cases are harmed by expanding division early in the IR, before
DAGCombiner.
2020-02-11 18:11:39 -05:00
Matt Arsenault f734ce0488 AMDGPU: Fix crash on v3i15 kernel arguments
This was split into 3 i15 arguments. The i15 piece needs to be rounded
to a simple MVT for the memory type.
2020-02-11 18:11:39 -05:00
Matt Arsenault 92c62582fc AMDGPU: Directly use rcp intrinsic in idiv expansions
Since natural fdiv lowering is now more conservative even with
denormals disabled, we get a slower expansion from just a plain
1.0/fdiv. Directly emit the rcp intrinsic when using it to implement
integer division to avoid a pointlessly complex sequence.
2020-02-11 18:11:39 -05:00
Stanislav Mekhanoshin d538dc05f3 [AMDGPU] Fixed subreg use in sdwa-scalar-ops.mir. NFC 2020-02-11 14:27:17 -08:00
Matt Arsenault 7af7b96a9b AMDGPU: Move R600 test compatability hack
Instead of handling the r600 intrinsics on amdgcn, handle the amdgcn
intrinsics on r600.
2020-02-10 10:02:06 -08:00
Sebastian Neubauer 7cddd15e56 [SelectionDAG] Optimize build_vector of truncates and shifts
Add a simplification to fuse a manual vector extract with shifts and
truncate into a bitcast.

Unpacking and packing values into vectors is only optimized with
extractelement instructions, not when manually unpacked using shifts
and truncates.
This patch simplifies shifts and truncates into a bitcast if possible.

Simplify (build_vec (trunc $1)
                    (trunc (srl $1 width))
                    (trunc (srl $1 (2 * width))) ...)
to (bitcast $1)

Differential Revision: https://reviews.llvm.org/D73892
2020-02-10 15:04:07 +01:00
Sebastian Neubauer 8756869170 [AMDGPU] Add a16 feature to gfx10
Based on D72931

This adds a new feature called A16 which is enabled for gfx10.
gfx9 keeps the R128A16 feature so it can share all the instruction encodings
with gfx7/8.

Differential Revision: https://reviews.llvm.org/D73956
2020-02-10 09:04:23 +01:00
Matt Arsenault 312a9d1b83 GlobalISel: Fix narrowScalar for G_{CTLZ|CTTZ}_ZERO_UNDEF
Narrow these for 64-bit VALU for AMDGPU.
2020-02-09 19:02:38 -05:00
Matt Arsenault c437f6c687 AMDGPU/GlobalISel: Split 64-bit G_CTPOP in RegBankSelect 2020-02-09 18:39:33 -05:00
Matt Arsenault 2126c70e3a AMDGPU/GlobalISel: Don't mis-select vector index on a constant
Vector indexing with a constant index should be folded out in the
legalizer, but this was accidentally falling through. This would
produce the indexing operation with $noreg. Handle this case as a
dynamic index just in case a bug like this happens again in the
future.
2020-02-09 18:02:37 -05:00
Matt Arsenault f4a38c114e AMDGPU/GlobalISel: Look through casts when legalizing vector indexing
We were failing to find constants that were casted. I feel like the
artifact combiner should have folded the constant in the trunc before
the custom lowering, but that doesn't happen.
2020-02-09 18:02:10 -05:00
Matt Arsenault 6e1770821f AMDGPU: Fix SI_IF lowering when the save exec reg has terminator uses
Reverts part of 6524a7a2b9. Since that
commit, the expansion was ignoring the actual save exec register
produced by the instruction, and looking at other instructions. I do
not understand why it was looking at other instructions, but relying
on this scan was wrong.

Fixes verifier errors after SI_IF is tail duplicated, which should be
correct to do. The results were fed into a phi, which was lowered to
the S_MOV_B64_term instructions.
2020-02-09 17:59:19 -05:00
Craig Topper 2af1640f9a [LegalizeDAG][X86][AMDGPU] Use ANY_EXTEND instead of ZERO_EXTEND when promoting ISD::CTTZ/CTTZ_ZERO_UNDEF.
Summary:
For CTTZ we place a set bit just past where the non-promoted type
stopped so the extended bits won't be used for the count. For
CTTZ_ZERO_UNDEF we don't care what happens if no bits are set in
the original type and we end up counting into the extended bits.
So we can just use ANY_EXTEND for both cases.

This matches what is done in type legalization for these operations.
We make no effort to force the upper bits to zero.

Differential Revision: https://reviews.llvm.org/D74111
2020-02-07 22:25:56 -08:00
Changpeng Fang 884acbb9e1 AMDGPU: Enhancement on FDIV lowering in AMDGPUCodeGenPrepare
Summary:
  The accuracy limit to use rcp is adjusted to 1.0 ulp from 2.5 ulp.
Also, afn instead of arcp is used to allow inaccurate rcp to be used.

Reviewers:
  arsenm

Differential Revision: https://reviews.llvm.org/D73588
2020-02-07 11:46:23 -08:00
Changpeng Fang 6370c7c13e AMDGPU: Limit the search in finding the instruction pattern for v_swap generation.
Summary:
  Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to a O(N^3) in compile time for SIShrinkInstructions.

But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
a linear compile time comsumption.

Reviewers:
  rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D74180
2020-02-07 11:06:33 -08:00
Matt Arsenault cbe0c8299e AMDGPU/GlobalISel: Fix missing test for select of s64 scalar G_CTPOP 2020-02-07 13:15:48 -05:00
Matt Arsenault 2f885cbe90 AMDGPU/GlobalISel: Fix move s.buffer.load to VALU
We were executing this in a waterfall loop as a placeholder, but this
should really be converted to a MUBUF load. Also execute in a
waterfall loop if the resource isn't an SGPR. This is a case where the
DAG handling was wrong because doing the right thing was too hard.

Currently, this will mishandle 96-bit loads. There's currently no way
to track the original memory size with an MMO, so these loads will be
widened andd the resulting memory size will be 128-bits.
2020-02-07 07:19:01 -08:00
Matt Arsenault 8de2dad9e0 GlobalISel: Fix lowering of G_CTLZ/G_CTTZ
The type passed to lower was invalid, so I'm not sure how this was
even working before. The source and destination type also do not have
to match, so make sure to use the right ones.
2020-02-07 06:54:12 -08:00
Matt Arsenault 6a570dc548 AMDGPU/GlobalISel: Fix non-pow-2 add/sub/mul for 16-bit insts
These wouldn't legalize between 16-bits and 32-bits on targets with
16-bit instructions.
2020-02-06 21:43:54 -05:00
Stanislav Mekhanoshin 2863c26968 Revert "AMDGPU: Limit the search in finding the instruction pattern for v_swap generation."
This reverts commit 9827806481.
2020-02-06 17:38:55 -08:00
Changpeng Fang 9827806481 AMDGPU: Limit the search in finding the instruction pattern for v_swap generation.
Summary:
  Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to a O(N^3) in compile time for SIShrinkInstructions.

But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
a linear compile time comsumption.

Reviewers:
  rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D74180
2020-02-06 16:40:21 -08:00
Matt Arsenault 9087ef0765 GlobalISel: Allow CSE of G_IMPLICIT_DEF
The legalizer produces a lot of these, and they make reading legalized
MIR annoying. For some reason, this does seem to sometimes introduce
copies of implicit def, which is dumb.
2020-02-05 17:47:21 -05:00
Matt Arsenault baafe82b07 AMDGPU/GlobalISel: Remove bitcast legality hack 2020-02-05 16:24:24 -05:00
Matt Arsenault 364326ce66 AMDGPU/GlobalISel: Add mem operand to s.buffer.load intrinsic
Really the intrinsic definition is wrong, but work around this
here. The DAG lowering introduces an MMO. We have to introduce a new
operation to avoid the verifier complaining about the missing mayLoad.
2020-02-05 15:04:42 -05:00
Matt Arsenault 5aa6e246a1 AMDGPU/GlobalISel: Legalize f64 G_FFLOOR for SI
Use cmp ord instead of cmp_class compared to the DAG version for the
nan check, but mostly try to match the existsing pattern.

I think the sign doesn't matter for fract, so we could do a little
better with the source modifier matching.

I think this is also still broken as in D22898, but I'm leaving it
as-is for now while I don't have an SI system to test on.
2020-02-05 14:32:01 -05:00
Matt Arsenault 7bffa97285 AMDGPU/GlobalISel: Prefer merge/unmerge ops to legalize TFE
These have a better chance of combining with other operations and are
currently much better supported than G_EXTRACT.
2020-02-05 12:56:10 -05:00
Matt Arsenault e65e6d052e AMDGPU/GlobalISel: Legalize TFE image result loads
Rewrite the result register pair into the expected sinigle register
format in the legalizer.

I'm also operating under the assumption that TFE doesn't apply to
stores or atomics, but don't know if this is true or not.
2020-02-05 12:40:20 -05:00
Matt Arsenault 69cc9f3046 AMDGPU/GlobalISel: Legalize llvm.amdgcn.s.buffer.load
The 96-bit results need to be widened.

I find the interaction between LegalizerHelper and MIRBuilder somewhat
awkward. The custom legalization is called by the LegalizerHelper, but
then does not have access to the helper. You have to construct a new
helper, which then does not own the MachineIRBuilder, but does modify
it. Maybe custom legalization should be passed the helper?
2020-02-05 12:01:34 -05:00
Matt Arsenault dfa9420f09 AMDGPU/GlobalISel: Don't use legal v2s16 G_BUILD_VECTOR
If we have s_pack_* instructions, legalize this to
G_BUILD_VECTOR_TRUNC from s32 elements. This is closer to how how the
s_pack_* instructions really behave.

If we don't have s_pack_ instructions, expand this by creating a merge
to s32 and bitcasting. This expands to the expected bit operations. I
think this eventually should go in a new bitcast legalize action type
in LegalizerHelper.

We already directly emit the shift operations in RegBankSelect for the
vector case. This could possibly be cleaned up, but I also may want to
defer doing this expansion to selection anyway. I'll see about that
when I try to actually match VOP3P instructions.

This breaks the selection of the build_vector since tablegen doesn't
know how to match G_BUILD_VECTOR_TRUNC yet, so just xfail it for now.
2020-02-05 11:52:18 -05:00
Sebastian Neubauer 163e33b290 [AMDGPU] Fix lowering a16 image intrinsics
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers

Differential Revision: https://reviews.llvm.org/D73482
2020-02-05 10:54:34 +01:00
Sebastian Neubauer 3bc7ffdaab [AMDGPU] Use v3f32 type in image instructions
This should lower the amount of used registers for gfx9.

I updated some of the changed tests with the update script because
changing them by hand is tedious.

Differential Revision: https://reviews.llvm.org/D73884
2020-02-05 10:35:41 +01:00
Jan Vesely e6686adf8a AMDGPU/EG,CM: Implement fsqrt using recip(rsqrt(x)) instead of x * rsqrt(x)
The old version might be faster on EG (RECIP_IEEE is Trans only),
but it'd need extra corner case checks.
This gives correct corner case behaviour and saves a register.
Fixes OCL CTS sqrt test (1-thread, scalar) on Turks.

Reviewer: arsenm
Differential Revision: https://reviews.llvm.org/D74017
2020-02-05 00:24:07 -05:00
Matt Arsenault 12fe9b26ec AMDGPU/GlobalISel: Select G_SEXT_INREG 2020-02-04 13:23:53 -08:00
Matt Arsenault 0693e827ed AMDGPU/GlobalISel: Do a better job splitting 64-bit G_SEXT_INREG
We don't need to expand to full shifts for the > 32-bit case. This
just switches to a sext_inreg of the high half.
2020-02-04 13:23:53 -08:00
Matt Arsenault 05f2a04ba7 AMDGPU/GlobalISel: Legalize G_SEXT_INREG
Split the VALU 64-bit case in RegBankSelect.
2020-02-04 13:23:53 -08:00
Austin Kerbow 0f116fd9d8 [AMDGPU] Fix infinite loop with fma combines
https://reviews.llvm.org/D72312 introduced an infinite loop which involves
DAGCombiner::visitFMA and AMDGPUTargetLowering::performFNegCombine.

fma( a, fneg(b), fneg(c) ) => fneg( fma (a, b, c) ) => fma( a, fneg(b), fneg(c) ) ...

This only breaks with types where 'isFNegFree' returns flase, e.g. v4f32.
Reproducing the issue also needs the attribute 'no-signed-zeros-fp-math',
and no source mods allowed on one of the users of the Op.

This fix makes changes to indicate that it is not free to negate a fma if it
has users with source mods.

Differential Revision: https://reviews.llvm.org/D73939
2020-02-04 13:11:09 -08:00
Matt Arsenault 9b0ce8edfa AMDGPU/GlobalISel: Remove extension legality hacks
The legalization has improved since this was added, and the tests
relying on this no longer need it.
2020-02-04 12:50:47 -08:00
Matt Arsenault 5d2749938c AMDGPU/GlobalISel: Custom lower G_FEXP 2020-02-04 11:50:55 -08:00
Matt Arsenault b461436d01 AMDGPU/GlobalISel: Legalize s16 G_FEXP2 2020-02-04 11:50:55 -08:00
Matt Arsenault 1024b73ef5 AMDGPU: Split denormal mode tracking bits
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.

This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
2020-02-04 10:44:21 -08:00