Commit Graph

3151 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin d538dc05f3 [AMDGPU] Fixed subreg use in sdwa-scalar-ops.mir. NFC 2020-02-11 14:27:17 -08:00
Matt Arsenault 7af7b96a9b AMDGPU: Move R600 test compatability hack
Instead of handling the r600 intrinsics on amdgcn, handle the amdgcn
intrinsics on r600.
2020-02-10 10:02:06 -08:00
Sebastian Neubauer 7cddd15e56 [SelectionDAG] Optimize build_vector of truncates and shifts
Add a simplification to fuse a manual vector extract with shifts and
truncate into a bitcast.

Unpacking and packing values into vectors is only optimized with
extractelement instructions, not when manually unpacked using shifts
and truncates.
This patch simplifies shifts and truncates into a bitcast if possible.

Simplify (build_vec (trunc $1)
                    (trunc (srl $1 width))
                    (trunc (srl $1 (2 * width))) ...)
to (bitcast $1)

Differential Revision: https://reviews.llvm.org/D73892
2020-02-10 15:04:07 +01:00
Sebastian Neubauer 8756869170 [AMDGPU] Add a16 feature to gfx10
Based on D72931

This adds a new feature called A16 which is enabled for gfx10.
gfx9 keeps the R128A16 feature so it can share all the instruction encodings
with gfx7/8.

Differential Revision: https://reviews.llvm.org/D73956
2020-02-10 09:04:23 +01:00
Matt Arsenault 312a9d1b83 GlobalISel: Fix narrowScalar for G_{CTLZ|CTTZ}_ZERO_UNDEF
Narrow these for 64-bit VALU for AMDGPU.
2020-02-09 19:02:38 -05:00
Matt Arsenault c437f6c687 AMDGPU/GlobalISel: Split 64-bit G_CTPOP in RegBankSelect 2020-02-09 18:39:33 -05:00
Matt Arsenault 2126c70e3a AMDGPU/GlobalISel: Don't mis-select vector index on a constant
Vector indexing with a constant index should be folded out in the
legalizer, but this was accidentally falling through. This would
produce the indexing operation with $noreg. Handle this case as a
dynamic index just in case a bug like this happens again in the
future.
2020-02-09 18:02:37 -05:00
Matt Arsenault f4a38c114e AMDGPU/GlobalISel: Look through casts when legalizing vector indexing
We were failing to find constants that were casted. I feel like the
artifact combiner should have folded the constant in the trunc before
the custom lowering, but that doesn't happen.
2020-02-09 18:02:10 -05:00
Matt Arsenault 6e1770821f AMDGPU: Fix SI_IF lowering when the save exec reg has terminator uses
Reverts part of 6524a7a2b9. Since that
commit, the expansion was ignoring the actual save exec register
produced by the instruction, and looking at other instructions. I do
not understand why it was looking at other instructions, but relying
on this scan was wrong.

Fixes verifier errors after SI_IF is tail duplicated, which should be
correct to do. The results were fed into a phi, which was lowered to
the S_MOV_B64_term instructions.
2020-02-09 17:59:19 -05:00
Craig Topper 2af1640f9a [LegalizeDAG][X86][AMDGPU] Use ANY_EXTEND instead of ZERO_EXTEND when promoting ISD::CTTZ/CTTZ_ZERO_UNDEF.
Summary:
For CTTZ we place a set bit just past where the non-promoted type
stopped so the extended bits won't be used for the count. For
CTTZ_ZERO_UNDEF we don't care what happens if no bits are set in
the original type and we end up counting into the extended bits.
So we can just use ANY_EXTEND for both cases.

This matches what is done in type legalization for these operations.
We make no effort to force the upper bits to zero.

Differential Revision: https://reviews.llvm.org/D74111
2020-02-07 22:25:56 -08:00
Changpeng Fang 884acbb9e1 AMDGPU: Enhancement on FDIV lowering in AMDGPUCodeGenPrepare
Summary:
  The accuracy limit to use rcp is adjusted to 1.0 ulp from 2.5 ulp.
Also, afn instead of arcp is used to allow inaccurate rcp to be used.

Reviewers:
  arsenm

Differential Revision: https://reviews.llvm.org/D73588
2020-02-07 11:46:23 -08:00
Changpeng Fang 6370c7c13e AMDGPU: Limit the search in finding the instruction pattern for v_swap generation.
Summary:
  Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to a O(N^3) in compile time for SIShrinkInstructions.

But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
a linear compile time comsumption.

Reviewers:
  rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D74180
2020-02-07 11:06:33 -08:00
Matt Arsenault cbe0c8299e AMDGPU/GlobalISel: Fix missing test for select of s64 scalar G_CTPOP 2020-02-07 13:15:48 -05:00
Matt Arsenault 2f885cbe90 AMDGPU/GlobalISel: Fix move s.buffer.load to VALU
We were executing this in a waterfall loop as a placeholder, but this
should really be converted to a MUBUF load. Also execute in a
waterfall loop if the resource isn't an SGPR. This is a case where the
DAG handling was wrong because doing the right thing was too hard.

Currently, this will mishandle 96-bit loads. There's currently no way
to track the original memory size with an MMO, so these loads will be
widened andd the resulting memory size will be 128-bits.
2020-02-07 07:19:01 -08:00
Matt Arsenault 8de2dad9e0 GlobalISel: Fix lowering of G_CTLZ/G_CTTZ
The type passed to lower was invalid, so I'm not sure how this was
even working before. The source and destination type also do not have
to match, so make sure to use the right ones.
2020-02-07 06:54:12 -08:00
Matt Arsenault 6a570dc548 AMDGPU/GlobalISel: Fix non-pow-2 add/sub/mul for 16-bit insts
These wouldn't legalize between 16-bits and 32-bits on targets with
16-bit instructions.
2020-02-06 21:43:54 -05:00
Stanislav Mekhanoshin 2863c26968 Revert "AMDGPU: Limit the search in finding the instruction pattern for v_swap generation."
This reverts commit 9827806481.
2020-02-06 17:38:55 -08:00
Changpeng Fang 9827806481 AMDGPU: Limit the search in finding the instruction pattern for v_swap generation.
Summary:
  Current implementation of matchSwap in SIShrinkInstructions searches the entire
use_nodbg_operands set to find the possible pattern to generate v_swap instruction.
This approach will lead to a O(N^3) in compile time for SIShrinkInstructions.

But in reality, the matching pattern only exists within nearby instructions in the
same basic block. This work limits the search to a maximum of 16 instructions, and has
a linear compile time comsumption.

Reviewers:
  rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D74180
2020-02-06 16:40:21 -08:00
Matt Arsenault 9087ef0765 GlobalISel: Allow CSE of G_IMPLICIT_DEF
The legalizer produces a lot of these, and they make reading legalized
MIR annoying. For some reason, this does seem to sometimes introduce
copies of implicit def, which is dumb.
2020-02-05 17:47:21 -05:00
Matt Arsenault baafe82b07 AMDGPU/GlobalISel: Remove bitcast legality hack 2020-02-05 16:24:24 -05:00
Matt Arsenault 364326ce66 AMDGPU/GlobalISel: Add mem operand to s.buffer.load intrinsic
Really the intrinsic definition is wrong, but work around this
here. The DAG lowering introduces an MMO. We have to introduce a new
operation to avoid the verifier complaining about the missing mayLoad.
2020-02-05 15:04:42 -05:00
Matt Arsenault 5aa6e246a1 AMDGPU/GlobalISel: Legalize f64 G_FFLOOR for SI
Use cmp ord instead of cmp_class compared to the DAG version for the
nan check, but mostly try to match the existsing pattern.

I think the sign doesn't matter for fract, so we could do a little
better with the source modifier matching.

I think this is also still broken as in D22898, but I'm leaving it
as-is for now while I don't have an SI system to test on.
2020-02-05 14:32:01 -05:00
Matt Arsenault 7bffa97285 AMDGPU/GlobalISel: Prefer merge/unmerge ops to legalize TFE
These have a better chance of combining with other operations and are
currently much better supported than G_EXTRACT.
2020-02-05 12:56:10 -05:00
Matt Arsenault e65e6d052e AMDGPU/GlobalISel: Legalize TFE image result loads
Rewrite the result register pair into the expected sinigle register
format in the legalizer.

I'm also operating under the assumption that TFE doesn't apply to
stores or atomics, but don't know if this is true or not.
2020-02-05 12:40:20 -05:00
Matt Arsenault 69cc9f3046 AMDGPU/GlobalISel: Legalize llvm.amdgcn.s.buffer.load
The 96-bit results need to be widened.

I find the interaction between LegalizerHelper and MIRBuilder somewhat
awkward. The custom legalization is called by the LegalizerHelper, but
then does not have access to the helper. You have to construct a new
helper, which then does not own the MachineIRBuilder, but does modify
it. Maybe custom legalization should be passed the helper?
2020-02-05 12:01:34 -05:00
Matt Arsenault dfa9420f09 AMDGPU/GlobalISel: Don't use legal v2s16 G_BUILD_VECTOR
If we have s_pack_* instructions, legalize this to
G_BUILD_VECTOR_TRUNC from s32 elements. This is closer to how how the
s_pack_* instructions really behave.

If we don't have s_pack_ instructions, expand this by creating a merge
to s32 and bitcasting. This expands to the expected bit operations. I
think this eventually should go in a new bitcast legalize action type
in LegalizerHelper.

We already directly emit the shift operations in RegBankSelect for the
vector case. This could possibly be cleaned up, but I also may want to
defer doing this expansion to selection anyway. I'll see about that
when I try to actually match VOP3P instructions.

This breaks the selection of the build_vector since tablegen doesn't
know how to match G_BUILD_VECTOR_TRUNC yet, so just xfail it for now.
2020-02-05 11:52:18 -05:00
Sebastian Neubauer 163e33b290 [AMDGPU] Fix lowering a16 image intrinsics
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers

Differential Revision: https://reviews.llvm.org/D73482
2020-02-05 10:54:34 +01:00
Sebastian Neubauer 3bc7ffdaab [AMDGPU] Use v3f32 type in image instructions
This should lower the amount of used registers for gfx9.

I updated some of the changed tests with the update script because
changing them by hand is tedious.

Differential Revision: https://reviews.llvm.org/D73884
2020-02-05 10:35:41 +01:00
Jan Vesely e6686adf8a AMDGPU/EG,CM: Implement fsqrt using recip(rsqrt(x)) instead of x * rsqrt(x)
The old version might be faster on EG (RECIP_IEEE is Trans only),
but it'd need extra corner case checks.
This gives correct corner case behaviour and saves a register.
Fixes OCL CTS sqrt test (1-thread, scalar) on Turks.

Reviewer: arsenm
Differential Revision: https://reviews.llvm.org/D74017
2020-02-05 00:24:07 -05:00
Matt Arsenault 12fe9b26ec AMDGPU/GlobalISel: Select G_SEXT_INREG 2020-02-04 13:23:53 -08:00
Matt Arsenault 0693e827ed AMDGPU/GlobalISel: Do a better job splitting 64-bit G_SEXT_INREG
We don't need to expand to full shifts for the > 32-bit case. This
just switches to a sext_inreg of the high half.
2020-02-04 13:23:53 -08:00
Matt Arsenault 05f2a04ba7 AMDGPU/GlobalISel: Legalize G_SEXT_INREG
Split the VALU 64-bit case in RegBankSelect.
2020-02-04 13:23:53 -08:00
Austin Kerbow 0f116fd9d8 [AMDGPU] Fix infinite loop with fma combines
https://reviews.llvm.org/D72312 introduced an infinite loop which involves
DAGCombiner::visitFMA and AMDGPUTargetLowering::performFNegCombine.

fma( a, fneg(b), fneg(c) ) => fneg( fma (a, b, c) ) => fma( a, fneg(b), fneg(c) ) ...

This only breaks with types where 'isFNegFree' returns flase, e.g. v4f32.
Reproducing the issue also needs the attribute 'no-signed-zeros-fp-math',
and no source mods allowed on one of the users of the Op.

This fix makes changes to indicate that it is not free to negate a fma if it
has users with source mods.

Differential Revision: https://reviews.llvm.org/D73939
2020-02-04 13:11:09 -08:00
Matt Arsenault 9b0ce8edfa AMDGPU/GlobalISel: Remove extension legality hacks
The legalization has improved since this was added, and the tests
relying on this no longer need it.
2020-02-04 12:50:47 -08:00
Matt Arsenault 5d2749938c AMDGPU/GlobalISel: Custom lower G_FEXP 2020-02-04 11:50:55 -08:00
Matt Arsenault b461436d01 AMDGPU/GlobalISel: Legalize s16 G_FEXP2 2020-02-04 11:50:55 -08:00
Matt Arsenault 1024b73ef5 AMDGPU: Split denormal mode tracking bits
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.

This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
2020-02-04 10:44:21 -08:00
Sanjay Patel 0cf0be993c [InstCombine] fix operands of shouldChangeType() for casted phi transform
This is a bug noted in the recent D72733 and seen
in the similar transform just above the changed source code.

I added tests with illegal types and zexts to show the bug -
we could transform legal phi ops to illegal, etc. I did not add
tests with trunc because we won't see any diffs on those patterns.
That is because InstCombiner::SliceUpIllegalIntegerPHI() appears to
do those transforms independently of datalayout. It can also create
more casts than are present in existing code.

There are some existing regression tests that do not include a
datalayout that would be altered by this fix. I assumed that the
lack of a datalayout in those regression files is an oversight, so
I added the minimal layout (make i32 legal) necessary to preserve
behavior on those tests.

Differential Revision: https://reviews.llvm.org/D73907
2020-02-04 07:45:48 -05:00
Jay Foad 2252cac694 [ANDGPU] getMemOperandsWithOffset: support BUF non-stack-access instructions with resource but no vaddr
Summary:
This enables clustering for many more BUF instructions.

Reviewers: rampitec, arsenm, nhaehnle

Subscribers: jvesely, wdng, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73868
2020-02-03 22:49:30 +00:00
Matt Arsenault 7d3aace3f5 AMDGPU: Add flag to control mem intrinsic expansion
GlobalISel doesn't implement the expansion for these yet, so add a
flag to force expanding these so it's possible to avoid these for a
while.
2020-02-03 14:26:01 -08:00
Matt Arsenault cb7b661d3d AMDGPU: Analyze divergence of inline asm 2020-02-03 12:42:16 -08:00
Matt Arsenault 2758ae41ae AMDGPU/GlobalISel: Allow selecting s128 load/stores 2020-02-03 12:28:08 -08:00
Matt Arsenault 726446a009 AMDGPU: Fix splitting wide f32 s.buffer.load intrinsics
This would witch f32 to i32, and produce an invald concat_vectors from
i32 pieces to an f32 vector.
2020-02-03 12:28:08 -08:00
Matt Arsenault cd7650c186 GlobalISel: Implement fewerElementsVector for G_SEXT_INREG
Start using a new strategy with a combination of merge and unmerges.

This allows scalarizing before lowering, which in cases like
<2 x s128> avoids producing giant illegal shifts.
2020-02-03 11:47:33 -08:00
Jay Foad 05297b7cbe [AMDGPU] getMemOperandsWithOffset: add resource operand for BUF instructions
Summary:
This prevents unwanted clustering of BUF instructions with the same
vaddr but different resource descriptors.

Reviewers: rampitec, arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73867
2020-02-03 17:06:09 +00:00
Matt Arsenault 00b22df71d AMDGPU: Fix extra type mangling on llvm.amdgcn.if.break
These have to be the same mask type.
2020-02-03 07:02:05 -08:00
Matt Arsenault 95a9b828f3 AMDGPU/GlobalISel: Fix mem size in test
This wasn't intended to tests an extload.
2020-02-03 05:41:14 -08:00
Jay Foad 97d9a76afc [AMDGPU] Don't remove short branches over kills
Summary:
D68092 introduced a new SIRemoveShortExecBranches optimization pass and
broke some graphics shaders. The problem is that it was removing
branches over KILL pseudo instructions, and the fix is to explicitly
check for that in mustRetainExeczBranch.

Reviewers: critson, arsenm, nhaehnle, cdevadas, hakzsam

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73771
2020-02-03 09:26:52 +00:00
Simon Pilgrim 547a94ffa1 Regenerate bitcast test for upcoming patch. 2020-02-02 18:27:44 +00:00
Nicolai Hähnle ba8110161d AMDGPU/GFX10: Fix NSA reassign pass when operands are undef
Summary:
Virtual registers that are undef have an empty LiveInterval at this
point, which means beginIndex() and endIndex() cannot be used. We
only need those indices to determine the range in which to scan for
affected other NSA instructions, and undef operands cannot contribute
to that range.

Reviewers: arsenm, rampitec, mareko

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73831
2020-02-01 22:41:40 +01:00