Commit Graph

2131 Commits

Author SHA1 Message Date
Matt Arsenault 0cb08e448a Allow FP types for atomicrmw xchg
llvm-svn: 351427
2019-01-17 10:49:01 +00:00
Changpeng Fang fe9269f804 AMDGPU: Adjust the chain for loads writing to the HI part of a register.
Summary:
  For these loads that write to the HI part of a register, we should chain them to the op that writes to the LO part
of the register to maintain the appropriate order.

Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D56454

llvm-svn: 351379
2019-01-16 21:32:53 +00:00
Marek Olsak c5cec5e1fa AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52944

llvm-svn: 351351
2019-01-16 15:43:53 +00:00
Changpeng Fang 20fe3d2f35 AMDGPU: Raise the priority of MAD24 in instruction selection.
Summary:
  We have seen performance regression when v_add3 is generated. The major reason is that the v_mad pattern
is broken when v_add3 is generated. We also see the register pressure increased. While we could not properly
estimate register pressure during instruction selection, we can give mad a higher priority.

In this work, we raise the priority for mad24 in selection and resolve the performance regression.

Reviewers:
  rampitec

Differential Revision:
  https://reviews.llvm.org/D56745

llvm-svn: 351273
2019-01-15 23:12:36 +00:00
James Y Knight 693d39dd12 Remove irrelevant references to legacy git repositories from
compiler identification lines in test-cases.

(Doing so only because it's then easier to search for references which
are actually important and need fixing.)

llvm-svn: 351200
2019-01-15 16:18:52 +00:00
Marek Olsak 33eb4d947d AMDGPU: Add a fast path for icmp.i1(src, false, NE)
Summary:
This allows moving the condition from the intrinsic to the standard ICmp
opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic
is an identity for retrieving the SGPR mask.

And we can also get the mask from and i1, or i1, xor i1.

Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52060

llvm-svn: 351150
2019-01-15 02:13:18 +00:00
David Stuttard f77079f892 [AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).

There's an additional fix now to avoid a dmask=0

For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.

Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.

The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:

%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
                                      i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1

This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.

Differential revision: https://reviews.llvm.org/D48826

Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda


Work around for ppcle compiler bug

Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 11:55:24 +00:00
Neil Henning e85d45a699 [AMDGPU] Fix dwordx3/southern-islands failures.
This commit fixes the dwordx3/southern-islands failures that were found
in bugzilla https://bugs.llvm.org/show_bug.cgi?id=40129, by not
generating the dwordx3 variants of load/store instructions that were
added to the ISA after southern islands.

Differential Revision: https://reviews.llvm.org/D56434

llvm-svn: 350838
2019-01-10 16:21:08 +00:00
Valery Pykhtin b7a459547d Revert "[AMDGPU] Fix DPP combiner"
This reverts commit e3e2923a39cbec3b3bc3a7d3f0e9a77a4115080e, svn revision rL350721

llvm-svn: 350730
2019-01-09 15:21:53 +00:00
Valery Pykhtin 1e0b5c719b [AMDGPU] Fix DPP combiner
Fixed issue with identity values and other cases, f32/f16 identity values to be added later. fma/mac instructions is disabled for now.
Test is fully reworked, added comments. Other fixes:

1. dpp move with uses and old reg initializer should be in the same BB.
2. bound_ctrl:0 is only considered when bank_mask and row_mask are fully enabled (0xF). Othervise the old register value is checked for identity.
3. Added add, subrev, and, or instructions to the old folding function.
4. Kill flag is cleared for the src0 (DPP register) as it may be copied into more than one user.

Differential revision: https://reviews.llvm.org/D55444

llvm-svn: 350721
2019-01-09 13:43:32 +00:00
Matt Arsenault 0ad1b71fe3 RegisterCoalescer: Assume CR_Replace for SubRangeJoin
Currently it's possible for following
check on V.WriteLanes (which is not really meaningful
during SubRangeJoin) to pass for one half of the pair,
and then fall through to to one of the impossible
or unresolved states. This then fails as inconsistent
on the other half.

During the main range join, the check between V.WriteLanes
and OtherV.ValidLanes must have passed, meaning this
should be a CR_Replace.

Fixes most of the testcases in bugs 39542 and 39602

llvm-svn: 350678
2019-01-08 23:22:18 +00:00
Matt Arsenault 2c807410fd RegisterCoalescer: Defer clearing implicit_def lanes
We can't go back and recover the lanes if it turns
out the implicit_def really can't be erased.

Assume all lanes are valid if an unresolved conflict
is encountered. There aren't any tests where this
seems to matter either way, but this seems like a
safer option.

Fixes bug 39602

llvm-svn: 350676
2019-01-08 23:10:47 +00:00
Matt Arsenault c765240060 AMDGPU/GlobalISel: Introduce vcc reg bank
I'm not entirely sure this is the correct thing
to do with the global isel philosophy, but I think
this is necessary to handle how differently SGPRs
are used normally vs. from a condition.

For example, it makes sense to allow a copy
from a VGPR to an SGPR, but it makes no sense
to allow a copy from VGPRs to SGPRs used as
select mask.

This avoids regbankselecting strange code with
a truncate feeding directly into a condition field.
Now a copy is forced from sgpr(s1) to vcc, which is
more sensible to handle.

Some of these issues could probably avoided with making enough
operations resulting in i1 illegal. I think we can't avoid
this register bank for legality.

For example, an i1 and where one source is from a truncate, and
one source is a compare needs some kind of copy inserted to
make sure both are in condition registers.

llvm-svn: 350611
2019-01-08 06:30:53 +00:00
Matt Arsenault a1515d2d33 AMDGPU/GlobalISel: Legalize concat_vectors
llvm-svn: 350598
2019-01-08 01:30:02 +00:00
Matt Arsenault adc40baa29 RegBankSelect: Fix copy insertion point for terminators
If a copy was needed to handle the condition of brcond, it was being
inserted before the defining instruction. Add tests for iterator edge
cases.

I find the existing code here suspect for the case where it's looking
for terminators that modify the register. It's going to insert a copy
in the middle of the terminators, which isn't allowed (it might be
necessary to have a COPY_terminator if anybody actually needs this).

Also legalize brcond for AMDGPU.

llvm-svn: 350595
2019-01-08 01:22:47 +00:00
Matt Arsenault ae6f1e07fc AMDGPU/GlobalISel: Disallow VGPR->SCC copies
This fixes using scalar adds when only the carry in is a VGPR
using greedy regbankselect.

llvm-svn: 350593
2019-01-08 01:13:20 +00:00
Matt Arsenault 68c668a5f3 AMDGPU/GlobalISel: RegBankSelect for carry-in
I'm not sure we should be allowing the truncate
to s1 for the inputs. It may be necessary to
create a new VCC reg bank.

llvm-svn: 350592
2019-01-08 01:09:09 +00:00
Matt Arsenault 2cc15b67b7 AMDGPU/GlobalISel: RegBankSelect for add/sub with carry out
llvm-svn: 350589
2019-01-08 01:03:58 +00:00
Matt Arsenault 299302fbe7 AMDGPU/GlobalISel: InstrMapping for G_UNMERGE_VALUES
llvm-svn: 350588
2019-01-08 00:46:19 +00:00
Craig Topper 826f44b550 [TargetLowering][AMDGPU] Remove the SimplifyDemandedBits function that takes a User and OpIdx. Stop using it in AMDGPU target for simplifyI24.
As we saw in D56057 when we tried to use this function on X86, it's unsafe. It allows the operand node to have multiple users, but doesn't prevent recursing past the first node when it does have multiple users. This can cause other simplifications earlier in the graph without regard to what bits are needed by the other users of the first node. Ideally all we should do to the first node if it has multiple uses is bypass it when its not needed by the user we started from. Doing any other transformation that SimplifyDemandedBits can do like turning ZEXT/SEXT into AEXT would result in an increase in instructions.

Fortunately, we already have a function that can do just that, GetDemandedBits. It will only make transformations that involve bypassing a node.

This patch changes AMDGPU's simplifyI24, to use a combination of GetDemandedBits to handle the multiple use simplifications. And then uses the regular SimplifyDemandedBits on each operand to handle simplifications allowed when the operand only has a single use. Unfortunately, GetDemandedBits simplifies constants more aggressively than SimplifyDemandedBits. This caused the -7 constant in the changed test to be simplified to remove the upper bits. I had to modify computeKnownBits to account for this by ignoring the upper 8 bits of the input.

Differential Revision: https://reviews.llvm.org/D56087

llvm-svn: 350560
2019-01-07 19:30:43 +00:00
Rhys Perry f77e2e8406 AMDGPU: test for uniformity of branch instruction, not its condition
Summary:
If a divergent branch instruction is marked as divergent by propagation
rule 2 in DivergencePropagator::exploreSyncDependency() and its condition
is uniform, that branch would incorrectly be assumed to be uniform.

Reviewers: arsenm, tstellar

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D56331

llvm-svn: 350532
2019-01-07 15:52:28 +00:00
Matt Arsenault 369acb8470 AMDGPU: Remove VS/SV mappings from select
These would violate the constant bus restriction

llvm-svn: 350517
2019-01-07 13:21:36 +00:00
Simon Pilgrim 09bf22862a Regenerate test.
Prep work towards enabling SimplifyDemandedBits vector support for TRUNCATE as discussed on D56118.

llvm-svn: 350513
2019-01-07 12:20:35 +00:00
Stanislav Mekhanoshin 35a3a3bd11 Added single use check to ShrinkDemandedConstant
Fixes cvt_f32_ubyte combine. performCvtF32UByteNCombine() could shrink
source node to demanded bits only even if there are other uses.

Differential Revision: https://reviews.llvm.org/D56289

llvm-svn: 350475
2019-01-05 19:20:00 +00:00
Alexander Timofeev 993e2798fd [AMDGPU] Fix scalar operand folding bug that causes SHOC performance regression.
Detailed description: SIFoldOperands::foldInstOperand iterates over the
operand uses calling the function that changes def-use iteratorson the
way. As a result loop exits immediately when def-use iterator is
changed. Hence, the operand is folded to the very first use instruction
only. This makes VGPR live along the whole basic block and increases
register pressure significantly. The performance drop observed in SHOC
DeviceMemory test is caused by this bug.

Proposed fix: collect uses to separate container for further processing
in another loop.

Testing: make check-llvm
SHOC performance test.

Reviewers: rampitec, ronlieb

Differential Revision: https://reviews.llvm.org/D56161

llvm-svn: 350350
2019-01-03 19:55:32 +00:00
Piotr Sobczak 3abef8f9ea [AMDGPU] Change section name with metadata access
Summary:
The commit rL348922 introduced a means to set Metadata
section kind for a global variable, if its explicit section
name was prefixed with ".AMDGPU.metadata.".

This patch changes that prefix to ".AMDGPU.comment.",
as "metadata" in the section name might lead to
ambiguity with metadata used by AMD PAL runtime.

Change-Id: Idd4748800d6fe801441d91595fc21e5a4171e668

Reviewers: kzhuravl

Reviewed By: kzhuravl

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D56197

llvm-svn: 350292
2019-01-03 11:22:58 +00:00
Piotr Sobczak 378131bae0 [AMDGPU] Handle OR as operand of raw load/store
Summary:
Use isBaseWithConstantOffset() which handles OR as an operand
to llvm.amdgcn.raw.buffer.load and llvm.amdgcn.raw.buffer.store.

Change-Id: Ifefb9dc5ded8710d333df07ab1900b230e33539a

Reviewers: nhaehnle, mareko, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55999

llvm-svn: 350208
2019-01-02 09:47:41 +00:00
Simon Pilgrim a8ff77bb34 [AMDGPU] Regenerate i64 shift tests.
To show codegen diff due to a future SimplifyDemandedBits patch.

llvm-svn: 350065
2018-12-26 12:09:10 +00:00
Sanjay Patel 4b537aaf6d [DAGCombiner] allow narrowing of add followed by truncate
trunc (add X, C ) --> add (trunc X), C'

If we're throwing away the top bits of an 'add' instruction, do it in the narrow destination type.
This makes the truncate-able opcode list identical to the sibling transform done in IR (in instcombine).

This change used to show regressions for x86, but those are gone after D55494. 
This gets us closer to deleting the x86 custom function (combineTruncatedArithmetic) 
that does almost the same thing.

Differential Revision: https://reviews.llvm.org/D55866

llvm-svn: 350006
2018-12-22 17:10:31 +00:00
Changpeng Fang 6f539294b5 AMDGPU: Don't peel of the offset if the resulting base could possibly be negative in Indirect addressing.
Summary:
  Don't peel of the offset if the resulting base could possibly be negative in Indirect addressing.
This is because the M0 field is of unsigned.

This patch achieves the similar goal as https://reviews.llvm.org/D55241, but keeps the optimization
if the base is known unsigned.

Reviewers:
  arsemn

Differential Revision:
  https://reviews.llvm.org/D55568

llvm-svn: 349951
2018-12-21 20:57:34 +00:00
Matt Arsenault 3eae3c4590 AMDGPU/GlobalISel: RegBankSelect for amdgcn.wqm.vote
llvm-svn: 349882
2018-12-21 03:20:54 +00:00
Matt Arsenault f4c21c575a AMDGPU/GlobalISel: RegBankSelect for some fp ops
llvm-svn: 349880
2018-12-21 03:14:45 +00:00
Matt Arsenault bee2ad7185 AMDGPU/GlobalISel: Redo legality for build_vector
It seems better to avoid using the callback if possible since
there are coverage assertions which are disabled if this is used.

Also fix missing tests. Only test the legal cases since it seems
legalization for build_vector is quite lacking.

llvm-svn: 349878
2018-12-21 03:03:11 +00:00
Matt Arsenault 4339883710 AMDGPU: Make i1/i64/v2i32 and/or/xor legal
The 64-bit types do depend on the register bank,
but that's another issue to deal with later.

llvm-svn: 349716
2018-12-20 01:35:49 +00:00
Matt Arsenault 8cc98bee8a AMDGPU/GlobalISel: Fix ValueMapping tables for i1
This was incorrectly selecting SGPR for any i1 values,
e.g. G_TRUNC to i1 from a VGPR was still an SGPR.

llvm-svn: 349715
2018-12-20 01:33:43 +00:00
Matt Arsenault dff33c38e1 AMDGPU/GlobalISel: RegBankSelect for fp conversions
llvm-svn: 349709
2018-12-20 00:37:02 +00:00
Matt Arsenault 36d4092173 AMDGPU/GlobalISel: Legality/regbankselect for atomicrmw/atomic_cmpxchg
llvm-svn: 349708
2018-12-20 00:33:49 +00:00
Rhys Perry 3931ad38b9 AMDGPU: Add patterns for v4i16/v4f16 -> v4i16/v4f16 bitcasts
Reviewers: arsenm, tstellar

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55058

llvm-svn: 349694
2018-12-19 22:53:33 +00:00
Nicolai Haehnle 8d5e974076 AMDGPU: Use an ABS32_LO relocation for SCRATCH_RSRC_DWORD1
Summary:
Using HI here makes no logical sense, since the dword is only
32 bits to begin with.

Current Mesa master does not look at the relocation type at all,
so this change is fine. Future Mesa will rely on this, however.

Change-Id: I91085707834c4ac0370926602b93c94b90e44cb1

Reviewers: arsenm, rampitec, mareko

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55369

llvm-svn: 349620
2018-12-19 11:55:03 +00:00
Carl Ritson c521ac3a44 AMDGPU/InsertWaitcnts: Update VGPR/SGPR bounds when brackets are merged
Summary:
Fix an issue where VGPR/SGPR bounds are not properly extended when brackets are merged.
This manifests as missing waitcnt insertions when multiple brackets are forwarded to a successor block and the first forward has lower VGPR/SGPR bounds.

Irreducible loop test has been extended based on a CTS failure detected for GFX9.

Reviewers: nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D55602

llvm-svn: 349611
2018-12-19 10:17:49 +00:00
Matt Arsenault b110e2277c AMDGPU/GlobalISel: Regbankselect for fsub
llvm-svn: 349608
2018-12-19 09:07:58 +00:00
Farhana Aleen 59ee2c5362 [AMDGPU] Removed the unnecessary operand size-check-assert from processBaseWithConstOffset().
Summary: 32bit operand sizes are guaranteed by the opcode check AMDGPU::V_ADD_I32_e64 and
         AMDGPU::V_ADDC_U32_e64. Therefore, we don't any additional operand size-check-assert.

Author: FarhanaAleen
llvm-svn: 349529
2018-12-18 19:58:39 +00:00
Matt Arsenault c94e26c71d AMDGPU: Legalize/regbankselect frame_index
llvm-svn: 349468
2018-12-18 09:46:13 +00:00
Matt Arsenault c0ea221068 AMDGPU: Legalize/regbankselect fma
llvm-svn: 349467
2018-12-18 09:39:56 +00:00
Matt Arsenault e01e7c81f2 AMDGPU/GlobalISel: Legalize/regbankselect fneg/fabs/fsub
llvm-svn: 349463
2018-12-18 09:19:03 +00:00
Farhana Aleen ce095c564a [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions.
Summary: Promote constant offset to immediate by recomputing the relative 13bit offset from nearby instructions.
 E.g.
  s_movk_i32 s0, 0x1800
  v_add_co_u32_e32 v0, vcc, s0, v2
  v_addc_co_u32_e32 v1, vcc, 0, v6, vcc

  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[0:1], off
  =>
  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[5:6], off offset:2048

Author: FarhanaAleen

Reviewed By: arsenm, rampitec

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D55539

llvm-svn: 349196
2018-12-14 21:13:14 +00:00
John Brawn 1d0d86ae40 [RegAllocGreedy] IMPLICIT_DEF values shouldn't prefer registers
It costs nothing to spill an IMPLICIT_DEF value (the only spill code that's
generated is a KILL of the value), so when creating split constraints if the
live-out value is IMPLICIT_DEF the exit constraint should be DontCare instead
of PrefReg.

Differential Revision: https://reviews.llvm.org/D55652

llvm-svn: 349151
2018-12-14 14:07:57 +00:00
Aakanksha Patil bc568766b2 Revert r348971: [AMDGPU] Support for "uniform-work-group-size" attribute
This patch breaks RADV (and probably RadeonSI as well)

llvm-svn: 349084
2018-12-13 21:23:12 +00:00
Matt Arsenault 934e534c47 AMDGPU/GlobalISel: Legalize/regbankselect block_addr
llvm-svn: 349081
2018-12-13 20:34:15 +00:00
Matt Arsenault 577b9fc543 AMDGPU/GlobalISel: Legalize f64 fadd/fmul
llvm-svn: 349014
2018-12-13 08:27:48 +00:00
Matt Arsenault f38f483bef AMDGPU/GlobalISel: RegBankSelect some simple operations
llvm-svn: 349012
2018-12-13 08:23:51 +00:00
Matt Arsenault 7acf89a21a AMDGPU/GlobalISel: Test cleanups
Remove IR and registers sections

llvm-svn: 349011
2018-12-13 08:11:45 +00:00
Stanislav Mekhanoshin 6071e1aa58 [AMDGPU] Simplify negated condition
Optimize sequence:

  %sel = V_CNDMASK_B32_e64 0, 1, %cc
  %cmp = V_CMP_NE_U32 1, %1
  $vcc = S_AND_B64 $exec, %cmp
  S_CBRANCH_VCC[N]Z
=>
  $vcc = S_ANDN2_B64 $exec, %cc
  S_CBRANCH_VCC[N]Z

It is the negation pattern inserted by DAGCombiner::visitBRCOND() in the
rebuildSetCC().

Differential Revision: https://reviews.llvm.org/D55402

llvm-svn: 349003
2018-12-13 03:17:40 +00:00
Aakanksha Patil 729309cc89 [AMDGPU] Support for "uniform-work-group-size" attribute
Updated the annotate-kernel-features pass to support the propagation of uniform-work-group attribute from the kernel to the called functions. Once this pass is run, all kernels, even the ones which initially did not have the attribute, will be able to indicate weather or not they have uniform work group size depending on the value of the attribute. 

Differential Revision: https://reviews.llvm.org/D50200

llvm-svn: 348971
2018-12-12 20:49:17 +00:00
Scott Linder f5b36e56fb [AMDGPU] Emit MessagePack HSA Metadata for v3 code object
Continue to present HSA metadata as YAML in ASM and when output by tools
(e.g. llvm-readobj), but encode it in Messagepack in the code object.

Differential Revision: https://reviews.llvm.org/D48179

llvm-svn: 348963
2018-12-12 19:39:27 +00:00
Neil Henning 76504a4c5e [AMDGPU] Extend the SI Load/Store optimizer to combine more things.
I've extended the load/store optimizer to be able to produce dwordx3
loads and stores, This change allows many more load/stores to be combined,
and results in much more optimal code for our hardware.

Differential Revision: https://reviews.llvm.org/D54042

llvm-svn: 348937
2018-12-12 16:15:21 +00:00
Piotr Sobczak 3732b4ce25 [AMDGPU] Set metadata access for explicit section
Summary:
This patch provides a means to set Metadata section kind
for a global variable, if its explicit section name is
prefixed with ".AMDGPU.metadata."
This could be useful to make the global variable go to
an ELF section without any section flags set.

Reviewers: dstuttard, tpr, kzhuravl, nhaehnle, t-tye

Reviewed By: dstuttard, kzhuravl

Subscribers: llvm-commits, arsenm, jvesely, wdng, yaxunl, t-tye

Differential Revision: https://reviews.llvm.org/D55267

llvm-svn: 348922
2018-12-12 11:20:04 +00:00
Amara Emerson 5ec146046c [GlobalISel] Restrict G_MERGE_VALUES capability and replace with new opcodes.
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.

This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.

Differential Revisions: https://reviews.llvm.org/D53629

llvm-svn: 348788
2018-12-10 18:44:58 +00:00
Neil Henning e448351b77 [AMDGPU] Change the l1 flush instruction for AMDPAL/MESA3D.
This commit changes which l1 flush instruction is used for AMDPAL and
MESA3d workloads to flush the entire l1 cache instead of just the
volatile lines.

Differential Revision: https://reviews.llvm.org/D55367

llvm-svn: 348771
2018-12-10 16:35:53 +00:00
Tim Corringham 4c4d2fe280 [AMDGPU] Add new Mode Register pass
A new pass to manage the Mode register.

Currently this just manages the floating point double precision
rounding requirements, but is intended to be easily extended to
encompass all Mode register settings.

The immediate motivation comes from the requirement to use the
round-to-zero rounding mode for the 16 bit interpolation
instructions, where the rounding mode setting is shared between
16 and 64 bit operations.

llvm-svn: 348754
2018-12-10 12:06:10 +00:00
Brian Gesiak b963c5150d [AMDGPU] Fix discarded result of addAttribute
Summary:
`llvm::AttributeList` and `llvm::AttributeSet` are immutable, and so methods
defined on these classes, such as `addAttribute`, return a new immutable
object with the attribute added. In https://reviews.llvm.org/D55217 I attempted
to annotate methods such as `addAttribute` with `LLVM_NODISCARD`, since
calling these methods has no side-effects, and so ignoring the result
that is returned is almost certainly a programmer error.

However, committing the change resulted in new warnings in the AMDGPU target.
The AMDGPU simplify libcalls pass added in https://reviews.llvm.org/D36436
attempts to add the readonly and nounwind attributes to simplified
library functions, but instead calls the `addAttribute` methods and
ignores the result.

Modify the simplify libcalls pass to actually add the nounwind and
readonly attributes. Also update the simplify libcalls test to assert
that these attributes are actually being set.

Reviewers: rampitec, vpykhtin, rnk

Reviewed By: rampitec

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55435

llvm-svn: 348732
2018-12-09 21:56:50 +00:00
Sanjay Patel e767bf4468 [DAGCombiner] re-enable truncation of binops
This is effectively re-committing the changes from:
rL347917 (D54640)
rL348195 (D55126)
...which were effectively reverted here:
rL348604
...because the code had a bug that could induce infinite looping
or eventual out-of-memory compilation.

The bug was that this code did not guard against transforming
opaque constants. More details are in the post-commit mailing
list thread for r347917. A reduced test for that is included
in the x86 bool-math.ll file. (I wasn't able to reduce a PPC
backend test for this, but it was almost the same pattern.)

Original commit message for r347917:

The motivating case for this is shown in:
https://bugs.llvm.org/show_bug.cgi?id=32023
and the corresponding rot16.ll regression tests.

Because x86 scalar shift amounts are i8 values, we can end up with trunc-binop-trunc
sequences that don't get folded in IR.

As the TODO comments suggest, there will be regressions if we extend this (for x86,
we mostly seem to be missing LEA opportunities, but there are likely vector folds
missing too). I think those should be considered existing bugs because this is the
same transform that we do as an IR canonicalization in instcombine. We just need
more tests to make those visible independent of this patch.

llvm-svn: 348706
2018-12-08 16:07:38 +00:00
Matt Arsenault b5613ecf17 AMDGPU: Fix offsets for < 4-byte aggregate kernel arguments
We were still using the rounded down offset and alignment even though
they aren't handled because you can't trivially bitcast the loaded
value.

llvm-svn: 348658
2018-12-07 22:12:17 +00:00
Matt Arsenault fab7d27f0e AMDGPU: Use gfx9 instead of gfx8 in a test
They are the same for the purposes of the tests,
but it's much easier to write check lines for
the memory instructions with offsets.

llvm-svn: 348643
2018-12-07 20:57:43 +00:00
Matt Arsenault ce2e053134 AMDGPU: Allow f32 types for llvm.amdgcn.s.buffer.load
llvm-svn: 348625
2018-12-07 18:41:39 +00:00
Matt Arsenault ca8eb0b672 AMDGPU: Remove llvm.SI.tbuffer.store
llvm-svn: 348619
2018-12-07 18:03:47 +00:00
Matt Arsenault 3ff764a944 AMDGPU: Remove llvm.SI.buffer.load.dword
llvm-svn: 348616
2018-12-07 17:46:20 +00:00
Matt Arsenault aa9bcd56b1 AMDGPU: Remove llvm.AMDGPU.kill
This is the last of the old AMDGPU intrinsics.

llvm-svn: 348615
2018-12-07 17:46:16 +00:00
Sanjay Patel 3af4ae9735 [DAGCombiner] disable truncation of binops by default
As discussed in the post-commit thread of r347917, this
transform is fighting with an existing transform causing
an infinite loop or out-of-memory, so this is effectively 
reverting r347917 and its follow-up r348195 while we
investigate the bug.

llvm-svn: 348604
2018-12-07 15:47:52 +00:00
Graham Sellers b297379ef0 [AMDGPU] Shrink scalar AND, OR, XOR instructions
This change attempts to shrink scalar AND, OR and XOR instructions which take an immediate that isn't inlineable.

It performs:
AND s0, s0, ~(1 << n) -> BITSET0 s0, n
OR s0, s0, (1 << n) -> BITSET1 s0, n
AND s0, s1, x -> ANDN2 s0, s1, ~x
OR s0, s1, x -> ORN2 s0, s1, ~x
XOR s0, s1, x -> XNOR s0, s1, ~x

In particular, this catches setting and clearing the sign bit for fabs (and x, 0x7ffffffff -> bitset0 x, 31 and or x, 0x80000000 -> bitset1 x, 31).

llvm-svn: 348601
2018-12-07 15:33:21 +00:00
Simon Pilgrim d498dee7a2 [SelectionDAG] Don't pass on DemandedElts when handling SCALAR_TO_VECTOR
Fixes an assertion:

llc: lib/CodeGen/SelectionDAG/SelectionDAG.cpp:2200: llvm::KnownBits llvm::SelectionDAG::computeKnownBits(llvm::SDValue, const llvm::APInt&, unsigned int) const: Assertion `(!Op.getValueType().isVector() || NumElts == Op.getValueType().getVectorNumElements()) && "Unexpected vector size"' failed.

Committed on behalf of: @pendingchaos (Rhys Perry)

Differential Revision: https://reviews.llvm.org/D55223

llvm-svn: 348574
2018-12-07 09:18:44 +00:00
Adrian Prantl fbeeac0e1e Reapply "Adapt gcov to changes in CFE."
This reverts commit r348203 and reapplies D55085 with an additional
GCOV bugfix to make the change NFC for relative file paths in .gcno files.

Thanks to Ilya Biryukov for additional testing!

Original commit message:

    Update Diagnostic handling for changes in CFE.

    The clang frontend no longer emits the current working directory for
    DIFiles containing an absolute path in the filename: and will move the
    common prefix between current working directory and the file into the
    directory: component.

    https://reviews.llvm.org/D55085

llvm-svn: 348512
2018-12-06 18:44:48 +00:00
Nicolai Haehnle ca4a32945f AMDGPU: Generate VALU ThreeOp Integer instructions
Summary:
Original patch by: Fabian Wahlster <razor@singul4rity.com>

Change-Id: I148f692a88432541fad468963f58da9ddf79fac5

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, b-sumner, llvm-commits

Differential Revision: https://reviews.llvm.org/D51995

llvm-svn: 348488
2018-12-06 14:33:40 +00:00
Matt Arsenault b3e14de487 AMDGPU: Fix using old address spaces in some tests
llvm-svn: 348385
2018-12-05 17:34:59 +00:00
Valery Pykhtin 5b4db77b13 [AMDGPU]: Turn on the DPP combiner by default
Differential revision: https://reviews.llvm.org/D55314

llvm-svn: 348371
2018-12-05 15:21:17 +00:00
Craig Topper 6934202dc0 [MachineLICM][X86][AMDGPU] Fix subtle bug in the updating of PhysRegClobbers in post-RA LICM
It looks like MCRegAliasIterator can visit the same physical register twice. When this happens in this code in LICM we end up setting the PhysRegDef and then later in the same loop visit the register again. Now we see that PhysRegDef is set from the earlier iteration so now set PhysRegClobber.

This patch splits the loop so we have one that uses the previous value of PhysRegDef to update PhysRegClobber and second loop that updates PhysRegDef.

The X86 atomic test is an improvement. I had to add sideeffect to the two shrink wrapping tests to prevent hoisting from occurring. I'm not sure about the AMDGPU tests. It looks like the branch instruction changed at end the of the loops. And in the branch-relaxation test I think there is now "and vcc, exec, -1" instruction that wasn't there before.

Differential Revision: https://reviews.llvm.org/D55102

llvm-svn: 348330
2018-12-05 03:41:26 +00:00
Ilya Biryukov 449a7f0dbb Revert "Adapt gcov to changes in CFE."
This reverts commit r348203.
Reason: this produces absolute paths in .gcno files, breaking us
internally as we rely on them being consistent with the filenames passed
in the command line.

Also reverts r348157 and r348155 to account for revert of r348154 in
clang repository.

llvm-svn: 348279
2018-12-04 16:30:31 +00:00
Adrian Prantl 0f873eb80a Update Diagnostic handling for changes in CFE.
The clang frontend no longer emits the current working directory for
DIFiles containing an absolute path in the filename: and will move the
common prefix between current working directory and the file into the
directory: component.

https://reviews.llvm.org/D55085

llvm-svn: 348155
2018-12-03 17:55:29 +00:00
Ron Lieberman 16de4fd2eb [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
The introduction of S_{ADD|SUB}_U64_PSEUDO instructions which are decomposed
into VOP3 instruction pairs for S_ADD_U64_PSEUDO:
  V_ADD_I32_e64
  V_ADDC_U32_e64
and for S_SUB_U64_PSEUDO
  V_SUB_I32_e64
  V_SUBB_U32_e64
preclude the use of SDWA to encode a constant.
SDWA: Sub-Dword addressing is supported on VOP1 and VOP2 instructions,
but not on VOP3 instructions.

We desire to fold the bit-and operand into the instruction encoding
for the V_ADD_I32 instruction. This requires that we transform the
VOP3 into a VOP2 form of the instruction (_e32).
  %19:vgpr_32 = V_AND_B32_e32 255,
      killed %16:vgpr_32, implicit $exec
  %47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
      %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
 %48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
      %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec

which then allows the SDWA encoding and becomes
  %47:vgpr_32 = V_ADD_I32_sdwa
      0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
      implicit-def $vcc, implicit $exec
  %48:vgpr_32 = V_ADDC_U32_e32
      0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec


Differential Revision: https://reviews.llvm.org/D54882

llvm-svn: 348132
2018-12-03 13:04:54 +00:00
Graham Sellers ba559ac058 [AMDGPU] Split 64-Bit XNOR to 64-Bit NOT/XOR
The identity ~(x ^ y) == (~x ^ y) == (x ^ ~y) allows XNOR (XOR/NOT) to turn into NOT/XOR. Handling this case with its own split means we can make the NOT remain in the scalar unit. Previously, we split 64-bit XNOR into two 32-bit XNOR, then lowered. Now, we get three instructions (s_not, v_xor, v_xor) rather than four in the case where either of the sources is a scalar 64-bit.

Add test cases to xnor.ll to attempt XNOR Vx, Sy and XNOR Sx, Vy. Also adding test that uses the opposite identity such that (~x ^ y) on the scalar unit (or vector for gfx906) can generate XNOR. This already worked, but I didn't see a test for it.

Differential: https://reviews.llvm.org/D55071
llvm-svn: 348075
2018-12-01 12:27:53 +00:00
Simon Pilgrim e017ed3245 [SelectionDAG] Improve SimplifyDemandedBits to SimplifyDemandedVectorElts simplification
D52935 introduced the ability for SimplifyDemandedBits to call SimplifyDemandedVectorElts through BITCASTs if the demanded bit mask entirely covered the sub element.

This patch relaxes this to demanding an element if we need any bit from it.

Differential Revision: https://reviews.llvm.org/D54761

llvm-svn: 348073
2018-12-01 12:08:55 +00:00
Nicolai Haehnle a7b00058e0 AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19

Reviewers: arsenm, alex-t, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53283

llvm-svn: 348050
2018-11-30 22:55:38 +00:00
Nicolai Haehnle a9cc92c247 AMDGPU: Fix various issues around the VirtReg2Value mapping
Summary:
The VirtReg2Value mapping is crucial for getting consistently
reliable divergence information into the SelectionDAG. This
patch fixes a bunch of issues that lead to incorrect divergence
info and introduces tight assertions to ensure we don't regress:

1. VirtReg2Value is generated lazily; there were some cases where
   a lookup was performed before all relevant virtual registers were
   created, leading to an out-of-sync mapping. Those cases were:

  - Complex code to lower formal arguments that generated CopyFromReg
    nodes from live-in registers (fixed by never querying the mapping
    for live-in registers).

  - Code that generates CopyToReg for formal arguments that are used
    outside the entry basic block (fixed by never querying the
    mapping for Register nodes, which don't need the divergence info
    anyway).

2. For complex values that are lowered to a sequence of registers,
   all registers must be reflected in the VirtReg2Value mapping.

I am not adding any new tests, since I'm not actually aware of any
bugs that these problems are causing with trunk as-is. However,
I recently added a test case (in r346423) which fails when D53283 is
applied without this change. Also, the new assertions should provide
most of the effective test coverage.

There is one test change in sdwa-peephole.ll. The underlying issue
is that since the divergence info is now correct, the DAGISel will
select V_OR_B32 directly instead of S_OR_B32. This leads to an extra
COPY which affects the behavior of MachineLICM in a way that ends up
with the S_MOV_B32 with the constant in a different basic block than
the V_OR_B32, which is presumably what defeats the peephole.

Reviewers: alex-t, arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54340

llvm-svn: 348049
2018-11-30 22:55:29 +00:00
Ron Lieberman f48e43bbf7 [AMDGPU] Disable SReg Global LD/ST, perf regression
Differential Revision: https://reviews.llvm.org/D55093

llvm-svn: 348014
2018-11-30 18:29:17 +00:00
Valery Pykhtin 3d9afa273f [AMDGPU] Combine DPP mov with use instructions (VOP1/2/3)
Introduces DPP pseudo instructions and the pass that combines DPP mov with subsequent uses.

Differential revision: https://reviews.llvm.org/D53762

llvm-svn: 347993
2018-11-30 14:21:56 +00:00
Sanjay Patel 8d27144251 [DAGCombiner] narrow truncated binops
The motivating case for this is shown in:
https://bugs.llvm.org/show_bug.cgi?id=32023
and the corresponding rot16.ll regression tests.

Because x86 scalar shift amounts are i8 values, we can end up with trunc-binop-trunc 
sequences that don't get folded in IR.

As the TODO comments suggest, there will be regressions if we extend this (for x86, 
we mostly seem to be missing LEA opportunities, but there are likely vector folds 
missing too). I think those should be considered existing bugs because this is the 
same transform that we do as an IR canonicalization in instcombine. We just need 
more tests to make those visible independent of this patch.

Differential Revision: https://reviews.llvm.org/D54640

llvm-svn: 347917
2018-11-29 20:58:26 +00:00
David Stuttard c6603861d8 Revert r347871 "Fix: Add support for TFE/LWE in image intrinsic"
Also revert fix r347876

One of the buildbots was reporting a failure in some relevant tests that I can't
repro or explain at present, so reverting until I can isolate.

llvm-svn: 347911
2018-11-29 20:14:17 +00:00
Graham Sellers 04f7a4d2d2 [AMDGPU] Add and update scalar instructions
This patch adds support for S_ANDN2, S_ORN2 32-bit and 64-bit instructions and adds splits to move them to the vector unit (for which there is no equivalent instruction). It modifies the way that the more complex scalar instructions are lowered to vector instructions by first breaking them down to sequences of simpler scalar instructions which are then lowered through the existing code paths. The pattern for S_XNOR has also been updated to apply inversion to one input rather than the output of the XOR as the result is equivalent and may allow leaving the NOT instruction on the scalar unit.

A new tests for NAND, NOR, ANDN2 and ORN2 have been added, and existing tests now hit the new instructions (and have been modified accordingly).

Differential: https://reviews.llvm.org/D54714
llvm-svn: 347877
2018-11-29 16:05:38 +00:00
David Stuttard de02e4b1cc Add support for TFE/LWE in image intrinsics
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).

There's an additional fix now to avoid a dmask=0

For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.

Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.

The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:

%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
                                      i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1

Differential revision: https://reviews.llvm.org/D48826

Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
llvm-svn: 347871
2018-11-29 15:21:13 +00:00
Nicolai Haehnle 7bed696915 AMDGPU/InsertWaitcnts: Remove the dependence on MachineLoopInfo
Summary:
MachineLoopInfo cannot be relied on for correctness, because it cannot
properly recognize loops in irreducible control flow which can be
introduced by late machine basic block optimization passes. See the new
test case for the reduced form of an example that occurred in practice.

Use a simple fixpoint iteration instead.

In order to facilitate this change, refactor WaitcntBrackets so that it
only tracks pending events and registers, rather than also maintaining
state that is relevant for the high-level algorithm. Various accessor
methods can be removed or made private as a consequence.

Affects (in radv):
- dEQP-VK.glsl.loops.special.{for,while}_uniform_iterations.select_iteration_count_{fragment,vertex}

Fixes: r345719 ("AMDGPU: Rewrite SILowerI1Copies to always stay on SALU")

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54231

llvm-svn: 347853
2018-11-29 11:06:26 +00:00
Nicolai Haehnle 1a94cbb3f5 AMDGPU/InsertWaitcnts: Untangle some semi-global state
Summary:
Reduce the statefulness of the algorithm in two ways:

1. More clearly split generateWaitcntInstBefore into two phases: the
   first one which determines the required wait, if any, without changing
   the ScoreBrackets, and the second one which actually inserts the wait
   and updates the brackets.

2. Communicate pre-existing s_waitcnt instructions using an argument to
   generateWaitcntInstBefore instead of through the ScoreBrackets.

To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.

There are some functional changes:

1. The FIXME for the VCCZ bug workaround was implemented: we only wait for
   SMEM instructions as required instead of waiting on all counters.

2. We now properly track pre-existing waitcnt's in all cases, which leads
   to less conservative waitcnts being emitted in some cases.

     s_load_dword ...
     s_waitcnt lgkmcnt(0)    <-- pre-existing wait count
     ds_read_b32 v0, ...
     ds_read_b32 v1, ...
     s_waitcnt lgkmcnt(0)    <-- this is too conservative
     use(v0)
     more code
     use(v1)

   This increases code size a bit, but the reduced latency should still be a
   win in basically all cases. The worst code size regressions in my shader-db
   are:

 WORST REGRESSIONS - Code Size
 Before After     Delta Percentage
   1724  1736        12    0.70 %   shaders/private/f1-2015/1334.shader_test [0]
   2276  2284         8    0.35 %   shaders/private/f1-2015/1306.shader_test [0]
   4632  4640         8    0.17 %   shaders/private/ue4_elemental/62.shader_test [0]
   2376  2384         8    0.34 %   shaders/private/f1-2015/1308.shader_test [0]
   3284  3292         8    0.24 %   shaders/private/talos_principle/1955.shader_test [0]

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54226

llvm-svn: 347848
2018-11-29 11:06:06 +00:00
Stanislav Mekhanoshin 443a7f9788 [AMDGPU] Disable DAG combine at -O0
Differential Revision: https://reviews.llvm.org/D54358

llvm-svn: 347659
2018-11-27 15:13:37 +00:00
Matt Arsenault dcdf3ddff5 AMDGPU: Cleanup / relax tests for future changes
llvm-svn: 347576
2018-11-26 17:17:07 +00:00
Simon Pilgrim bac49ac455 [AMDGPU] Regenerate weird stores tests.
Makes an upcoming SimplifyDemandedBits optimization much easier to understand.

llvm-svn: 347326
2018-11-20 17:04:02 +00:00
Stanislav Mekhanoshin 54ebfe8aee Implement computeKnownBits for scalar_to_vector
Differential Revision: https://reviews.llvm.org/D54728

llvm-svn: 347274
2018-11-19 23:34:07 +00:00
Konstantin Zhuravlyov 700b1ef54d AMDGPU: Fix V_FMA_F16 selection on GFX9
GFX9 should select opsel version.

Differential Revision: https://reviews.llvm.org/D54545

llvm-svn: 347265
2018-11-19 21:10:16 +00:00
Stanislav Mekhanoshin 8bafbae889 [AMDGPU] Restored selection of scalar_to_vector (v2x16)
This works if DAG combiner is enabled, but without combining
we cannot select scalar_to_vector of <2 x half> and <2 x i16>.

Differential Revision: https://reviews.llvm.org/D54718

llvm-svn: 347259
2018-11-19 19:58:13 +00:00
Stanislav Mekhanoshin 054f8101f1 [AMDGPU] Convert insert_vector_elt into set of selects
This allows to avoid scratch use or indirect VGPR addressing for
small vectors.

Differential Revision: https://reviews.llvm.org/D54606

llvm-svn: 347231
2018-11-19 17:39:20 +00:00
Sanjay Patel 8c0cd77bff [DAG] add undef simplifications for select nodes
Sadly, this duplicates (twice) the logic from InstSimplify. There
might be some way to at least share the DAG versions of the code, 
but copying the folds seems to be the standard method to ensure 
that we don't miss these folds. 

Unlike in IR, we don't run DAGCombiner to fixpoint, so there's no 
way to ensure that we do these kinds of simplifications unless the 
code is repeated at node creation time and during combines.

There were other tests that would become worthless with this
improvement that I changed as pre-commits:
rL347161
rL347164
rL347165
rL347166
rL347167

I'm not sure how to salvage the remaining tests (diffs in this patch).
So the x86 tests verify that the new code is working as intended.
The AMDGPU test is actually similar to my motivating case: we have
some undef value that has survived to machine IR in an x86 test, and 
then it gets folded in some weird way, or we crash if we don't transfer
the undef flag. But we would have been better off never getting to that
point by doing these simplifications.

This will lead back to PR32023 someday...
https://bugs.llvm.org/show_bug.cgi?id=32023

llvm-svn: 347170
2018-11-18 17:36:23 +00:00
Stanislav Mekhanoshin c12d64ab16 Moved dag-combine-select-undef.ll into amdgpu. NFC.
Tests really needs target arch to be specified.

llvm-svn: 347115
2018-11-17 00:17:15 +00:00
Stanislav Mekhanoshin 0ff7c8309d DAG combiner: fold (select, C, X, undef) -> X
Differential Revision: https://reviews.llvm.org/D54646

llvm-svn: 347110
2018-11-16 23:13:38 +00:00
Matt Arsenault eabb8dd015 AMDGPU: Fix analyzeBranch failing with pseudoterminators
If a block had one of the _term instructions used for gluing
exec modifying instructions to the end of the block,
analyzeBranch would fail, preventing the verifier from catching
a broken successor list.

llvm-svn: 347027
2018-11-16 05:03:02 +00:00
Ron Lieberman cac749ac88 [AMDGPU] Add FixupVectorISel pass, currently Supports SREGs in GLOBAL LD/ST
Add a pass to fixup various vector ISel issues.
Currently we handle converting GLOBAL_{LOAD|STORE}_*
and GLOBAL_Atomic_* instructions into their _SADDR variants.
This involves feeding the sreg into the saddr field of the new instruction.

llvm-svn: 347008
2018-11-16 01:13:34 +00:00
Konstantin Zhuravlyov 7d1532d333 AMDGPU: Fix check lines in fdot2 test:
GCN900 -> GFX900

llvm-svn: 346925
2018-11-15 02:42:04 +00:00
Konstantin Zhuravlyov a25e0524c0 AMDGPU: Enable code object v3 for AMDHSA only
Differential Revision: https://reviews.llvm.org/D54186

llvm-svn: 346923
2018-11-15 02:32:43 +00:00
Nirav Dave 1241dcb3cf Bias physical register immediate assignments
The machine scheduler currently biases register copies to/from
physical registers to be closer to their point of use / def to
minimize their live ranges. This change extends this to also physical
register assignments from immediate values.

This causes a reduction in reduction in overall register pressure and
minor reduction in spills and indirectly fixes an out-of-registers
assertion (PR39391).

Most test changes are from minor instruction reorderings and register
name selection changes and direct consequences of that.

Reviewers: MatzeB, qcolombet, myatsina, pcc

Subscribers: nemanjai, jvesely, nhaehnle, eraman, hiraditya,
  javed.absar, arphaman, jfb, jsji, llvm-commits

Differential Revision: https://reviews.llvm.org/D54218

llvm-svn: 346894
2018-11-14 21:11:53 +00:00
Aakanksha Patil 1a60116b5c AMDGPU: Additional pattern for i16 median3 matching
min(max(a, b), max(min(a, b), c))

Differential Revision: https://reviews.llvm.org/D54494

llvm-svn: 346886
2018-11-14 20:10:41 +00:00
Stanislav Mekhanoshin bcb34ac2ea [AMDGPU] combine extractelement into several selects
An extractelement with non-constant index will be lowered either to
scratch or movrel loop in most cases. This patch converts such
instruction into a set of selects if vector size is not too big.

Differential Revision: https://reviews.llvm.org/D54351

llvm-svn: 346800
2018-11-13 21:18:21 +00:00
Stanislav Mekhanoshin 35de877e8c Fixed DAGTypeLegalizer::SplitVecOp_EXTRACT_VECTOR_ELT i1 handling
Legalizer used to request an ext load from i8 to i1 when promoting
vector element type to i8. Fixed.

Differential Revision: https://reviews.llvm.org/D54440

llvm-svn: 346795
2018-11-13 20:26:27 +00:00
Aakanksha Patil a992c694c6 AMDGPU: Adding more median3 patterns
min(max(a, b), max(min(a, b), c)) -> med3 a, b, c

Differential Revision: https://reviews.llvm.org/D54331

llvm-svn: 346704
2018-11-12 21:04:06 +00:00
Stanislav Mekhanoshin e86c8d33b1 [AMDGPU] Optimize S_CBRANCH_VCC[N]Z -> S_CBRANCH_EXEC[N]Z
Sometimes after basic block placement we end up with a code like:

  sreg = s_mov_b64 -1
  vcc = s_and_b64 exec, sreg
  s_cbranch_vccz

This happens as a join of a block assigning -1 to a saved mask and
another block which consumes that saved mask with s_and_b64 and a
branch.

This is essentially a single s_cbranch_execz instruction when moved
into a single new basic block.

Differential Revision: https://reviews.llvm.org/D54164

llvm-svn: 346690
2018-11-12 18:48:17 +00:00
Stanislav Mekhanoshin 5f9513147a Fix MachineInstr::findRegisterUseOperandIdx subreg checks
The function only checks that instruction reads a super-register
containing requested physical register. In case if a sub-register
if being read that is also a use of a super-reg, so added the check.
In particular MI->readsRegister() is broken because of the missing
check. The resulting check is essentially regsOverlap().

Differential Revision: https://reviews.llvm.org/D54128

llvm-svn: 346686
2018-11-12 18:12:28 +00:00
Stanislav Mekhanoshin 26299e2af1 [AMDGPU] Cleanup optimize-if-exec-masking.mir test. NFC.
llvm-svn: 346533
2018-11-09 18:23:39 +00:00
Nicolai Haehnle d5199e39c0 AMDGPU: Add testcase to demonstrate a condition with pre-existing waitcnt
Relevant for https://reviews.llvm.org/D54226.

llvm-svn: 346501
2018-11-09 15:13:12 +00:00
Nicolai Haehnle 6979b70427 Add test case for the regression caused by r344696
(That change has since been reverted.)

Reduced from https://bugs.freedesktop.org/show_bug.cgi?id=108611

llvm-svn: 346423
2018-11-08 18:01:38 +00:00
Stanislav Mekhanoshin 6cc8b2fc65 [AMDGPU] Extend promote alloca vectorization
Promote alloca can vectorize a small array by bitcasting it to a
vector type. Extend vectorization for the case when alloca is
already a vector type. We still want to replace GEPs with an
insert/extract element instructions in this case.

Differential Revision: https://reviews.llvm.org/D54219

llvm-svn: 346376
2018-11-08 00:16:23 +00:00
Nicolai Haehnle bc233f5523 Revert "AMDGPU: Divergence-driven selection of scalar buffer load intrinsics"
This reverts commit r344696 for now (except for some test additions).

See https://bugs.freedesktop.org/show_bug.cgi?id=108611.

llvm-svn: 346364
2018-11-07 21:53:43 +00:00
Matt Arsenault 8ba740a5a8 Allow subclassing ExternalAA
This allows testing AMDGPU alias analysis like any
other alias analysis pass. This fixes the existing
test pointlessly running opt -O3 when it really
just wants to run the one analysis.

Before there was no way to test this using -aa-eval
with opt, since the default constructed pass
is run. The wrapper subclass allows the
default constructor to pass the necessary callback.

llvm-svn: 346353
2018-11-07 20:26:42 +00:00
Matthias Braun 5b7c90b4e2 RegAllocFast: Leave unassigned virtreg entries in map
Set `LiveReg::PhysReg` to zero when freeing a register instead of
removing it from the entry from `LiveRegMap`. This way no iterators get
invalidated and we can avoid passing around and updating iterators all
over the place.

This does not change any allocator decisions. It is not completely NFC
because the arbitrary iteration order through `LiveRegMap` in
`spillAll()` changes so we may get a different order in those spill
sequences (the amount of spills does not change).

This is in preparation of https://reviews.llvm.org/D52010.

llvm-svn: 346298
2018-11-07 06:57:03 +00:00
Yaxun Liu 73bf0af32f AMDGPU: Add an option -disable-promote-alloca-to-lds
Add this option for debugging and providing workaround.

By default it is off so no behavior change in backend.

Differential Revision: https://reviews.llvm.org/D54158

llvm-svn: 346267
2018-11-06 21:28:17 +00:00
Max Kazantsev 69f6dfa0f8 [LICM] Use ICFLoopSafetyInfo in LICM
This patch makes LICM use `ICFLoopSafetyInfo` that is a smarter version
of LoopSafetyInfo that leverages power of Implicit Control Flow Tracking
to keep track of throwing instructions and give less pessimistic answers
to queries related to throws.

The ICFLoopSafetyInfo itself has been introduced in rL344601. This patch
enables it in LICM only.

Differential Revision: https://reviews.llvm.org/D50377
Reviewed By: apilipenko

llvm-svn: 346201
2018-11-06 02:44:49 +00:00
Konstantin Zhuravlyov 108927b944 AMDGPU: Add sram-ecc feature
Differential Revision: https://reviews.llvm.org/D53222

llvm-svn: 346177
2018-11-05 22:44:19 +00:00
Neil Henning 233a02d0ed [AMDGPU] Fix the new atomic optimizer in pixel shaders.
The new atomic optimizer I previously added in D51969 did not work
correctly when a pixel shader was using derivatives, and had helper
lanes active.

To fix this we add an llvm.amdgcn.ps.live call that guards a branch
around the entire atomic operation - ensuring that all helper lanes are
inactive within the wavefront when we compute our atomic results.

I've added a test case that can cause derivatives, and exposes the
problem.

Differential Revision: https://reviews.llvm.org/D53930

llvm-svn: 346128
2018-11-05 12:04:48 +00:00
Matt Arsenault 8e0269ba0b AMDGPU: Fix assertion with bitcast from i64 constant to v4i16
llvm-svn: 345922
2018-11-02 02:43:55 +00:00
Farhana Aleen 5853762e5a [AMDGPU] Handle the idot8 pattern generated by FE.
Summary: Different variants of idot8 codegen dag patterns are not generated by llvm-tablegen due to a huge
         increase in the compile time. Support the pattern that clang FE generates after reordering the
         additions in integer-dot8 source language pattern.

Author: FarhanaAleen

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D53937

llvm-svn: 345902
2018-11-01 22:48:19 +00:00
Stanislav Mekhanoshin 222e9c11f7 Check shouldReduceLoadWidth from SimplifySetCC
SimplifySetCC could shrink a load without checking for
profitability or legality of such shink with a target.

Added checks to prevent shrinking of aligned scalar loads
in AMDGPU below dword as scalar engine does not support it.

Differential Revision: https://reviews.llvm.org/D53846

llvm-svn: 345778
2018-10-31 21:24:30 +00:00
Scott Linder 92bb783cfe [SelectionDAG] Handle constant range [0,1) in lowerRangeToAssertZExt
lowerRangeToAssertZExt currently relies on something like EarlyCSE having
eliminated the constant range [0,1). At -O0 this leads to an assert.

Differential Revision: https://reviews.llvm.org/D53888

llvm-svn: 345770
2018-10-31 19:57:36 +00:00
Scott Linder c6c627253d [AMDGPU] Remove FeatureVGPRSpilling
This feature is only relevant to shaders, and is no longer used. When disabled,
lowering of reserved registers for shaders causes a compiler crash.

Remove the feature and add a test for compilation of shaders at OptNone.

Differential Revision: https://reviews.llvm.org/D53829

llvm-svn: 345763
2018-10-31 18:54:06 +00:00
Nicolai Haehnle 814abb59df AMDGPU: Rewrite SILowerI1Copies to always stay on SALU
Summary:
Instead of writing boolean values temporarily into 32-bit VGPRs
if they are involved in PHIs or are observed from outside a loop,
we use bitwise masking operations to combine lane masks in a way
that is consistent with wave control flow.

Move SIFixSGPRCopies to before this pass, since that pass
incorrectly attempts to move SGPR phis to VGPRs.

This should recover most of the code quality that was lost with
the bug fix in "AMDGPU: Remove PHI loop condition optimization".

There are still some relevant cases where code quality could be
improved, in particular:

- We often introduce redundant masks with EXEC. Ideally, we'd
  have a generic computeKnownBits-like analysis to determine
  whether masks are already masked by EXEC, so we can avoid this
  masking both here and when lowering uniform control flow.

- The criterion we use to determine whether a def is observed
  from outside a loop is conservative: it doesn't check whether
  (loop) branch conditions are uniform.

Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b

Reviewers: arsenm, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, mgorny, yaxunl, dstuttard, t-tye, eraman, llvm-commits

Differential Revision: https://reviews.llvm.org/D53496

llvm-svn: 345719
2018-10-31 13:27:08 +00:00
Nicolai Haehnle 28212cc689 AMDGPU: Remove PHI loop condition optimization
Summary:
The optimization to early break out of loops if all threads are dead was
never fully implemented.

But the PHI node analyzing is actually causing a number of problems, so
remove all the extra code for it.

(This does actually regress code quality in a few places because it
 ends up relying more heavily on phi's of i1, which we don't do a
 great job with. However, since it fixes real bugs in the wild, we
 should take this change. I have some prototype changes to improve
 i1 lowering in general -- not just for control flow -- which should
 help recover the code quality, I just need to make those changes
 fit for general consumption. -- Nicolai)

Change-Id: I6fc6c6c8961857ac6009fcfb9f7e5e48dc23fbb1
Patch-by: Christian König <christian.koenig@amd.com>

Reviewers: arsenm, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53359

llvm-svn: 345718
2018-10-31 13:26:48 +00:00
Neil Henning 63718b214a [AMDGPU] support image load/store a16
Our a16 support was only enabled for sample/gather and buffer
load/store, but not for image load/store operations (which take an i16
as the pixel index rather than a half).

Fix our isel lowering and add test cases to prove it out.

Differential Revision: https://reviews.llvm.org/D53750

llvm-svn: 345710
2018-10-31 10:34:48 +00:00
Matthias Braun a83403892a MachineOperand/MIParser: Do not print debug-use flag, infer it
The debug-use flag must be set exactly for uses on DBG_VALUEs.  This is
so obvious that it can be trivially inferred while parsing. This will
reduce noise when printing while omitting an information that has little
value to the user.

The parser will keep recognizing the flag for compatibility with old
`.mir` files.

Differential Revision: https://reviews.llvm.org/D53903

llvm-svn: 345671
2018-10-30 23:28:27 +00:00
Konstantin Zhuravlyov 2d22d24ac4 Revert r345542: AMDGPU: Enable code object v3 by default
It breaks mesa.

llvm-svn: 345662
2018-10-30 22:02:40 +00:00
Jonas Paulsson 611b533f1d [SchedModel] Fix for read advance cycles with implicit pseudo operands.
The SchedModel allows the addition of ReadAdvances to express that certain
operands of the instructions are needed at a later point than the others.

RegAlloc may add pseudo operands that are not part of the instruction
descriptor, and therefore cannot have any read advance entries. This meant
that in some cases the desired read advance was nullified by such a pseudo
operand, which still had the original latency.

This patch fixes this by making sure that such pseudo operands get a zero
latency during DAG construction.

Review: Matthias Braun, Ulrich Weigand.
https://reviews.llvm.org/D49671

llvm-svn: 345606
2018-10-30 15:04:40 +00:00
Simon Pilgrim 858303b827 [SelectionDAG] Add FoldBUILD_VECTOR to simplify new BUILD_VECTOR nodes
Similar to FoldCONCAT_VECTORS, this patch adds FoldBUILD_VECTOR to simplify cases that can avoid the creation of the BUILD_VECTOR - if all the operands are UNDEF or if the BUILD_VECTOR simplifies to a copy.

This exposed an assumption in some AMDGPU code that getBuildVector was guaranteed to be a BUILD_VECTOR node that I've tried to handle.	
	
Differential Revision: https://reviews.llvm.org/D53760

llvm-svn: 345578
2018-10-30 10:32:11 +00:00
Matt Arsenault abc4f29f9c AMDGPU: Remove custom BUILD_VECTOR combine
This was looping in a testcase and removing it
now slightly improves a test.

llvm-svn: 345560
2018-10-30 01:37:59 +00:00
Matt Arsenault b0b741efb8 AMDGPU: Use scavengeRegisterBackwards
llvm-svn: 345559
2018-10-30 01:33:14 +00:00
Konstantin Zhuravlyov 5cb950200c AMDGPU: Enable code object v3 by default
Differential Revision: https://reviews.llvm.org/D53525

llvm-svn: 345542
2018-10-29 21:07:27 +00:00
Matthias Braun c045c557b0 Relax fast register allocator related test cases; NFC
- Relex hard coded registers and stack frame sizes
- Some test cleanups
- Change phi-dbg.ll to match on mir output after phi elimination instead
  of going through the whole codegen pipeline.

This is in preparation for https://reviews.llvm.org/D52010
I'm committing all the test changes upfront that work before and after
independently.

llvm-svn: 345532
2018-10-29 20:10:42 +00:00
Stanislav Mekhanoshin 79080ecd82 [AMDGPU] Match v_swap_b32
Differential Revision: https://reviews.llvm.org/D52677

llvm-svn: 345514
2018-10-29 17:26:01 +00:00
Scott Linder 11ef7984b0 [AMDGPU] Add a pass to promote bitcast calls
AMDGPU currently only supports direct calls, but at lower optimisation levels it
fails to lower statically direct calls which appear indirect due to a bitcast.

Add a pass to visit all CallSites and use CallPromotionUtils to "devirtualize"
calls.

Differential Revision: https://reviews.llvm.org/D52741

llvm-svn: 345382
2018-10-26 13:18:36 +00:00
Tim Renouf 2a1b1d94b6 [AMDGPU] Defined gfx909 Raven Ridge 2
Differential Revision: https://reviews.llvm.org/D53418

Change-Id: Ie3d054f2e956c2768988c0f4c0ffd29a47294eef
llvm-svn: 345120
2018-10-24 08:14:07 +00:00
Matt Arsenault 687ec75d10 DAG: Change behavior of fminnum/fmaxnum nodes
Introduce new versions that follow the IEEE semantics
to help with legalization that may need quieted inputs.

There are some regressions from inserting unnecessary
canonicalizes when these are matched from fast math
fcmp + select which should be fixed in a future commit.

llvm-svn: 344914
2018-10-22 16:27:27 +00:00
Changpeng Fang f95f763ea5 AMDGPU: Add support pattern for SUB of one bit
Summary:
  Add selection patterns to support one bit Sub.

Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D52946

llvm-svn: 344815
2018-10-19 21:09:21 +00:00
Nicolai Haehnle 4821937d2e AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI
Summary:
To workaround a hardware issue in the (base + offset) calculation
when base is negative. The impact on code quality should be limited
since SILoadStoreOptimizer still runs afterwards and is able to
combine loads/stores based on known sign information.

This fixes visible corruption in Hitman on SI (easily reproducible
by running benchmark mode).

Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923

Reviewers: arsenm, mareko

Subscribers: jholewinski, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53160

llvm-svn: 344698
2018-10-17 15:37:48 +00:00
Nicolai Haehnle 0823050b9f StructurizeCFG: Simplify inserted PHI nodes
Summary:
This improves subsequent divergence analysis in some cases.

Change-Id: I5e95e7ec7fd3fa80d414d1a53a02fea23e3d67d3

Reviewers: arsenm, rampitec

Subscribers: jvesely, wdng, llvm-commits

Differential Revision: https://reviews.llvm.org/D53316

llvm-svn: 344697
2018-10-17 15:37:41 +00:00
Nicolai Haehnle c4a2ff0950 AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19

Reviewers: arsenm, alex-t, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53283

llvm-svn: 344696
2018-10-17 15:37:30 +00:00
Nicolai Haehnle 6d92ca61f9 StructurizeCFG,AMDGPU: Test case of a redundant phi and codegen consequences
Change-Id: I9681f9e41ca30f82576f3d1f965c3a550a34b171
llvm-svn: 344569
2018-10-15 22:37:46 +00:00
Konstantin Zhuravlyov 94dfcc2eb2 AMDGPU: Generate .amdgcn_target for object code v3
Differential Revision: https://reviews.llvm.org/D53221

llvm-svn: 344552
2018-10-15 20:37:47 +00:00
Nicolai Haehnle 1bb242e201 AMDGPU: Test showing a scalar buffer load deficiency
Change-Id: I5b64a565f22a8482aa0712488d85e45163ac3d12
llvm-svn: 344506
2018-10-15 11:37:04 +00:00
Tom Stellard a894043910 Revert "AMDGPU/GlobalISel: Implement select for G_INSERT"
This reverts commit r344310.

The test case was failing on some bots.

llvm-svn: 344317
2018-10-11 23:36:46 +00:00
Tom Stellard 4733be6e7b AMDGPU/GlobalISel: Implement select for G_INSERT
Reviewers: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53116

llvm-svn: 344310
2018-10-11 22:49:54 +00:00
Scott Linder 823549a6ec [AMDGPU] Legalize VGPR Rsrc operands for MUBUF instructions
Emit a waterfall loop in the general case for a potentially-divergent Rsrc
operand. When practical, avoid this by using Addr64 instructions.

Recommits r341413 with changes to update the MachineDominatorTree when present.

Differential Revision: https://reviews.llvm.org/D51742

llvm-svn: 343992
2018-10-08 18:47:01 +00:00
Tom Stellard 14d8807d9a AMDGPU/GlobalISel: Select amdgcn.cvt.pkrtz to 64-bit instructions
Summary: The 32-bit variants do not exist on VI+.

Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52958

llvm-svn: 343985
2018-10-08 17:49:29 +00:00
Nicolai Haehnle ea36cd595c AMDGPU: Future-proof {raw,struct}.buffer.atomic intrinsics
Summary:
The ISA is really supposed to support 64-bit atomics as well,
so the data type should be an overload.

Mesa doesn't use these atomics yet, in fact I noticed this
issue while trying to use the atomics from Mesa.

Change-Id: I77f58317a085a0d3eb933cc7e99308c48a19f83e

Reviewers: tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D52291

llvm-svn: 343978
2018-10-08 16:53:48 +00:00
Neil Henning 6641657453 [AMDGPU] Add an AMDGPU specific atomic optimizer.
This commit adds a new IR level pass to the AMDGPU backend to perform
atomic optimizations. It works by:

- Running through a function and finding atomicrmw add/sub or uses of
  the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform,
  record the value to be optimized.
- Run through the atomic operations we can optimize and, depending on
  whether the value is uniform/divergent use wavefront wide operations
  (DPP in the divergent case) to calculate the total amount to be
  atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic
  operation, reducing the total number of atomic operations in flight.
- Lastly we recombine the result from the single lane to each lane of
  the wavefront, and calculate our individual lanes offset into the
  final result.

Differential Revision: https://reviews.llvm.org/D51969

llvm-svn: 343973
2018-10-08 15:49:19 +00:00
Tom Stellard 7c65078f04 AMDGPU/GlobalISel: Add support for G_INTTOPTR
Summary: This is a no-op.

Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52916

llvm-svn: 343839
2018-10-05 04:34:09 +00:00
Konstantin Zhuravlyov aa067cb9fb AMDGPU: Rename isAmdCodeObjectV2 -> isAmdHsaOrMesa
The isAmdCodeObjectV2 is a misleading name which actually checks whether the os
is amdhsa or mesa.

Also add a test to make sure we do not generate old kernel header for code
object v3.

Differential Revision: https://reviews.llvm.org/D52897

llvm-svn: 343813
2018-10-04 21:02:16 +00:00
Farhana Aleen 4bc597bff5 [AMDGPU] Match signed dot4/8 pattern.
Summary: This patch matches signed dot4 and dot8 pattern.

Author: FarhanaAleen

Reviewed By: msearles

Differential Revision: https://reviews.llvm.org/D52520

llvm-svn: 343798
2018-10-04 16:57:37 +00:00
Tim Renouf a37679d67b [AMDGPU] Fix for negative offsets in buffer/tbuffer intrinsics
Summary:
The new buffer/tbuffer intrinsics handle an out-of-range immediate
offset by moving/adding offset&-4096 to a vgpr, leaving an in-range
immediate offset, with a chance of the move/add being CSEd for similar
loads/stores.

However it turns out that a negative offset in a vgpr is illegal, even
if adding the immediate offset makes it legal again.

Therefore, this commit disables the offset&-4096 thing if the offset is
negative.

Differential Revision: https://reviews.llvm.org/D52683

Change-Id: Ie02f0a74f240a138dc2a29d17cfbd9e350e4ed13
llvm-svn: 343672
2018-10-03 10:29:43 +00:00
Fangrui Song 3d76d36059 [AMDGPU] Rename pass "isel" to "amdgpu-isel"
Summary: The AMDGPU target specific pass "isel" is a misleading name.

Reviewers: tstellar, echristo, javed.absar, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D52759

llvm-svn: 343659
2018-10-03 03:38:22 +00:00
Matt Arsenault 635d479322 AMDGPU: Always run AMDGPUAlwaysInline
Even if calls are enabled, it still needs to be run
for forcing inline of functions that use LDS.

llvm-svn: 343657
2018-10-03 02:47:25 +00:00
Matt Arsenault ab41193312 AMDGPU: Expand atomicrmw nand in IR
llvm-svn: 343559
2018-10-02 03:50:56 +00:00
Bjorn Pettersson c2fc53ac90 [PHIElimination] Lower a PHI node with only undef uses as IMPLICIT_DEF
Summary:
The lowering of PHI nodes used to detect if all inputs originated
from IMPLICIT_DEF's. If so the PHI node was replaced by an
IMPLICIT_DEF. Now we also consider undef uses when checking the
inputs. So if all inputs are implicitly defined or undef we
lower the PHI to an IMPLICIT_DEF. This makes
PHIElimination::LowerPHINode more consistent as it checks
both implicit and undef properties at later stages.

Reviewers: MatzeB, tstellar

Reviewed By: MatzeB

Subscribers: jvesely, nhaehnle, llvm-commits

Differential Revision: https://reviews.llvm.org/D52558

llvm-svn: 343417
2018-09-30 17:26:58 +00:00
Stanislav Mekhanoshin b080adfc0c [AMDGPU] Fold copy (copy vgpr)
This allows to reduce a number of used VGPRs in some cases.

Differential Revision: https://reviews.llvm.org/D52577

llvm-svn: 343249
2018-09-27 18:55:20 +00:00
Stanislav Mekhanoshin 8dfcd83371 [AMDGPU] Fix ds combine with subregs
Differential Revision: https://reviews.llvm.org/D52522

llvm-svn: 343047
2018-09-25 23:33:18 +00:00
Changpeng Fang 6f4922ccc9 AMDGPU: Add Selection patterns to support add of one bit.
Summary:
  We generate s_xor to lower add of i1s in general cases, and s_not to
lower add with a one-bit imm of -1 (true).

Reviewers:
  rampitec

Differential Revision:
  https://reviews.llvm.org/D52518

llvm-svn: 343030
2018-09-25 21:21:18 +00:00
Daniil Fukalov 349b5943b4 [RegAllocGreedy] avoid using physreg candidates that cannot be correctly spilled
For the AMDGPU target if a MBB contains exec mask restore preamble, SplitEditor may get state when it cannot insert a spill instruction.

E.g. for a MIR

bb.100:
    %1 = S_OR_SAVEEXEC_B64 %2, implicit-def $exec, implicit-def $scc, implicit $exec
and if the regalloc will try to allocate a virtreg to the physreg already assigned to virtreg %1, it should insert spill instruction before the S_OR_SAVEEXEC_B64 instruction.
But it is not possible since can generate incorrect code in terms of exec mask.

The change makes regalloc to ignore such physreg candidates.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D52052

llvm-svn: 343004
2018-09-25 18:37:38 +00:00
Sameer Sahasrabuddhe b4f2d1cb68 [AMDGPU] restore r342722 which was reverted with r342743
[AMDGPU] lower-switch in preISel as a workaround for legacy DA

Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.

As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.

llvm-svn: 342956
2018-09-25 09:39:21 +00:00
Stanislav Mekhanoshin 14fefe7f8e [AMDGPU] Remove useless check from test. NFC.
The check for assignment of zero is practically useless
while the assignment moves around with different scheduling.

llvm-svn: 342935
2018-09-25 01:24:54 +00:00
Matt Arsenault f432011d33 AMDGPU: Fix private handling for allowsMisalignedMemoryAccesses
If the alignment is at least 4, this should report true.

Something still seems off with how < 4-byte types are
handled here though.

Fixing this seems to change how some combines get
to where they get, but somehow isn't changing the net
result.

llvm-svn: 342879
2018-09-24 13:18:15 +00:00
Sameer Sahasrabuddhe 0807e94951 revert changes from r342722
"[AMDGPU] lower-switch in preISel as a workaround for legacy DA"

This broke regression tests. The first breakage was noticed here:
http://lab.llvm.org:8011/builders/lld-x86_64-freebsd/builds/23549

llvm-svn: 342743
2018-09-21 16:31:51 +00:00
Sameer Sahasrabuddhe 2de7653fd5 [AMDGPU] lower-switch in preISel as a workaround for legacy DA
Summary:
The default target of the switch instruction may sometimes be an
"unreachable" block, when it is guaranteed that one of the cases is
always taken. The dominator tree concludes that such a switch
instruction does not have an immediate post dominator. This confuses
divergence analysis, which is unable to propagate sync dependence to
the targets of the switch instruction.

As a workaround, the AMDGPU target now invokes lower-switch as a
preISel pass. LowerSwitch is designed to handle the unreachable
default target correctly, allowing the divergence analysis to locate
the correct immediate dominator of the now-lowered switch.

Reviewers: arsenm, nhaehnle

Reviewed By: nhaehnle

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits, simoll

Differential Revision: https://reviews.llvm.org/D52221

llvm-svn: 342722
2018-09-21 11:26:55 +00:00
Alexander Timofeev 36617f0160 [AMDGPU] Divergence driven instruction selection. Part 1.
Summary: This change is the first part of the AMDGPU target description
    change. The aim of it is the effective splitting the vector and scalar
    flows at the selection stage. Selection uses predicate functions based
    on the framework implemented earlier - https://reviews.llvm.org/D35267

    Differential revision: https://reviews.llvm.org/D52019

    Reviewers: rampitec

llvm-svn: 342719
2018-09-21 10:31:22 +00:00
Carl Ritson 6b8d75425e [AMDGPU] Add instruction selection for i1 to f16 conversion
Summary:
This is required for GPUs with 16 bit instructions where f16 is a
legal register type and hence int_to_fp i1 to f16 is not lowered
by legalizing.

Reviewers: arsenm, nhaehnle

Reviewed By: nhaehnle

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52018

Change-Id: Ie4c0fd6ced7cf10ad612023c6879724d9ded5851
llvm-svn: 342558
2018-09-19 16:32:12 +00:00
Farhana Aleen f5a2848376 [AMDGPU] Match udot8 pattern
Summary: D.u32 = S0.u4[0] * S1.u4[0] +

         S0.u4[1] * S1.u4[1] +
         S0.u4[2] * S1.u4[2] +
         S0.u4[3] * S1.u4[3] +
         S0.u4[4] * S1.u4[4] +
         S0.u4[5] * S1.u4[5] +
         S0.u4[6] * S1.u4[6] +
         S0.u4[7] * S1.u4[7] +
         S2.u32

Author: FarhanaAleen

Reviewed By: arsenm, nhaehnle

Differential Revision: https://reviews.llvm.org/D51947

llvm-svn: 342497
2018-09-18 16:59:48 +00:00
Matt Arsenault ebf46143ea AMDGPU: Don't form fmed3 if it will require materialization
If there is a single use constant, it can be folded into the
min/max, but not into med3.

llvm-svn: 342443
2018-09-18 02:34:54 +00:00
Matt Arsenault 9d49c449ec AMDGPU: Expand vector canonicalizes
llvm-svn: 342439
2018-09-18 01:51:33 +00:00
David Stuttard 20de3e99b5 [AMDGPU] Ensure trig range reduction only used for subtargets that require it
Summary:
GFX9 and above support sin/cos instructions with a greater range and thus don't
require a fract instruction prior to invocation.

Added a subtarget feature to reflect this and added code to take advantage of
expanded range on GFX9+

Also updated the tests to check correct behaviour

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D51933

Change-Id: I1c1f1d3726a5ae32116646ca5cfa1ab4ef69e5b0
llvm-svn: 342222
2018-09-14 10:27:19 +00:00
Matt Arsenault ff987ac6ea AMDGPU: Fix not preserving alignent in call setups
If an argument was passed on the stack, this
was using the default alignment.

I'm not sure there's an observable change from this. This
was observable due to bugs in expansion of unaligned
loads and stores, but since that is fixed I don't think
this matters much.

llvm-svn: 342133
2018-09-13 12:14:31 +00:00
Matt Arsenault 842cda6312 DAG: Fix expansion of unaligned FP loads and stores
This was trying to scalarizing a scalar FP type,
resulting in an assert.

Fixes unaligned f64 stack stores for AMDGPU.

llvm-svn: 342132
2018-09-13 12:14:23 +00:00
Matt Arsenault 9de2fb58fa AMDGPU: Fix some outdated datalayouts in tests
llvm-svn: 342131
2018-09-13 11:56:28 +00:00
Alexander Timofeev 2fb44808b1 [AMDGPU] Preliminary patch for divergence driven instruction selection. Load offset inlining pattern changed.
Differential revision: https://reviews.llvm.org/D51975

    Reviewers: rampitec

llvm-svn: 342115
2018-09-13 06:34:56 +00:00
Konstantin Zhuravlyov 71e43ee47d AMDGPU: Re-apply r341982 after fixing the layering issue
Move isa version determination into TargetParser.

Also switch away from target features to CPU string when
determining isa version. This fixes an issue when we
output wrong isa version in the object code when features
of a particular CPU are altered (i.e. gfx902 w/o xnack
used to result in gfx900).

llvm-svn: 342069
2018-09-12 18:50:47 +00:00
Ilya Biryukov 95066496d0 Revert "AMDGPU: Move isa version and EF_AMDGPU_MACH_* determination into TargetParser."
This reverts commit r341982.

The change introduced a layering violation. Reverting to unbreak
our integrate.

llvm-svn: 342023
2018-09-12 07:05:30 +00:00
Konstantin Zhuravlyov 941615e4c8 AMDGPU: Move isa version and EF_AMDGPU_MACH_* determination
into TargetParser.

Also switch away from target features to CPU string when
determining isa version. This fixes an issue when we
output wrong isa version in the object code when features
of a particular CPU are altered (i.e. gfx902 w/o xnack
used to result in gfx900).

Differential Revision: https://reviews.llvm.org/D51890

llvm-svn: 341982
2018-09-11 18:56:51 +00:00
Alexander Timofeev db7ee7660a [AMDGPU] Preliminary patch for divergence driven instruction selection. Immediate selection predicate changed
Differential revision: https://reviews.llvm.org/D51734
Reviewers: rampitec

llvm-svn: 341928
2018-09-11 11:56:50 +00:00
Matt Arsenault d0cf1b26d4 AMDGPU: Fix r600 test
llvm-svn: 341898
2018-09-11 04:39:16 +00:00
Matt Arsenault 99c780159d AMDGPU: Don't error on out of bounds address spaces
We should never abort on valid IR. The most reasonable
interpretation of an arbitrary address space pointer is
probably some kind of special subset of global memory.

llvm-svn: 341894
2018-09-11 04:00:41 +00:00
Alexander Timofeev 20cbe6f319 [AMDGPU] Preliminary patch for divergence driven instruction selection. Inline immediate move to V_MADAK_F32.
Differential revision: https://reviews.llvm.org/D51586

    Reviewer: rampitec

llvm-svn: 341843
2018-09-10 16:42:49 +00:00
Matt Arsenault 7f6dc597d3 AMDGPU: Stop reporting is-noop addrspacecast for constant 32-bit
This will require something to cast. Before this would eliminate
the cast, which would result in copies of $noreg.

llvm-svn: 341803
2018-09-10 11:59:27 +00:00
Matt Arsenault 57b5966dad DAG: Handle odd vector sizes in calling conv splitting
This already worked if only one register piece was used,
but didn't if a type was split into multiple, unequal
sized pieces.

Fixes not splitting 3i16/v3f16 into two registers for
AMDGPU.

This will also allow fixing the ABI for 16-bit vectors
in a future commit so that it's the same for all subtargets.

llvm-svn: 341801
2018-09-10 11:49:23 +00:00
Carl Ritson f898edd117 [AMDGPU] Prevent sequences of non-instructions disrupting GCNHazardRecognizer wait state counting
Summary:
This fixes a bug where a large number of implicit def instructions can fill the GCNHazardRecognizer lookahead buffer causing required NOPs to not be inserted.

Reviewers: nhaehnle, arsenm

Reviewed By: arsenm

Subscribers: sheredom, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D51726

Change-Id: Ie75338f94de704ee5816b05afd0c922c6748a95b
llvm-svn: 341798
2018-09-10 10:14:48 +00:00
Matt Arsenault 72d27f5525 AMDGPU: Fix tests using old number for constant address space
llvm-svn: 341770
2018-09-10 02:54:25 +00:00
Matt Arsenault d77fcc2a92 AMDGPU: Use GOT PSV since it has an address space now
llvm-svn: 341768
2018-09-10 02:23:39 +00:00
Matt Arsenault b998674610 AMDGPU: Don't abort on unknown addrspace argument
llvm-svn: 341767
2018-09-10 02:23:30 +00:00
Alexander Timofeev a805c96c65 [AMDGPU] Preliminary patch for divergence driven instruction selection. Fold immediate SMRD offset.
Differential revision: https://reviews.llvm.org/D51610

Reviewer: rampitec
llvm-svn: 341636
2018-09-07 09:05:34 +00:00
Scott Linder 834cbc645c Revert r341413
Causes a regression in expensive checks.

llvm-svn: 341589
2018-09-06 21:38:56 +00:00
Scott Linder dfe089dfd1 [AMDGPU] Legalize VGPR Rsrc operands for MUBUF instructions
Emit a waterfall loop in the general case for a potentially-divergent Rsrc
operand. When practical, avoid this by using Addr64 instructions.

Differential Revision: https://reviews.llvm.org/D50982

llvm-svn: 341413
2018-09-04 21:50:47 +00:00
Matt Arsenault 813613c494 AMDGPU: Fix DAG divergence not reporting flat loads
Match behavior in DAG of r340343

llvm-svn: 341393
2018-09-04 18:58:19 +00:00