Commit Graph

2991 Commits

Simon Pilgrim 9831d4058c Fix -Wunused-variable warning. NFCI.
llvm-svn: 349265
2018-12-15 12:25:22 +00:00
Florian Hahn abe32c9125 [SILoadStoreOptimizer] Use std::abs to avoid truncation.
Using regular abs() causes the following warning:

error: absolute value function 'abs' given an argument of type 'int64_t' (aka 'long') but has parameter of type 'int' which may cause truncation of value [-Werror,-Wabsolute-value]
        (uint32_t)abs(Dist) > MaxDist) {
                  ^
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp:1369:19: note: use function 'std::abs' instead

which causes a bot to fail:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/18284/steps/bootstrap%20clang/logs/stdio
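A minimal sketch of the kind of change involved (the helper name and comparison shape here are illustrative, not the actual SILoadStoreOptimizer code):

  #include <cstdint>
  #include <cstdlib>

  // ::abs(int) would truncate a 64-bit distance before taking its magnitude;
  // std::abs provides long/long long overloads that keep the full width.
  bool exceedsMaxDist(int64_t Dist, uint32_t MaxDist) {
    // Before (warns under -Wabsolute-value): (uint32_t)abs(Dist) > MaxDist
    return (uint32_t)std::abs(Dist) > MaxDist;
  }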

llvm-svn: 349224
2018-12-15 01:32:58 +00:00
Farhana Aleen ce095c564a [AMDGPU] Promote constant offset to the immediate by finding a new base with 13-bit constant offset from nearby instructions.
Summary: Promote a constant offset to the immediate by recomputing the relative 13-bit offset from nearby instructions.
 E.g.
  s_movk_i32 s0, 0x1800
  v_add_co_u32_e32 v0, vcc, s0, v2
  v_addc_co_u32_e32 v1, vcc, 0, v6, vcc

  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[0:1], off
  =>
  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[5:6], off offset:2048

Author: FarhanaAleen

Reviewed By: arsenm, rampitec

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D55539

llvm-svn: 349196
2018-12-14 21:13:14 +00:00
Aakanksha Patil bc568766b2 Revert r348971: [AMDGPU] Support for "uniform-work-group-size" attribute
This patch breaks RADV (and probably RadeonSI as well)

llvm-svn: 349084
2018-12-13 21:23:12 +00:00
Matt Arsenault 934e534c47 AMDGPU/GlobalISel: Legalize/regbankselect block_addr
llvm-svn: 349081
2018-12-13 20:34:15 +00:00
Matt Arsenault 577b9fc543 AMDGPU/GlobalISel: Legalize f64 fadd/fmul
llvm-svn: 349014
2018-12-13 08:27:48 +00:00
Matt Arsenault f38f483bef AMDGPU/GlobalISel: RegBankSelect some simple operations
llvm-svn: 349012
2018-12-13 08:23:51 +00:00
Stanislav Mekhanoshin d933c2ced7 [AMDGPU] Fix build failure, second attempt
Some compilers complain that the variable is captured and some
complain when it is not. Switch to [&].
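A hedged, self-contained illustration of the portability problem (CondReg is only a stand-in local here, not the actual pass code):

  #include <cstdio>

  int main() {
    const unsigned CondReg = 5; // illustrative constant-initialized local
    // [CondReg] { ... } -> clang: lambda capture 'CondReg' is not required
    //                      to be captured for this use
    // []        { ... } -> other compilers: 'CondReg' is not captured
    auto Use = [&] { return CondReg; }; // capture-default sidesteps both
    std::printf("%u\n", Use());
    return 0;
  }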

llvm-svn: 349006
2018-12-13 05:52:11 +00:00
Stanislav Mekhanoshin 5225746e03 [AMDGPU] Fix build failure
Fixed error 'lambda capture 'CondReg' is not required to be captured
for this use'.

llvm-svn: 349005
2018-12-13 05:21:25 +00:00
Stanislav Mekhanoshin 6071e1aa58 [AMDGPU] Simplify negated condition
Optimize sequence:

  %sel = V_CNDMASK_B32_e64 0, 1, %cc
  %cmp = V_CMP_NE_U32 1, %sel
  $vcc = S_AND_B64 $exec, %cmp
  S_CBRANCH_VCC[N]Z
=>
  $vcc = S_ANDN2_B64 $exec, %cc
  S_CBRANCH_VCC[N]Z

It is the negation pattern inserted by DAGCombiner::visitBRCOND() via
rebuildSetCC().

Differential Revision: https://reviews.llvm.org/D55402

llvm-svn: 349003
2018-12-13 03:17:40 +00:00
Aakanksha Patil 729309cc89 [AMDGPU] Support for "uniform-work-group-size" attribute
Updated the annotate-kernel-features pass to support propagation of the uniform-work-group-size attribute from the kernel to the called functions. Once this pass is run, all kernels, even the ones which initially did not have the attribute, will be able to indicate whether or not they have a uniform work group size, depending on the value of the attribute.
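A rough sketch of that propagation using the public LLVM C++ API; this is not the actual pass, and the helper name and the default value chosen for kernels without the attribute are assumptions:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Copy the kernel's "uniform-work-group-size" string attribute onto every
  // directly called function, assuming "false" when the kernel lacks it.
  void propagateUniformWorkGroupSize(Function &Kernel) {
    StringRef Val = "false";
    if (Kernel.hasFnAttribute("uniform-work-group-size"))
      Val = Kernel.getFnAttribute("uniform-work-group-size").getValueAsString();
    for (BasicBlock &BB : Kernel)
      for (Instruction &I : BB)
        if (auto *CI = dyn_cast<CallInst>(&I))
          if (Function *Callee = CI->getCalledFunction())
            Callee->addFnAttr("uniform-work-group-size", Val);
  }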

Differential Revision: https://reviews.llvm.org/D50200

llvm-svn: 348971
2018-12-12 20:49:17 +00:00
Scott Linder f5b36e56fb [AMDGPU] Emit MessagePack HSA Metadata for v3 code object
Continue to present HSA metadata as YAML in ASM and when output by tools
(e.g. llvm-readobj), but encode it in MessagePack in the code object.

Differential Revision: https://reviews.llvm.org/D48179

llvm-svn: 348963
2018-12-12 19:39:27 +00:00
Neil Henning 76504a4c5e [AMDGPU] Extend the SI Load/Store optimizer to combine more things.
I've extended the load/store optimizer to be able to produce dwordx3
loads and stores. This change allows many more loads and stores to be
combined, and results in much more optimal code for our hardware.

Differential Revision: https://reviews.llvm.org/D54042

llvm-svn: 348937
2018-12-12 16:15:21 +00:00
Piotr Sobczak 3732b4ce25 [AMDGPU] Set metadata access for explicit section
Summary:
This patch provides a means to set the Metadata section kind
for a global variable if its explicit section name is
prefixed with ".AMDGPU.metadata.".
This could be useful for making the global variable go to
an ELF section without any section flags set.
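A hedged C++ sketch of how a frontend might request this; the global's name and payload are invented, and only the ".AMDGPU.metadata." prefix comes from the description above:

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Create a global whose explicit section name carries the magic prefix,
  // so the backend can emit it into a metadata (flag-less) ELF section.
  GlobalVariable *emitMetadataGlobal(Module &M) {
    Constant *Init = ConstantDataArray::getString(M.getContext(), "payload");
    auto *GV = new GlobalVariable(M, Init->getType(), /*isConstant=*/true,
                                  GlobalValue::InternalLinkage, Init, "note");
    GV->setSection(".AMDGPU.metadata.note");
    return GV;
  }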

Reviewers: dstuttard, tpr, kzhuravl, nhaehnle, t-tye

Reviewed By: dstuttard, kzhuravl

Subscribers: llvm-commits, arsenm, jvesely, wdng, yaxunl, t-tye

Differential Revision: https://reviews.llvm.org/D55267

llvm-svn: 348922
2018-12-12 11:20:04 +00:00
Amara Emerson 5ec146046c [GlobalISel] Restrict G_MERGE_VALUES capability and replace with new opcodes.
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.

This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.

Differential Revision: https://reviews.llvm.org/D53629

llvm-svn: 348788
2018-12-10 18:44:58 +00:00
Neil Henning e448351b77 [AMDGPU] Change the l1 flush instruction for AMDPAL/MESA3D.
This commit changes which L1 flush instruction is used for AMDPAL and
MESA3D workloads, so that the entire L1 cache is flushed instead of just
the volatile lines.

Differential Revision: https://reviews.llvm.org/D55367

llvm-svn: 348771
2018-12-10 16:35:53 +00:00
Tim Corringham 2faadb15f4 [AMDGPU] Add new Mode Register pass - minor fix
Trivial change to add parentheses to an expression in SIModeRegister.cpp
(committed earlier) to avoid a sanitizer error.

llvm-svn: 348767
2018-12-10 16:23:30 +00:00
Tim Corringham 4c4d2fe280 [AMDGPU] Add new Mode Register pass
A new pass to manage the Mode register.

Currently this just manages the floating point double precision
rounding requirements, but is intended to be easily extended to
encompass all Mode register settings.

The immediate motivation comes from the requirement to use the
round-to-zero rounding mode for the 16 bit interpolation
instructions, where the rounding mode setting is shared between
16 and 64 bit operations.

llvm-svn: 348754
2018-12-10 12:06:10 +00:00
Brian Gesiak b963c5150d [AMDGPU] Fix discarded result of addAttribute
Summary:
`llvm::AttributeList` and `llvm::AttributeSet` are immutable, and so methods
defined on these classes, such as `addAttribute`, return a new immutable
object with the attribute added. In https://reviews.llvm.org/D55217 I attempted
to annotate methods such as `addAttribute` with `LLVM_NODISCARD`, since
calling these methods has no side-effects, and so ignoring the result
that is returned is almost certainly a programmer error.

However, committing the change resulted in new warnings in the AMDGPU target.
The AMDGPU simplify libcalls pass added in https://reviews.llvm.org/D36436
attempts to add the readonly and nounwind attributes to simplified
library functions, but instead calls the `addAttribute` methods and
ignores the result.

Modify the simplify libcalls pass to actually add the nounwind and
readonly attributes. Also update the simplify libcalls test to assert
that these attributes are actually being set.
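A hedged sketch of the pitfall being fixed, not the actual pass code; it just contrasts the discarded-result pattern with forms that actually change the function:

  #include "llvm/IR/Attributes.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  void markSimplifiedLibcall(Function &F) {
    // Silently wrong: AttributeList is immutable, so the modified copy
    // returned here is dropped and F keeps its old attributes.
    (void)F.getAttributes().addAttribute(
        F.getContext(), AttributeList::FunctionIndex, Attribute::ReadOnly);

    // Effective: mutate the function itself (or reassign the returned list
    // via F.setAttributes(...)).
    F.addFnAttr(Attribute::ReadOnly);
    F.addFnAttr(Attribute::NoUnwind);
  }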

Reviewers: rampitec, vpykhtin, rnk

Reviewed By: rampitec

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55435

llvm-svn: 348732
2018-12-09 21:56:50 +00:00
Matt Arsenault b5613ecf17 AMDGPU: Fix offsets for < 4-byte aggregate kernel arguments
We were still using the rounded down offset and alignment even though
they aren't handled because you can't trivially bitcast the loaded
value.

llvm-svn: 348658
2018-12-07 22:12:17 +00:00
Simon Pilgrim 44dfd81d01 Fix unused variable warning. NFCI.
llvm-svn: 348649
2018-12-07 21:44:25 +00:00
Matt Arsenault ce2e053134 AMDGPU: Allow f32 types for llvm.amdgcn.s.buffer.load
llvm-svn: 348625
2018-12-07 18:41:39 +00:00
Matt Arsenault ca8eb0b672 AMDGPU: Remove llvm.SI.tbuffer.store
llvm-svn: 348619
2018-12-07 18:03:47 +00:00
Matt Arsenault 3ff764a944 AMDGPU: Remove llvm.SI.buffer.load.dword
llvm-svn: 348616
2018-12-07 17:46:20 +00:00
Matt Arsenault aa9bcd56b1 AMDGPU: Remove llvm.AMDGPU.kill
This is the last of the old AMDGPU intrinsics.

llvm-svn: 348615
2018-12-07 17:46:16 +00:00
Graham Sellers b297379ef0 [AMDGPU] Shrink scalar AND, OR, XOR instructions
This change attempts to shrink scalar AND, OR and XOR instructions which take an immediate that isn't inlineable.

It performs:
AND s0, s0, ~(1 << n) -> BITSET0 s0, n
OR s0, s0, (1 << n) -> BITSET1 s0, n
AND s0, s1, x -> ANDN2 s0, s1, ~x
OR s0, s1, x -> ORN2 s0, s1, ~x
XOR s0, s1, x -> XNOR s0, s1, ~x

In particular, this catches setting and clearing the sign bit for fabs (and x, 0x7fffffff -> bitset0 x, 31 and or x, 0x80000000 -> bitset1 x, 31).
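A small self-contained check of those sign-bit masks, as plain C++ arithmetic rather than backend code:

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t x = 0x87654321u;
    uint32_t fabsBits = x & ~(1u << 31); // and x, 0x7fffffff -> s_bitset0 x, 31
    uint32_t negBits  = x |  (1u << 31); // or  x, 0x80000000 -> s_bitset1 x, 31
    std::printf("%08x %08x\n", fabsBits, negBits); // prints 07654321 87654321
    return 0;
  }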

llvm-svn: 348601
2018-12-07 15:33:21 +00:00
David Green ca29c271d2 [Targets] Add errors for tiny and kernel codemodel on targets that don't support them
Adds fatal errors for any target that does not support the Tiny or Kernel
code models by rejigging the getEffectiveCodeModel calls.

Differential Revision: https://reviews.llvm.org/D50141

llvm-svn: 348585
2018-12-07 12:10:23 +00:00
Nicolai Haehnle ca4a32945f AMDGPU: Generate VALU ThreeOp Integer instructions
Summary:
Original patch by: Fabian Wahlster <razor@singul4rity.com>

Change-Id: I148f692a88432541fad468963f58da9ddf79fac5

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, b-sumner, llvm-commits

Differential Revision: https://reviews.llvm.org/D51995

llvm-svn: 348488
2018-12-06 14:33:40 +00:00
Valery Pykhtin f479fbba5f [AMDGPU] Partial revert of rL348371: Turn on the DPP combiner by default
Turn the combiner back off, as there are failures, until the issue is fixed.

Differential revision: https://reviews.llvm.org/D55314

llvm-svn: 348487
2018-12-06 14:20:02 +00:00
Valery Pykhtin 5b4db77b13 [AMDGPU]: Turn on the DPP combiner by default
Differential revision: https://reviews.llvm.org/D55314

llvm-svn: 348371
2018-12-05 15:21:17 +00:00
Matt Arsenault 0f414b81cc AMDGPU: Add f32 vectors to SGPR register classes
llvm-svn: 348286
2018-12-04 17:51:36 +00:00
Ron Lieberman 16de4fd2eb [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
S_{ADD|SUB}_U64_PSEUDO instructions are decomposed into VOP3 instruction
pairs: for S_ADD_U64_PSEUDO,
  V_ADD_I32_e64
  V_ADDC_U32_e64
and for S_SUB_U64_PSEUDO,
  V_SUB_I32_e64
  V_SUBB_U32_e64
This decomposition precludes the use of SDWA to encode a constant.
SDWA (sub-dword addressing) is supported on VOP1 and VOP2 instructions,
but not on VOP3 instructions.

We desire to fold the bit-and operand into the instruction encoding
for the V_ADD_I32 instruction. This requires that we transform the
VOP3 into a VOP2 form of the instruction (_e32).
  %19:vgpr_32 = V_AND_B32_e32 255,
      killed %16:vgpr_32, implicit $exec
  %47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
      %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
 %48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
      %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec

which then allows the SDWA encoding and becomes
  %47:vgpr_32 = V_ADD_I32_sdwa
      0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
      implicit-def $vcc, implicit $exec
  %48:vgpr_32 = V_ADDC_U32_e32
      0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec


Differential Revision: https://reviews.llvm.org/D54882

llvm-svn: 348132
2018-12-03 13:04:54 +00:00
Graham Sellers ba559ac058 [AMDGPU] Split 64-Bit XNOR to 64-Bit NOT/XOR
The identity ~(x ^ y) == (~x ^ y) == (x ^ ~y) allows XNOR (XOR/NOT) to turn into NOT/XOR. Handling this case with its own split means we can make the NOT remain in the scalar unit. Previously, we split 64-bit XNOR into two 32-bit XNOR, then lowered. Now, we get three instructions (s_not, v_xor, v_xor) rather than four in the case where either of the sources is a scalar 64-bit.
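A quick sanity check of the identity relied on here, written as ordinary C++ over 64-bit values:

  #include <cassert>
  #include <cstdint>

  int main() {
    uint64_t x = 0x0123456789abcdefULL, y = 0xfedcba9876543210ULL;
    assert(~(x ^ y) == (~x ^ y)); // the NOT can be applied to either input...
    assert(~(x ^ y) == (x ^ ~y)); // ...instead of to the XOR result
    return 0;
  }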

Add test cases to xnor.ll to attempt XNOR Vx, Sy and XNOR Sx, Vy. Also added a test that uses the opposite identity, such that (~x ^ y) on the scalar unit (or vector for gfx906) can generate XNOR. This already worked, but there was no test for it.

Differential: https://reviews.llvm.org/D55071
llvm-svn: 348075
2018-12-01 12:27:53 +00:00
Nicolai Haehnle a7b00058e0 AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19

Reviewers: arsenm, alex-t, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53283

llvm-svn: 348050
2018-11-30 22:55:38 +00:00
Nicolai Haehnle a9cc92c247 AMDGPU: Fix various issues around the VirtReg2Value mapping
Summary:
The VirtReg2Value mapping is crucial for getting consistently
reliable divergence information into the SelectionDAG. This
patch fixes a bunch of issues that lead to incorrect divergence
info and introduces tight assertions to ensure we don't regress:

1. VirtReg2Value is generated lazily; there were some cases where
   a lookup was performed before all relevant virtual registers were
   created, leading to an out-of-sync mapping. Those cases were:

  - Complex code to lower formal arguments that generated CopyFromReg
    nodes from live-in registers (fixed by never querying the mapping
    for live-in registers).

  - Code that generates CopyToReg for formal arguments that are used
    outside the entry basic block (fixed by never querying the
    mapping for Register nodes, which don't need the divergence info
    anyway).

2. For complex values that are lowered to a sequence of registers,
   all registers must be reflected in the VirtReg2Value mapping.

I am not adding any new tests, since I'm not actually aware of any
bugs that these problems are causing with trunk as-is. However,
I recently added a test case (in r346423) which fails when D53283 is
applied without this change. Also, the new assertions should provide
most of the effective test coverage.

There is one test change in sdwa-peephole.ll. The underlying issue
is that since the divergence info is now correct, the DAGISel will
select V_OR_B32 directly instead of S_OR_B32. This leads to an extra
COPY which affects the behavior of MachineLICM in a way that ends up
with the S_MOV_B32 with the constant in a different basic block than
the V_OR_B32, which is presumably what defeats the peephole.

Reviewers: alex-t, arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54340

llvm-svn: 348049
2018-11-30 22:55:29 +00:00
Ron Lieberman f48e43bbf7 [AMDGPU] Disable SReg Global LD/ST, perf regression
Differential Revision: https://reviews.llvm.org/D55093

llvm-svn: 348014
2018-11-30 18:29:17 +00:00
Valery Pykhtin 3d9afa273f [AMDGPU] Combine DPP mov with use instructions (VOP1/2/3)
Introduces DPP pseudo instructions and the pass that combines DPP mov with subsequent uses.

Differential revision: https://reviews.llvm.org/D53762

llvm-svn: 347993
2018-11-30 14:21:56 +00:00
David Stuttard c6603861d8 Revert r347871 "Fix: Add support for TFE/LWE in image intrinsic"
Also revert fix r347876

One of the buildbots was reporting a failure in some relevant tests that I can't
repro or explain at present, so reverting until I can isolate.

llvm-svn: 347911
2018-11-29 20:14:17 +00:00
Graham Sellers 04f7a4d2d2 [AMDGPU] Add and update scalar instructions
This patch adds support for the S_ANDN2 and S_ORN2 32-bit and 64-bit instructions, and adds splits to move them to the vector unit (for which there is no equivalent instruction). It modifies the way the more complex scalar instructions are lowered to vector instructions: they are first broken down into sequences of simpler scalar instructions, which are then lowered through the existing code paths. The pattern for S_XNOR has also been updated to apply the inversion to one input rather than to the output of the XOR, as the result is equivalent and may allow leaving the NOT instruction on the scalar unit.

New tests for NAND, NOR, ANDN2 and ORN2 have been added, and existing tests now hit the new instructions (and have been modified accordingly).

Differential: https://reviews.llvm.org/D54714
llvm-svn: 347877
2018-11-29 16:05:38 +00:00
David Stuttard 535c1af0bf Fix: Add support for TFE/LWE in image intrinsic
My change svn-id: 347871 caused a buildbot failure due to an unused
variable definition (the variable was only used in an assert).

Change-Id: Ia882d18bb6fa79b4d7bbfda422b9ea5d23eab336
llvm-svn: 347876
2018-11-29 15:56:36 +00:00
David Stuttard de02e4b1cc Add support for TFE/LWE in image intrinsics
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero-initialize the error value return. This is needed because
if the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized, which is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (TODO:
re-enable this later and handle it correctly).

There's an additional fix now to avoid a dmask=0

For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.

Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.

The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:

%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
                                      i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1

Differential revision: https://reviews.llvm.org/D48826

Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
llvm-svn: 347871
2018-11-29 15:21:13 +00:00
Nicolai Haehnle 7bed696915 AMDGPU/InsertWaitcnts: Remove the dependence on MachineLoopInfo
Summary:
MachineLoopInfo cannot be relied on for correctness, because it cannot
properly recognize loops in irreducible control flow which can be
introduced by late machine basic block optimization passes. See the new
test case for the reduced form of an example that occurred in practice.

Use a simple fixpoint iteration instead.

In order to facilitate this change, refactor WaitcntBrackets so that it
only tracks pending events and registers, rather than also maintaining
state that is relevant for the high-level algorithm. Various accessor
methods can be removed or made private as a consequence.

Affects (in radv):
- dEQP-VK.glsl.loops.special.{for,while}_uniform_iterations.select_iteration_count_{fragment,vertex}

Fixes: r345719 ("AMDGPU: Rewrite SILowerI1Copies to always stay on SALU")

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54231

llvm-svn: 347853
2018-11-29 11:06:26 +00:00
Nicolai Haehnle ab43bf60fe AMDGPU/InsertWaitcnt: Consistently use uint32_t for scores / time points
Summary:
There is one obsolete reference to using -1 as an indication of "unknown",
but this isn't actually used anywhere.

Using unsigned makes robust wrapping checks easier.

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, llvm-commits, tpr, t-tye, hakzsam

Differential Revision: https://reviews.llvm.org/D54230

llvm-svn: 347852
2018-11-29 11:06:21 +00:00
Nicolai Haehnle f96456c611 AMDGPU/InsertWaitcnt: Remove unused WaitAtBeginning
Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54229

llvm-svn: 347851
2018-11-29 11:06:18 +00:00
Nicolai Haehnle d1f45dad84 AMDGPU/InsertWaitcnts: Simplify pending events tracking
Summary:
Instead of storing the "score" (last time point) of the various relevant
events, only store whether an event is pending or not.

This is sufficient, because whenever only one event of a count type is
pending, its last time point is naturally the upper bound of all time
points of this count type, and when multiple event types are pending,
the count type has gone out of order and an s_waitcnt to 0 is required
to clear any pending event type (and will then clear all pending event
types for that count type).

This also removes the special handling of GDS_GPR_LOCK and EXP_GPR_LOCK.
I do not understand what this special handling ever attempted to achieve.
It has existed ever since the original port from an internal code base,
so my best guess is that it solved a problem related to EXEC handling in
that internal code base.

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54228

llvm-svn: 347850
2018-11-29 11:06:14 +00:00
Nicolai Haehnle ae369d70c3 AMDGPU/InsertWaitcnts: Use foreach loops for inst and wait event types
Summary:
It hides the type casting ugliness, and I happened to have to add a new
such loop (in a later patch).

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54227

llvm-svn: 347849
2018-11-29 11:06:11 +00:00
Nicolai Haehnle 1a94cbb3f5 AMDGPU/InsertWaitcnts: Untangle some semi-global state
Summary:
Reduce the statefulness of the algorithm in two ways:

1. More clearly split generateWaitcntInstBefore into two phases: the
   first one which determines the required wait, if any, without changing
   the ScoreBrackets, and the second one which actually inserts the wait
   and updates the brackets.

2. Communicate pre-existing s_waitcnt instructions using an argument to
   generateWaitcntInstBefore instead of through the ScoreBrackets.

To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.

There are some functional changes:

1. The FIXME for the VCCZ bug workaround was implemented: we only wait for
   SMEM instructions as required instead of waiting on all counters.

2. We now properly track pre-existing waitcnt's in all cases, which leads
   to less conservative waitcnts being emitted in some cases.

     s_load_dword ...
     s_waitcnt lgkmcnt(0)    <-- pre-existing wait count
     ds_read_b32 v0, ...
     ds_read_b32 v1, ...
     s_waitcnt lgkmcnt(0)    <-- this is too conservative
     use(v0)
     more code
     use(v1)

   This increases code size a bit, but the reduced latency should still be a
   win in basically all cases. The worst code size regressions in my shader-db
   are:

 WORST REGRESSIONS - Code Size
 Before After     Delta Percentage
   1724  1736        12    0.70 %   shaders/private/f1-2015/1334.shader_test [0]
   2276  2284         8    0.35 %   shaders/private/f1-2015/1306.shader_test [0]
   4632  4640         8    0.17 %   shaders/private/ue4_elemental/62.shader_test [0]
   2376  2384         8    0.34 %   shaders/private/f1-2015/1308.shader_test [0]
   3284  3292         8    0.24 %   shaders/private/talos_principle/1955.shader_test [0]

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54226

llvm-svn: 347848
2018-11-29 11:06:06 +00:00
Francis Visoiu Mistrih d7eebd6d83 [CodeGen][NFC] Make `TII::getMemOpBaseImmOfs` return a base operand
Currently, instructions doing memory accesses through a base operand that is
not a register cannot be analyzed using `TII::getMemOpBaseRegImmOfs`.

This means that functions such as `TII::shouldClusterMemOps` will bail
out on instructions using an FI as a base instead of a register.

The goal of this patch is to refactor all this to return a base
operand instead of a base register.

Then in a separate patch, I will add FI support to the mem op clustering
in the MachineScheduler.

Differential Revision: https://reviews.llvm.org/D54846

llvm-svn: 347746
2018-11-28 12:00:20 +00:00
Stanislav Mekhanoshin 443a7f9788 [AMDGPU] Disable DAG combine at -O0
Differential Revision: https://reviews.llvm.org/D54358

llvm-svn: 347659
2018-11-27 15:13:37 +00:00
Matt Arsenault 88ce3dcbc8 AMDGPU: Record SGPR spills when restoring too
It's possible in some cases to have a restore present
without a corresponding spill. Due to an apparent bug
in D54366 <https://reviews.llvm.org/D54366>, only the
restore for a register was emitted. It's probably
always a bug for this to happen, but due to how SGPR
spilling is implemented, this makes the issue appear
worse than it is.

llvm-svn: 347595
2018-11-26 21:28:40 +00:00