Commit Graph

1794 Commits

Author SHA1 Message Date
Nicolai Haehnle db6911a6f9 AMDGPU: Remove old-style image intrinsics
Summary:
This also removes the need for atomic pseudo instructions, since
we select the correct encoding directly in SITargetLowering::lowerImage
for dimension-aware image intrinsics.

Mesa uses dimension-aware image intrinsics since
commit a9a7993441.

Change-Id: I7473d20009476a4ed6d919cae4e6dca9ff42e77a

Reviewers: arsenm, rampitec, mareko, tpr, b-sumner

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D48167

llvm-svn: 335231
2018-06-21 13:37:45 +00:00
Nicolai Haehnle 1045928aab AMDGPU: Convert test cases to the dimension-aware intrinsics
Summary:
Also explicitly port over some tests in llvm.amdgcn.image.* that were
missing. Some tests are removed because they no longer apply (i.e.
explicitly testing building an address vector via insertelement).

This is in preparation for the eventual removal of the old-style
intrinsics.

Some additional notes:
- constant-address-space-32bit.ll: change some GCN-NEXT to GCN because
  the instruction schedule was subtly altered
- insert_vector_elt.ll: the old test didn't actually test anything,
  because %tmp1 was not used; remove the load, because it doesn't work
  (Because of the amdgpu_ps calling convention? In any case, it's
  orthogonal to what the test claims to be testing.)

Change-Id: Idfa99b6512ad139e755e82b8b89548ab08f0afcf

Reviewers: arsenm, rampitec

Subscribers: MatzeB, qcolombet, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D48018

llvm-svn: 335229
2018-06-21 13:37:19 +00:00
Nicolai Haehnle 7a9c03f484 AMDGPU: Select MIMG instructions manually in SITargetLowering
Summary:
Having TableGen patterns for image intrinsics is hitting limitations:
for D16 we already have to manually pre-lower the packing of data
values, and we will have to do the same for A16 eventually.

Since there is already some custom C++ code anyway, it is arguably easier
to just do everything in C++, now that we can use the beefed-up generic
tables backend of TableGen to provide all the required metadata and map
intrinsics to corresponding opcodes. With this approach, all image
intrinsic lowering happens in SITargetLowering::lowerImage. That code is
dense due to all the cases that it handles, but it should still be easier
to follow than what we had before, by virtue of it all being done in a
single location, and by virtue of not relying on the TableGen pattern
magic that very few people really understand.

This means that we will have MachineSDNodes with MIMG instructions
during DAG combining, but that seems alright: previously we had
intrinsic nodes instead, but those are similarly opaque to the generic
CodeGen infrastructure, and the final pattern matching just did a 1:1
translation to machine instructions anyway. If anything, the fact that
we now merge the address words into a vector before DAG combine should
be an advantage.

Change-Id: I417f26bd88f54ce9781c1668acc01f3f99774de6

Reviewers: arsenm, rampitec, rtaylor, tstellar

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D48017

llvm-svn: 335228
2018-06-21 13:36:57 +00:00
Nicolai Haehnle b3a9b68513 AMDGPU: Add implicit def of SCC to kill and indirect pseudos
Summary:
Kill instructions sometimes do use SCC in unusual circumstances, when
v_cmpx cannot be used due to the operands that are involved.

Additionally, even if SCC was never defined by the expansion, kill pseudos
could previously occur between an s_cmp and an s_cbranch_scc, which breaks
the SCC liveness tracking when the pseudo is expanded to split the basic
block. While it would be possible to explicitly mark the SCC as live-in for
the successor basic block, it's simpler to just mark the pseudo as using SCC,
so that such a sequence is never emitted by instruction selection in the
first place.

A similar issue affects indirect source/dest pseudos in principle, although
I haven't been able to come up with a test case where it actually matters
(this affects instruction selection, so a MIR test can't be used).

Fixes: dEQP-GLES3.functional.shaders.discard.dynamic_loop_always
Change-Id: Ica8d82ecff1a763b892a1112cf1b06c948863a4f

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D47761

llvm-svn: 335223
2018-06-21 13:36:08 +00:00
Nicolai Haehnle f267431901 AMDGPU: Turn D16 for MIMG instructions into a regular operand
Summary:
This allows us to reduce the number of different machine instruction
opcodes, which reduces the table sizes and helps flatten the TableGen
multiclass hierarchies.

We can do this because for each hardware MIMG opcode, we have a full set
of IMAGE_xxx_Vn_Vm machine instructions for all required sizes of vdata
and vaddr registers. Instead of having separate D16 machine instructions,
a packed D16 instructions loading e.g. 4 components can simply use the
same V2 opcode variant that non-D16 instructions use.

We still require a TSFlag for D16 buffer instructions, because the
D16-ness of buffer instructions is part of the opcode. Renaming the flag
should help avoid future confusion.

The one non-obvious code change is that for gather4 instructions, the
disassembler can no longer automatically decide whether to use a V2 or
a V4 variant. The existing logic which choose the correct variant for
other MIMG instruction is extended to cover gather4 as well.

As a bonus, some of the assembler error messages are now more helpful
(e.g., complaining about a wrong data size instead of a non-existing
instruction).

While we're at it, delete a whole bunch of dead legacy TableGen code.

Change-Id: I89b02c2841c06f95e662541433e597f5d4553978

Reviewers: arsenm, rampitec, kzhuravl, artem.tamazov, dp, rtaylor

Subscribers: wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D47434

llvm-svn: 335222
2018-06-21 13:36:01 +00:00
Alina Sbirlea dfd14adeb0 Generalize MergeBlockIntoPredecessor. Replace uses of MergeBasicBlockIntoOnlyPred.
Summary:
Two utils methods have essentially the same functionality. This is an attempt to merge them into one.
1. lib/Transforms/Utils/Local.cpp : MergeBasicBlockIntoOnlyPred
2. lib/Transforms/Utils/BasicBlockUtils.cpp : MergeBlockIntoPredecessor

Prior to the patch:
1. MergeBasicBlockIntoOnlyPred
Updates either DomTree or DeferredDominance
Moves all instructions from Pred to BB, deletes Pred
Asserts BB has single predecessor
If address was taken, replace the block address with constant 1 (?)

2. MergeBlockIntoPredecessor
Updates DomTree, LoopInfo and MemoryDependenceResults
Moves all instruction from BB to Pred, deletes BB
Returns if doesn't have a single predecessor
Returns if BB's address was taken

After the patch:
Method 2. MergeBlockIntoPredecessor is attempting to become the new default:
Updates DomTree or DeferredDominance, and LoopInfo and MemoryDependenceResults
Moves all instruction from BB to Pred, deletes BB
Returns if doesn't have a single predecessor
Returns if BB's address was taken

Uses of MergeBasicBlockIntoOnlyPred that need to be replaced:

1. lib/Transforms/Scalar/LoopSimplifyCFG.cpp
Updated in this patch. No challenges.

2. lib/CodeGen/CodeGenPrepare.cpp
Updated in this patch.
  i. eliminateFallThrough is straightforward, but I added using a temporary array to avoid the iterator invalidation.
  ii. eliminateMostlyEmptyBlock(s) methods also now use a temporary array for blocks
Some interesting aspects:
  - Since Pred is not deleted (BB is), the entry block does not need updating.
  - The entry block was being updated with the deleted block in eliminateMostlyEmptyBlock. Added assert to make obvious that BB=SinglePred.
  - isMergingEmptyBlockProfitable assumes BB is the one to be deleted.
  - eliminateMostlyEmptyBlock(BB) does not delete BB on one path, it deletes its unique predecessor instead.
  - adding some test owner as subscribers for the interesting tests modified:
    test/CodeGen/X86/avx-cmp.ll
    test/CodeGen/AMDGPU/nested-loop-conditions.ll
    test/CodeGen/AMDGPU/si-annotate-cf.ll
    test/CodeGen/X86/hoist-spill.ll
    test/CodeGen/X86/2006-11-17-IllegalMove.ll

3. lib/Transforms/Scalar/JumpThreading.cpp
Not covered in this patch. It is the only use case using the DeferredDominance.
I would defer to Brian Rzycki to make this replacement.

Reviewers: chandlerc, spatel, davide, brzycki, bkramer, javed.absar

Subscribers: qcolombet, sanjoy, nemanjai, nhaehnle, jlebar, tpr, kbarton, RKSimon, wmi, arsenm, llvm-commits

Differential Revision: https://reviews.llvm.org/D48202

llvm-svn: 335183
2018-06-20 22:01:04 +00:00
Stanislav Mekhanoshin 20279dc025 Allow binop C1, (select cc, CF, CT) -> select folding
Previously this folding was done only if select is a first operand.
However, for non-commutative operations constant may go before
select.

Differential Revision: https://reviews.llvm.org/D48223

llvm-svn: 335167
2018-06-20 20:24:20 +00:00
Matt Arsenault 5a4ec8127f AMDGPU: Fix scalar_to_vector for v4i16/v4f16
llvm-svn: 335161
2018-06-20 19:45:48 +00:00
Michael Berg 7b993d762f Utilize new SDNode flag functionality to expand current support for fadd
Summary: This patch originated from D46562 and is a proper subset, with some issues addressed.

Reviewers: spatel, hfinkel, wristow, arsenm, javed.absar

Reviewed By: spatel

Subscribers: wdng, nhaehnle

Differential Revision: https://reviews.llvm.org/D47909

llvm-svn: 334996
2018-06-18 23:44:59 +00:00
Krzysztof Parzyszek 546017322f Shrink interval after moving copy in removePartialRedundancy
llvm-svn: 334963
2018-06-18 17:16:39 +00:00
Stanislav Mekhanoshin 3b11794dbf [AMDGPU] setcc (select cc, CT, CF), CF, eq | ne -> xor cc, -1 | cc
This is the common case in the BE when we serialize condition and then
rematerialize it. Use either original or inverted condition.

Differential Revision: https://reviews.llvm.org/D48246

llvm-svn: 334882
2018-06-16 03:46:59 +00:00
Michael Berg 02d1c6c0cf Utilize new SDNode flag functionality to expand current support for fdiv
Summary: This patch originated from D46562 and is a proper subset, with some issues addressed.

Reviewers: spatel, hfinkel, wristow, arsenm

Reviewed By: spatel

Subscribers: wdng, nhaehnle

Differential Revision: https://reviews.llvm.org/D47954

llvm-svn: 334862
2018-06-15 20:44:55 +00:00
Matt Arsenault 63bc0e3cb9 AMDGPU: Add combine for short vector extract_vector_elts
Try to access pieces 4 bytes at a time. This helps
various hasOneUse extract_vector_elt combines, such
as load width reductions.

Avoids test regressions in a future commit.

llvm-svn: 334836
2018-06-15 15:31:36 +00:00
Matt Arsenault 02dc7e19e2 AMDGPU: Make v4i16/v4f16 legal
Some image loads return these, and it's awkward working
around them not being legal.

llvm-svn: 334835
2018-06-15 15:15:46 +00:00
Matt Arsenault df2f4ef29d DAG: Fix creating concat_vectors with illegal type
Test passes as is, but fails with future patch to make v4i16/v4f16
legal.

llvm-svn: 334823
2018-06-15 12:09:15 +00:00
Roman Lebedev dec562c849 [AMDGPU] Recognize x & ~(-1 << y) pattern.
Summary: The same pattern as D48010, but this one is IR-canonical as of D47428.

Reviewers: nhaehnle, bogner, tstellar, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Tags: #amdgpu

Differential Revision: https://reviews.llvm.org/D48012

llvm-svn: 334817
2018-06-15 09:56:45 +00:00
Roman Lebedev 9c17dad8f2 [AMDGPU] Recognize x & ((1 << y) - 1) pattern.
Summary:
As a followup for D48007.

Since we already handle `x << (bitwidth - y) >> (bitwidth - y)` pattern,
which does not have ub for both the edge cases (`y == 0`, `y == bitwidth`),
i think also handling a pattern that is ub for `y == bitwidth` should be fine.

Reviewers: nhaehnle, bogner, tstellar, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Tags: #amdgpu

Differential Revision: https://reviews.llvm.org/D48010

llvm-svn: 334816
2018-06-15 09:56:39 +00:00
Roman Lebedev aa8587d1fc [AMDGPU] Recognize x & (-1 >> (32 - y)) pattern.
Summary:
D47980 will canonicalize the `x << (32 - y) >> (32 - y)`,
which is the pattern the AMDGPU expects to `x &  (-1 >> (32 - y))`,
which is not recognized by AMDGPU.

Thus, it needs to be recognized, too.

Reviewers: nhaehnle, bogner, tstellar, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Tags: #amdgpu

Differential Revision: https://reviews.llvm.org/D48007

llvm-svn: 334815
2018-06-15 09:56:31 +00:00
Tom Stellard a92847359a AMDGPU/GlobalISel: Implement select() for @llvm.amdgcn.cvt.pkrtz
Reviewers: arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45907

llvm-svn: 334757
2018-06-14 19:26:37 +00:00
Tom Stellard 46bbbc33c0 AMDGPU/GlobalISel: Implement select() for 32-bit G_FADD and G_FMUL
Reviewers: arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D46171

llvm-svn: 334665
2018-06-13 22:30:47 +00:00
Stanislav Mekhanoshin 7bec57300c [AMDGPU] Corrected computeKnownBits for V_PERM_B32
Differential Revision: https://reviews.llvm.org/D48133

llvm-svn: 334640
2018-06-13 18:52:54 +00:00
Yaxun Liu fb17bf60dd [AMDGPU] Change enqueue kernel handle type
Currently the handle type is a global pointer which holds 8 bytes.
We need a larger type which hold 16 bytes, therefore change it
to [i64 x 2].

Differential Revision: https://reviews.llvm.org/D48094

llvm-svn: 334625
2018-06-13 17:31:51 +00:00
Krzysztof Parzyszek 3e039f86cc Revert "Improve handling of COPY instructions with identical value numbers"
This reverts r334594, it breaks buildbots and fails with expensive checks.

llvm-svn: 334598
2018-06-13 13:49:06 +00:00
Krzysztof Parzyszek 36b816f814 Improve handling of COPY instructions with identical value numbers
Differential Revision: https://reviews.llvm.org/D48102

llvm-svn: 334594
2018-06-13 12:47:17 +00:00
Stanislav Mekhanoshin 8fd3c4e431 [AMDGPU] DAG combine to produce V_PERM_B32
Differential Revision: https://reviews.llvm.org/D48099

llvm-svn: 334559
2018-06-12 23:50:37 +00:00
Konstantin Zhuravlyov ce25bc3e82 AMDHSA/NFC: Code object v3 updates (additional):
- Move section selection and alignment to AMDGPUAsmPrinter

llvm-svn: 334521
2018-06-12 18:33:51 +00:00
Konstantin Zhuravlyov 00f2cb1116 AMDHSA: Code object v3 updates
- Do not emit following assembler directives:
  - .hsa_code_object_version
  - .hsa_code_object_isa
  - .amd_amdgpu_isa
  - .amd_amdgpu_hsa_metadata
  - .amd_amdgpu_pal_metadata
- Do not emit .note entries
- Cleanup and bring in sync kernel descriptor header file
- Emit kernel descriptor into .rodata with appropriate relocations and
  alignments

llvm-svn: 334519
2018-06-12 18:02:46 +00:00
Mark Searles 987f292c56 [AMDGPU] prevent hitting Assertion `isReg() && "Wrong MachineOperand accessor"'
The use iterator, used within findMaskOperands(), can return anything which is
not a def. isUse() requires a register, so check isReg() before calling isUse().

Differential Revision: https://reviews.llvm.org/D48047

llvm-svn: 334459
2018-06-12 00:41:26 +00:00
Stanislav Mekhanoshin 7ba3fc730c [AMDGPU] Do not consider indirect acces through phi for wave limiter
Rational: if there is indirect access that is usually an issue
because load is not ready by the use. However, if use is inside a
loop and load is outside that is potentially an issue for a first
iteration only.

Differential Revision: https://reviews.llvm.org/D47740

llvm-svn: 334420
2018-06-11 16:50:49 +00:00
Roman Lebedev b896c4e860 [NFC][AMDGPU] Add tests for all the various IR patterns equivalent to extracting low bits.
Summary:
The idiom recognition seems rather poor.
Only the `@bzhi32_d0` produces `v_bfe_u32`.
But they all should.

This needs to be fixed before D47980 can be re-landed.

Reviewers: mareko, bogner, rampitec, arsenm, tstellar, nhaehnle

Reviewed By: nhaehnle

Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Tags: #amdgpu

Differential Revision: https://reviews.llvm.org/D48005

llvm-svn: 334398
2018-06-11 10:21:10 +00:00
Daniil Fukalov c9a098b314 [AMDGPU] Inline asm - added i16, half and i128 types support
AMDGPU inline assembler support i16, half and i128 typed variables in constraints, but they were reported as error.
Needed to fix https://github.com/RadeonOpenCompute/ROCm/issues/341,
e.g. to be able to load with global_load_dwordx4 to a 128bit integer variable

Differential Revision: https://reviews.llvm.org/D44920

llvm-svn: 334301
2018-06-08 16:29:04 +00:00
Matt Arsenault 6fc3759811 AMDGPU: Error on LDS global address in functions
These won't work as expected now, so error on them to avoid
wasting time debugging this in the future.

llvm-svn: 334269
2018-06-08 08:05:54 +00:00
Tony Tye a5a7c331e7 [AMDGPU] Simplify memory legalizer
- Make code easier to maintain.
- Avoid generating waitcnts for VMEM if the address sppace does not involve VMEM.
- Add support to generate waitcnts for LDS and GDS memory.

Differential Revision: https://reviews.llvm.org/D47504

llvm-svn: 334241
2018-06-07 22:28:32 +00:00
Matt Arsenault f1c868ef08 AMDGPU: Fix not including v2f64 in SReg_128
Fixes assertion with calls returning v2f64.

llvm-svn: 334189
2018-06-07 12:16:31 +00:00
Matt Arsenault 697300bd4f AMDGPU: Use scalar operations for f16 fabs/fneg patterns
Fixes unnecessary differences between subtargets.

llvm-svn: 334184
2018-06-07 10:15:20 +00:00
Matt Arsenault 90083d3088 AMDGPU: Try a lot harder to emit scalar loads
This has two main components. First, widen
widen short constant loads in DAG when they have
the correct alignment. This is already done a bit in
AMDGPUCodeGenPrepare, since that has access to
DivergenceAnalysis. This can't help kernarg loads
created in the DAG. Start to use DAG divergence analysis
to help this case.

The second part is to avoid kernel argument lowering
breaking the alignment of short vector elements because
calling convention lowering wants to split everything
into legal register types.

When loading a split type, load the nearest 4-byte aligned
segment and shift to get the desired bits. This extra
load of the earlier argument piece ends up merging,
and the bit extract hopefully folds out.

There are a number of improvements and regressions with
this, but I think as-is this is a better compromise between
several of the worst parts of SelectionDAG.

Particularly when i16 is legal, this produces worse code
for i8 and i16 element vector kernel arguments. This is
partially due to the very weak load merging the DAG does.
It only looks for fairly specific combines between pairs
of loads which no longer appear. In particular this
causes v4i16 loads to be split into 2 components when
previously the two halves were merged.

Worse, because of the newly introduced shifts, there
is a lot more unnecessary vector packing and unpacking code
emitted. At least some of this is due to reporting
false for isTypeDesirableForOp for i16 as a workaround for
the lack of divergence information in the DAG. The cases
where this happens it doesn't actually matter, but the
relevant code in SimplifyDemandedBits doens't have the context
to know to ignore this.

The use of the  scalar cache is probably more important
than the mess of mostly scalar instructions doing this packing
and unpacking. Future work can fix this, possibly by making better
use of the new DAG divergence information for controlling promotion
decisions, or adding another version of shift + trunc + shift
combines that doesn't only know about the used types.

llvm-svn: 334180
2018-06-07 09:54:49 +00:00
Stanislav Mekhanoshin df61be70b2 [AMDGPU] Improve reciprocal handling
When denormals are supported we are producing a full division for
1.0f / x. That still can be replaced by the faster version:

    bool c = fabs(x) > 0x1.0p+96f;
    float s = c ? 0x1.0p-32f : 1.0f;
    x *= s;
    return s * v_rcp_f32(x)

in case if requested accuracy is 2.5ulp or less. The same version
is used if denormals are not supported for non 1.0 numerators, where
just v_rcp_f32 is then used for 1.0 numerator.

The optimization of 1/x is extended to the case -1/x, which is the
same except for the resulting sign bit.

OpenCL conformance passed with both enabled and disabled denorms.

Differential Revision: https://reviews.llvm.org/D47805

llvm-svn: 334142
2018-06-06 22:22:32 +00:00
Matt Arsenault e9524f1fb3 AMDGPU: Custom lower v2f16 fneg/fabs with illegal f16
Fixes terrible code on targets without f16 support. The
legalization creates a mess that is difficult to recover
from. Also should avoid randomly breaking these tests
multiple times in sequence in future commits.

Some regressions in cases where it happens to be better
to pull the source modifier after the conversion.

llvm-svn: 334132
2018-06-06 21:28:11 +00:00
Matt Arsenault 57e541e87e AMDGPU: Preserve metadata when widening loads
Preserves the low bound of the !range. I don't think
it's legal to do anything with the top half since it's
theoretically reading garbage.

llvm-svn: 334045
2018-06-05 19:52:56 +00:00
Matt Arsenault 9224c00d2b AMDGPU: Use more custom insert/extract_vector_elt lowering
Apply to i8 vectors.

llvm-svn: 334044
2018-06-05 19:52:46 +00:00
Matt Arsenault 191bc71541 DAG: Stop dropping invariant/dereferencable
When legalizing illegal FP load results, this was
for some reason dropping the invariant and dereferencable
memory flags. There doesn't seem to be any reason for this,
and the equivalent isn't done for integer loads.

Fixes an issue in a future AMDGPU commit where some identical
loads fail to merge because one of the loads ends up
dropping the flags.

llvm-svn: 334020
2018-06-05 14:52:24 +00:00
Scott Linder ba81d7f1eb [CodeGen] Always update divergence in SelectionDAG::UpdateNodeOperands
Some overloads failed to update divergence.

Differential Revision: https://reviews.llvm.org/D47148

llvm-svn: 333947
2018-06-04 20:19:45 +00:00
Mark Searles f0b93f1e9e [AMDGPU][Waitcnt] Fix handling of flat instrs
On GFX9 and earlier, flat memory ops may decrement VMCNT out-of-order as well as LGKMCNT out-of-order.

Differential Revision: https://reviews.llvm.org/D46616

llvm-svn: 333926
2018-06-04 16:51:59 +00:00
Matt Arsenault 72a9f52c87 AMDGPU: Switch some half using-tests to use amdhsa
The default clover ABI weirdly promotes half to float,
which should probably be fixed.

llvm-svn: 333730
2018-06-01 07:06:03 +00:00
Stanislav Mekhanoshin 739174c4be [AMDGPU] Construct memory clauses before RA
Memory clauses are formed into bundles in presence of xnack.
Their source operands are marked as early-clobber.

This allows to allocate distinct source and destination registers
within a clause and prevent breaking the clause with s_nop in the
hazard recognizer.

Clauses are undone before post-RA scheduler to allow some rescheduling,
which will not break the clause since artificial edges are created in
the dag to keep memory operations together. Yet this allows a better
ILP in some cases.

Differential Revision: https://reviews.llvm.org/D47511

llvm-svn: 333691
2018-05-31 20:13:51 +00:00
Stanislav Mekhanoshin 7137f609f4 [AMDGPU] Fixed incorrect -mcpu=gfx800 in xnor.ll test. NFC.
llvm-svn: 333687
2018-05-31 19:39:54 +00:00
Jan Vesely f5016b79a6 AMDGPU/R600: Make sure functions are cacheline aligned
v2: use "ensureAlignment"
    make functions cache line aligned
Fixes GPU hangs since r333219:
"AMDGPU: Split R600 AsmPrinter code into its own class"

Differential Revision: https://reviews.llvm.org/D47516

llvm-svn: 333622
2018-05-31 04:08:08 +00:00
Matt Arsenault 7b4826e6ce AMDGPU: Use better alignment for kernarg lowering
This was just emitting loads with the ABI alignment
for the raw type. The true alignment is often better,
especially when an illegal vector type was scalarized.
The better alignment allows using a scalar load
more often.

llvm-svn: 333558
2018-05-30 16:17:51 +00:00
Mark Searles 1054541490 [AMDGPU][Waitcnt] Fix handling of loops with many bottom blocks
In terms of waitcnt insertion/if necessary, the waitcnt pass forces convergence
for a loop. Previously, that kicked if greater than 2 passes over a loop, which
doesn't account for loop with many bottom blocks. So, increase the threshold to
(n+1), where n is the number of bottom blocks. This gives the pass an
opportunity to consider the contribution of each bottom block, to the overall
loop, before the forced convergence potentially kicks in.

Differential Revision: https://reviews.llvm.org/D47488

llvm-svn: 333556
2018-05-30 15:47:45 +00:00
Matt Arsenault 4b3829d8cf AMDGPU: Fix broken check lines
llvm-svn: 333458
2018-05-29 19:35:53 +00:00
Matt Arsenault 1ea0402e82 AMDGPU: Round up kernel argument allocation size
AFAIK the driver's allocation will actually have to round this
up anyway. It is useful to track the rounded up size, so that
the end of the kernel segment is known to be dereferencable so
a wider s_load_dword can be used for a short argument at the end
of the segment.

llvm-svn: 333456
2018-05-29 19:35:00 +00:00
Konstantin Zhuravlyov 2ca6b1f2ba AMDGPU: Always set COMPUTE_PGM_RSRC2.ENABLE_TRAP_HANDLER to zero for AMDHSA as
it is set by CP

Differential Revision: https://reviews.llvm.org/D47392

llvm-svn: 333451
2018-05-29 19:09:13 +00:00
Tim Renouf 364edcd2e5 [AMDGPU] Fixed WWM bug in block otherwise entirely in WQM
Summary:
For a block with WQM on entry and exit and containing no exact mode
code, but containing some WWM code, the WQM pass forgot to process the
block at all and so did not insert code to enter and leave WWM.

This commit fixes that.

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D47027

Change-Id: I044792eead1293bed4203fb26ce75f47878afeb6
llvm-svn: 333362
2018-05-27 17:26:11 +00:00
Mark Searles 32efedcff3 [AMDGPU][Waitcnt] Remove obsolete waitcnt option
With the removal of the old waitcnt pass, the '-enable-si-insert-waitcnts' option is obsolete. Remove it.

Differential Revision: https://reviews.llvm.org/D47378

llvm-svn: 333303
2018-05-25 20:24:08 +00:00
Stanislav Mekhanoshin 1c538423dc [AMDGPU] Add perf hints to functions
This is adoption of HSAIL perfhint pass. Two types of hints are produced:

1. Function is memory bound.
2. Kernel can use wave limiter.

Currently these hints are used in the scheduler. If a function is suspected
to be memory bound we allow occupancy to decrease to 4 waves in the course
of scheduling.

Differential Revision: https://reviews.llvm.org/D46992

llvm-svn: 333289
2018-05-25 17:25:12 +00:00
Tim Renouf ad8b7c1190 [AMDGPU] Fixed incorrect break from loop
Summary:
Lower control flow did not correctly handle the case that a loop break
in if/else was on a condition that was not guaranteed to be masked by
exec. The first test kernel shows an example of this going wrong; after
exiting the loop, exec is all ones, even if it was not before the loop.

The fix is for lowering of if-break and else-break to insert an
S_AND_B64 to mask the break condition with exec. This commit also
includes the optimization of not inserting that S_AND_B64 if it is
obviously not needed because the break condition is the result of a
V_CMP in the same basic block.

V2: Addressed some review comments.
V3: Test fixes.

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D44046

Change-Id: I0fc56a01209a9e99d1d5c9b0ffd16f111caf200c
llvm-svn: 333258
2018-05-25 07:55:04 +00:00
Changpeng Fang 5f9154618e StructurizeCFG: Adjust the loop depth for a subregion to order the nodes correctly
Summary:
  StructurizeCFG::orderNodes basically uses a reverse post-order (RPO) traversal of the region list to get the order.
The only problem with it is that sometimes backedges for outer loops will be visited before backedges for inner loops.
To solve this problem, a loop depth based approach has been used to make sure all blocks in this loop has been visited
before moving on to outer loop.

However, we found a problem for a SubRegion which is a loop itself:

--> BB1 --> BB2 --> BB3 -->

In this case, BB2 is a SubRegion (loop), and thus its loopdepth is different than that of BB1 and BB3. This fact will lead
BB2 to be placed in the wrong order.

In this work, we treat the SubRegion as a special case and use its exit block to determine the loop and its depth
to guard the sorting.

Reviewers:
  arsenm, jlebar

Differential Revision:
  https://reviews.llvm.org/D46912

llvm-svn: 333111
2018-05-23 18:34:48 +00:00
Matt Arsenault f0c5c6b3f3 AMDGPU: Fix missing test coverage for some 16-bit and packed ops
llvm-svn: 333024
2018-05-22 20:42:00 +00:00
Matt Arsenault 606bc315d6 AMDGPU: Fix v2f16 fneg/fabs pattern
The integer operation convertion for some reason only happens
if the source is a bitcast from an integer, which happens to
always be the situation when the result is loaded. Add
an additional pattern for when the source operation is really
an FP operation.

llvm-svn: 333019
2018-05-22 20:13:34 +00:00
Matt Arsenault 1349a04ef5 AMDGPU: Make v2i16/v2f16 legal on VI
This usually results in better code. Fixes using
inline asm with short2, and also fixes having a different
ABI for function parameters between VI and gfx9.

Partially cleans up the mess used for lowering of the d16
operations. Making v4f16 legal will help clean this up more,
but this requires additional work.

llvm-svn: 332953
2018-05-22 06:32:10 +00:00
Sanjay Patel 17a870f07c [DAG] fold FP binops with undef operands to NaN
This is the FP sibling of D43141 with the corresponding IR change in rL327212.

We can't propagate undef here because if a variable operand is a NaN, these 
binops must propagate NaN. Neither global nor node-level fast-math makes a 
difference. If we have 'nnan', I think later folds can turn the NaN into undef.

The tests in X86/fp-undef.ll are meant to be the definitive verification for 
these folds - everything reduces identically now.

The other test changes are collateral damage. They may need to be altered to
preserve their intent.

Differential Revision: https://reviews.llvm.org/D47026

llvm-svn: 332920
2018-05-21 23:54:19 +00:00
Matt Arsenault 372d796ab1 AMDGPU: Add pass to optimize reqd_work_group_size
Eliminate loads from the dispatch packet when they will have
a known value.

Also pattern match the code used by the library to handle partial
workgroup dispatches, which isn't necessary if reqd_work_group_size
is used.

llvm-svn: 332771
2018-05-18 21:35:00 +00:00
Changpeng Fang 860d460063 AMDGPU/SI: Don't promote alloca to vector for atomic load/store
Summary:
  Don't promote alloca to vector for atomic load/store

Reviewer:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D46085

llvm-svn: 332673
2018-05-17 21:49:44 +00:00
Changpeng Fang 391bcf8893 AMDGPU/SI: Handle infinite loop for the structurizer to work with CFG with infinite loops.
Summary:
  The current StructurizeCFG pass only works for CFG with one exit. AMDGPUUnifyDivergentExitNodes combines multiple "return" blocks and/or "unreachable" blocks
to one exit block for the Structurizer to work. However, infinite loop is another kind of special "exit", and if we don't handle it, the case of multiple exits will prevent the structurizer from working.

In this work, for each infinite loop, we add a dummy edge to the "return" block, and thus the AMDGPUUnifyDivergentExitNodes pass will work with infinite loops.
This will make CFG with infinite loops be structurized.

Reviewer:
  nhaehnle

Differential Revision:
  https://reviews.llvm.org/D46340

llvm-svn: 332625
2018-05-17 16:45:01 +00:00
Stanislav Mekhanoshin 595fdcf43b [AMDGPU] Move lsr test. NFC.
llvm-svn: 332562
2018-05-17 01:30:51 +00:00
Konstantin Zhuravlyov c72ece6c2c AMDGPU : Recalculate SGPRs when trap handler is supported
Differential Revision: https://reviews.llvm.org/D29911

llvm-svn: 332523
2018-05-16 20:47:48 +00:00
Tony Tye 43259df44a [AMDGPU] Change llvm.debugtrap to be a debug breakpoint that can resume execution.
No longer require the queue pointer to be passed in in fixed SGPRs.

Differential Revision: https://reviews.llvm.org/D46769

llvm-svn: 332485
2018-05-16 16:19:34 +00:00
Matt Arsenault 67a9815a5c AMDGPU: Custom lower v4i16/v4f16 vector operations
Avoids stack access.

Also handle extract hi elt pattern from truncate + shift
to avoid a couple test regressions.

llvm-svn: 332453
2018-05-16 11:47:30 +00:00
Stanislav Mekhanoshin 57d341c27a [AMDGPU] Fix handling of void types in isLegalAddressingMode
It is legal for the type passed to isLegalAddressingMode to be
unsized or, more specifically, VoidTy. In this case, we must
check the legality of load / stores for all legal types. Directly
trying to call getTypeStoreSize is incorrect, and leads to breakage
in e.g. Loop Strength Reduction. This change guards against that
behaviour.

Differential Revision: https://reviews.llvm.org/D40405

llvm-svn: 332409
2018-05-15 22:07:51 +00:00
Marek Olsak 37b9f55cc6 AMDGPU: Add a missing test for the 128-bit local addr space option
This should have been pushed with:
  "AMDGPU: enable 128-bit for local addr space under an option"

llvm-svn: 332404
2018-05-15 21:41:57 +00:00
Tom Stellard e182b28ae4 AMDGPU/GlobalISel: Implement select() for G_FCONSTANT
Summary: Also clean up G_CONSTANT selection.

Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D46170

llvm-svn: 332379
2018-05-15 17:57:09 +00:00
Matt Arsenault dfb88dfe30 AMDGPU: Make undef legal for v2i16/v2f16
This is apparently necessary to stop undef from being
turned into a build_vector of 0s.

llvm-svn: 332195
2018-05-13 10:04:38 +00:00
Stanislav Mekhanoshin 7012c246c1 [AMDGPU] Fix amdgpu-waves-per-eu accounting in scheduler
We cannot query this attribute from a subtarget given a machine function.
At this point attribute itself is already unavailable and can only be
obtained through MFI.

Differential Revision: https://reviews.llvm.org/D46781

llvm-svn: 332166
2018-05-12 01:41:56 +00:00
Tom Stellard 655fdd3f82 AMDGPU/GlobalISel: Implement select() for >32-bit G_STORE
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, llvm-commits, t-tye

Differential Revision: https://reviews.llvm.org/D46153

llvm-svn: 332154
2018-05-11 23:12:49 +00:00
Changpeng Fang f094885a9e AMDGPU/SI: Don't promote alloca to vector for AddrSpaceCast instruction.
Summary:
  We have no logic to promote alloca to vector for an AddrSpaceCast instruction.

Reviewer:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D45993

llvm-svn: 332147
2018-05-11 22:17:57 +00:00
Yaxun Liu deba150c27 [AMDGPU] Fix compilation failure when IR contains comdat
Remove a useless SwitchSection which also causes compilation failure
when IR contains comdat.

The SwitchSection is useless because the current section is already
correct text section for the function therefore no need to switch.

It causes compilation failure for comdat because functions with comdat
has specific text section, not the default .text section.

Since HIP uses comdat, this bug caused failures for HIP.

Differential Revision: https://reviews.llvm.org/D46770

llvm-svn: 332137
2018-05-11 20:40:14 +00:00
Tom Stellard dcc95e9385 AMDGPU/GlobalISel: Implement select() for 32-bit G_FPTOUI
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45883

llvm-svn: 332082
2018-05-11 05:44:16 +00:00
Tom Stellard 1e0edad4bb AMDGPU/GlobalISel: Implement select() for G_BITCAST s32 <--> <2 x s16>
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, wdng, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45881

llvm-svn: 332042
2018-05-10 21:20:10 +00:00
Tom Stellard 1dc90204bf AMDGPU/GlobalISel: Enable TableGen'd instruction selector
Reviewers: arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: kzhuravl, wdng, mgorny, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45994

llvm-svn: 332039
2018-05-10 20:53:06 +00:00
Farhana Aleen e24f3ff8de [AMDGPU] Support horizontal vectorization of min/max.
Author: FarhanaAleen

Reviewed By: rampitec

Subscribers: AMDGPU

Differential Revision: https://reviews.llvm.org/D46604

llvm-svn: 331920
2018-05-09 21:18:34 +00:00
Matt Arsenault eac81b2448 AMDGPU: Ignore any_extend in mul24 combine
If a multiply is truncated, SimplifyDemandedBits
sometimes turns a zero_extend of the inputs into an
any_extend, which makes the known bits computation unhelpful.
Ignore these and compute known bits for the underlying value,
since we insert the correct extend type after.

llvm-svn: 331919
2018-05-09 21:11:35 +00:00
Matt Arsenault 74fd7600d2 AMDGPU: Handle partial shift reduction for variable shifts
If the variable shift amount has known bits, we can still reduce
the shift.

llvm-svn: 331917
2018-05-09 20:52:54 +00:00
Matt Arsenault b143d9a5ea AMDGPU: Partially shrink 64-bit shifts if reduced to 16-bit
This is an extension of an existing combine to reduce wider
shls if the result fits in the final result type. This
introduces the same combine, but reduces the shift to a middle
sized type to avoid the slow 64-bit shift.

llvm-svn: 331916
2018-05-09 20:52:43 +00:00
Matt Arsenault 762d498808 AMDGPU: Add combine for trunc of bitcast from build_vector
If the truncate is only accessing the first element of the vector,
we can use the original source value.

This helps with some combine ordering issues after operations are
lowered to integer operations between bitcasts of build_vector.
In particular it stops unnecessarily materializing the unused
top half of a vector in some cases.

llvm-svn: 331909
2018-05-09 18:37:39 +00:00
Shiva Chen 2c864551df [DebugInfo] Add DILabel metadata and intrinsic llvm.dbg.label.
In order to set breakpoints on labels and list source code around
labels, we need collect debug information for labels, i.e., label
name, the function label belong, line number in the file, and the
address label located. In order to keep these information in LLVM
IR and to allow backend to generate debug information correctly.
We create a new kind of metadata for labels, DILabel. The format
of DILabel is

!DILabel(scope: !1, name: "foo", file: !2, line: 3)

We hope to keep debug information as much as possible even the
code is optimized. So, we create a new kind of intrinsic for label
metadata to avoid the metadata is eliminated with basic block.
The intrinsic will keep existing if we keep it from optimized out.
The format of the intrinsic is

llvm.dbg.label(metadata !1)

It has only one argument, that is the DILabel metadata. The
intrinsic will follow the label immediately. Backend could get the
label metadata through the intrinsic's parameter.

We also create DIBuilder API for labels to be used by Frontend.
Frontend could use createLabel() to allocate DILabel objects, and use
insertLabel() to insert llvm.dbg.label intrinsic in LLVM IR.

Differential Revision: https://reviews.llvm.org/D45024

Patch by Hsiangkai Wang.

llvm-svn: 331841
2018-05-09 02:40:45 +00:00
Matt Arsenault 80fb05dc28 AMDGPU: Fix broken check lines in test
llvm-svn: 331796
2018-05-08 18:43:44 +00:00
Matt Arsenault 3ec8803f53 AMDGPU: Don't use undef in a test
llvm-svn: 331795
2018-05-08 18:43:34 +00:00
Matt Arsenault 869cbedc81 AMDGPU: Fix broken dynamic vector indexing for packed types
The intention of this was to multiply by 16, not shift by 16.

llvm-svn: 331793
2018-05-08 18:43:25 +00:00
Roman Tereshin d2421f9445 [MachineVerifier][GlobalISel] Verifying generic extends and truncates
Making sure we don't truncate / extend pointers, don't try to change
vector topology or bitcast vectors to scalars or back, and most
importantly, don't extend to a smaller type or truncate to a large
one.

Reviewers: qcolombet t.p.northover aditya_nandakumar

Reviewed By: qcolombet

Subscribers: rovka, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D46490

llvm-svn: 331718
2018-05-08 02:48:15 +00:00
Mark Searles 4a0f2c5047 [AMDGPU][Waitcnt] Remove the old waitcnt pass
Remove the old waitcnt pass ( si-insert-waits ), which is no longer maintained
and getting crufty

Differential Revision: https://reviews.llvm.org/D46448

llvm-svn: 331641
2018-05-07 14:43:28 +00:00
Tim Renouf 18a1e9d03a [AMDGPU] Don't force WQM for DS op
Summary:
Previously, all DS ops forced WQM in a pixel shader. That was a hack to
allow for graphics frontends using ds_swizzle to implement explicit
derivatives, on SI/CI at least where DPP is not available. But it forced
WQM for _any_ DS op.

With this commit, DS ops no longer force WQM. Both graphics frontends
(Mesa and LLPC) need to change to issue an explicit llvm.amdgcn.wqm
intrinsic call when calculating explicit derivatives.

The required Mesa change is: "amd/common: use llvm.amdgcn.wqm for
explicit derivatives".

Subscribers: qcolombet, arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D46051

Change-Id: I9b745b626fa91bbd66456e6cf41ee07eeea42f81
llvm-svn: 331633
2018-05-07 13:21:26 +00:00
Daniel Sanders acc008cb0c [globalisel] Remove redundant -global-isel option from tests that use -run-pass. NFC
As Roman Tereshin pointed out in https://reviews.llvm.org/D45541, the
-global-isel option is redundant when -run-pass is given. -global-isel sets up
the GlobalISel passes in the pass manager but -run-pass skips that entirely and
configures it's own pipeline.

llvm-svn: 331603
2018-05-05 21:19:59 +00:00
Konstantin Zhuravlyov c2c2eb7d01 AMDGPU: Add D16 instructions preserve unused bits feature
- Predicate D16 patterns on this new feature
- Added this new feature to gfx900/2/4

Differential Revision: https://reviews.llvm.org/D46366

llvm-svn: 331551
2018-05-04 20:06:57 +00:00
Piotr Padlewski 5dde809404 Rename invariant.group.barrier to launder.invariant.group
Summary:
This is one of the initial commit of "RFC: Devirtualization v2" proposal:
https://docs.google.com/document/d/16GVtCpzK8sIHNc2qZz6RN8amICNBtvjWUod2SujZVEo/edit?usp=sharing

Reviewers: rsmith, amharc, kuhar, sanjoy

Subscribers: arsenm, nhaehnle, javed.absar, hiraditya, llvm-commits

Differential Revision: https://reviews.llvm.org/D45111

llvm-svn: 331448
2018-05-03 11:03:01 +00:00
Farhana Aleen 07e612340f [AMDGPU] A trivial fix for a buildbot failure caused by "commit 224a839fcbbead221f872cd32a1dd0c308d37299".
Author: FarhanaAleen
llvm-svn: 331383
2018-05-02 18:16:39 +00:00
Farhana Aleen 150cb6d91a Revert "[AMDGPU] performAddCombine should run after DAG is legalized."
This reverts commit 6b97d2995566b4dddd6bf0d75579ff44501d4494.

llvm-svn: 331371
2018-05-02 16:48:52 +00:00
Farhana Aleen 2f4100f56e [AMDGPU] performAddCombine should run after DAG is legalized.
Summary: performAddCombine should run after DAG is legalized; Otherwise generic optimization
         in the DAGCombiner can optimize an addcarry+trunc into an addcarry instruction with
         illegal types.

Author: FarhanaAleen

Reviewed By: rampitec

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D46337

llvm-svn: 331368
2018-05-02 16:24:10 +00:00
Farhana Aleen e2dfe8a853 [AMDGPU] Support horizontal vectorization.
Author: FarhanaAleen

Reviewed By: rampitec, arsenm

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D46213

llvm-svn: 331313
2018-05-01 21:41:12 +00:00
Konstantin Zhuravlyov 1501af4846 AMDGPU: Remove remnants of gfx901 (it was deprecated some time ago)
llvm-svn: 331298
2018-05-01 18:47:48 +00:00
Matt Arsenault 0084adc516 AMDGPU: Add Vega12 and Vega20
Changes by
  Matt Arsenault
  Konstantin Zhuravlyov

llvm-svn: 331215
2018-04-30 19:08:16 +00:00
Mark Searles a6322924e6 [AMDGPU][Waitcnt] Update a few tests to use default waitcnt pass (si-insert-waitcnts) rather than old pass (si-insert-waits); this is a small step towards the overall goal of removing the old waitcnt pass, which is no longer maintained.
Differential Revision: https://reviews.llvm.org/D46154

llvm-svn: 331062
2018-04-27 17:59:15 +00:00
Matt Arsenault 540512c297 DAG: Fix not legalizing vector fcanonicalizes
If an fcanoncialize was done on a vector type that was legal,

llvm-svn: 330981
2018-04-26 19:21:37 +00:00
Matt Arsenault fcc5ba46b7 AMDGPU: Extend extract_vector_elt fneg combine to fabs
Fixes a regression in a future commit.

llvm-svn: 330980
2018-04-26 19:21:32 +00:00
Mark Searles 2a19af6e17 [AMDGPU][Waitcnt] As of gfx7, VMEM operations do not increment the export counter and the input registers are available in the next instruction; update the waitcnt pass to take this into account.
Differential Revision: https://reviews.llvm.org/D46067

llvm-svn: 330954
2018-04-26 16:11:19 +00:00
Mark Searles ec58183e1b [AMDGPU] Waitcnt pass: add debug options
- Add "amdgpu-waitcnt-forcezero" to force all waitcnt instrs to be emitted as s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

- Add debug counters to control force emit of s_waitcnt instrs; debug counters:
si-insert-waitcnts-forceexp: force emit s_waitcnt expcnt(0) instrs
si-insert-waitcnts-forcevm: force emit s_waitcnt lgkmcnt(0) instrs
si-insert-waitcnts-forcelgkm: force emit s_waitcnt vmcnt(0) instrs

- Add some debug statements

Note that a variant of this patch was previously committed/reverted.

Differential Revision: https://reviews.llvm.org/D45888

llvm-svn: 330862
2018-04-25 19:21:26 +00:00
Francis Visoiu Mistrih 57fcd3454a [MIR] Add support for debug metadata for fixed stack objects
Debug var, expr and loc were only supported for non-fixed stack objects.

This patch adds the following fields to the "fixedStack:" entries, and
renames the ones from "stack:" to:

* debug-info-variable
* debug-info-expression
* debug-info-location

Differential Revision: https://reviews.llvm.org/D46032

llvm-svn: 330859
2018-04-25 18:58:06 +00:00
Alexander Timofeev b934728cd2 [AMDGPU] Revert b0efc4fd6 (https://reviews.llvm.org/D40556)
llvm-svn: 330818
2018-04-25 12:32:46 +00:00
Tom Stellard a2be8f4c35 AMDGPU: Remove deprecated llvm.AMDGPU.kilp intrinsic
Summary: This is no longer used by mesa since its 18.0.0 release.

Reviewers: nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye

Differential Revision: https://reviews.llvm.org/D45988

llvm-svn: 330775
2018-04-24 21:37:57 +00:00
Tom Stellard 257882ff72 AMDGPU/GlobalISel: Fall-back to SelectionDAG for non-void functions
Reviewers: arsenm, nhaehnle

Reviewed By: nhaehnle

Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45843

llvm-svn: 330774
2018-04-24 21:29:36 +00:00
Tom Stellard c7709e1c29 AMDGPU/GlobalISel: Add support for amdgpu_ps calling convention
Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45837

llvm-svn: 330767
2018-04-24 20:51:28 +00:00
Stanislav Mekhanoshin a4bfb3c446 [AMDGPU] Truncate packed inline constant
If a packed inline constant is sign extended it must be truncated
after the shift. I.e. a constant (0xH0000, 0xHBC00), will be represented
as 0xFFFFFFFFBC000000 in the IR because the immediate is sign extended
to 64 bit. After the value shifted right by 16 to use it in a low part
with op_sel_hi it becomes 0xFFFFFFFFBC00 and does not qualify as inline
constant any longer.

Fixed the error and added verification code. Without the fix and with
the verification bug is causing pk_max_f16_literal.ll to fail.

Differential Revision: https://reviews.llvm.org/D45987

llvm-svn: 330752
2018-04-24 18:17:55 +00:00
Matt Arsenault b21f9592be AMDGPU: Move a flawed assert when spilling SGPRs
It's possible to validly spill the frame offset register
in a call sequence to a VGPR. There are definitely issues
with SGPR spilling to memory, so move the assert later.

llvm-svn: 330612
2018-04-23 16:13:30 +00:00
Nicolai Haehnle 5a995664f0 AMDGPU: Fix a corner case crash in SIOptimizeExecMasking
Summary:
See the new test case; this is really unlikely to happen with real code,
but I ran into this while attempting to bugpoint-reduce a different issue.

Change-Id: I9ade1dc1aa8fd9c4d9fc83661d7b80e310b5c4a6

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45885

llvm-svn: 330585
2018-04-23 13:05:50 +00:00
Nicolai Haehnle 7a87977fb2 AMDGPU: Legalize the operand of SI_INIT_M0
Summary:
This fixes a case where the argument to a sendmsg intrinsic
ends up in a VGPR, for whatever reason.

The underlying performance issue is that a multiplication that
can be an s_mul_i32 is instead needlessly generated as
v_mul_u32_u24, but this is not addressed by this patch.

Change-Id: I61fd4034314d5acdf6074632c30b65364dfa7328

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45826

llvm-svn: 330393
2018-04-20 07:14:25 +00:00
Stanislav Mekhanoshin 160f85794d [AMDGPU] Use packed literals with zero either lower or hi part
Differential Revision: https://reviews.llvm.org/D45790

llvm-svn: 330365
2018-04-19 21:16:50 +00:00
Mark Searles 1bc6e71f32 [AMDGPU] Do not only rely on BB number when finding bottom loop
We should also check that the "bottom" basic block of a loopis a successor of the "header" basic block, otherwise we don't propagate the information correctly when the CFG is complex. This fixes an important rendering problem with Wolfsentein 2, because of one vector-memory wait was missing.

Differential Revision: https://reviews.llvm.org/D43831

llvm-svn: 330337
2018-04-19 15:42:30 +00:00
David Stuttard 31f482c26b [AMDGPU] Fix issues for backend divergence tracking
Summary:
A change to use divergence analysis in the AMDGPU backend was getting formal
arguments incorrect (not tagged as divergent) unless they were VGPR0, VGPR1 or
VGPR2

For graphics shaders it is possible to have more than these passed in as VGPR

Modified the checking code to check for any VGPR registers passed in as formal
arguments.

Also, some intrinsics that are sources of divergence may have been lowered
during instruction selection and are missed on subsequent calls to
isSDNodeSourceOfDivergence - added the relevant AMDGPUISD checks as well.

Finally, the FunctionLoweringInfo tracks virtual registers that are live across
basic block boundaries. This is used to check for divergence of CopyFromRegister
registers using the DivergenceAnalysis analysis. For multiple blocks the lazily
evaluated inverted map VirtReg2Value was not cleared when the ValueMap map was.

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45372

Change-Id: I112f3bd6dfe0f62e63ce9b43b893982778e4bee3
llvm-svn: 330257
2018-04-18 13:53:31 +00:00
Stanislav Mekhanoshin 8b20b7dc2b [AMDGPU] Enabled v2.16 literals for VOP3P
Literal encoding needs op_sel_hi to select low 16 bit in this case.

Differential Revision: https://reviews.llvm.org/D45745

llvm-svn: 330230
2018-04-17 23:09:05 +00:00
Tim Renouf fd8d4af3bc [AMDGPU] Ensure there are enough registers for wave dispatch
Summary:
This fixes the number of SGPRs and VGPRs in the *_RSRC1 register to
allow for registers set up in wave dispatch, even if those registers are
not used in the shader.

Re-landed after noticing that the buildbot failure from 329808 seemed to
be unrelated.

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45503

Change-Id: I6575f0e0d2a528d1319d0b289f0ebe4510fa5771
llvm-svn: 329826
2018-04-11 17:18:36 +00:00
Yaxun Liu 9381ae9791 [AMDGPU] Fix lowering enqueue_kernel
Two issues were fixed:

runtime has difficulty to allocate memory for an external symbol of a
kernel and set the address of the external symbol, therefore make the runtime
handle of an enqueued kernel an ordinary global variable. Runtime only needs
to store the address of the loaded kernel to the handle and has verified
that this approach works.

handle the situation where __enqueue_kernel* gets inlined therefore
the enqueued kernel may be used through a constant expr instead
of an instruction.

Differential Revision: https://reviews.llvm.org/D45187

llvm-svn: 329815
2018-04-11 14:46:15 +00:00
Tim Renouf 8ca33bfcf3 Revert "[AMDGPU] Ensure there are enough registers for wave dispatch"
This reverts 329808. That change caused a report of a failure in
test/CodeGen/MIR/AMDGPU/mir-canon-multi.mir that I didn't see. I suspect
it is an expensive-check-only error.

Change-Id: I8133f26f15e7d5ec2b09c687c12cd70e918461b0
llvm-svn: 329811
2018-04-11 14:27:41 +00:00
Tim Renouf f26b723491 [AMDGPU] Ensure there are enough registers for wave dispatch
Summary:
This fixes the number of SGPRs and VGPRs in the *_RSRC1 register to
allow for registers set up in wave dispatch, even if those registers are
not used in the shader.

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D45503

Change-Id: I6575f0e0d2a528d1319d0b289f0ebe4510fa5771
llvm-svn: 329808
2018-04-11 14:02:41 +00:00
Marek Olsak a9a58fa236 AMDGPU: enable 128-bit for local addr space under an option
Author: Samuel Pitoiset

ds_read_b128 and ds_write_b128 have been recently enabled
under the amdgpu-ds128 option because the performance benefit
is unclear.

Though, using 128-bit loads/stores for the local address space
appears to introduce regressions in tessellation shaders. Not
sure what is broken, but as ds_read_b128/ds_write_b128 are not
enabled by default, just introduce a global option and enable
128-bit only if requested (until it's fixed/used correctly).

v2: - fix regressions in merge-stores.ll and multiple_tails.ll

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105464
llvm-svn: 329764
2018-04-10 22:48:23 +00:00
Tim Renouf 7190a4692a [AMDGPU] For OS type AMDPAL, fixed scratch on compute shader
Summary:
For OS type AMDPAL, the scratch descriptor is loaded from offset 0 of
the GIT, whose 32 bit pointer is in s0 (s8 for gfx9 merged shaders).

This commit fixes that to use offset 0x10 instead of offset 0 for a
compute shader, per the PAL ABI spec.

V2: Ensure s0 (s8 for gfx9 merged shader) is marked live-in when loading
scratch descriptor from GIT.

Reviewers: kzhuravl, nhaehnle, timcorringham

Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits, dstuttard, nhaehnle, arsenm

Differential Revision: https://reviews.llvm.org/D44468

Change-Id: I93dffa647758e37f613bb5e0dfca840d82e6d26f
llvm-svn: 329690
2018-04-10 11:25:15 +00:00
Alex Shlyapnikov 79f2c720b5 Revert "AMDGPU: enable 128-bit for local addr space under an option"
This reverts commit r329591.

It breaks various bots:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/16516
http://lab.llvm.org:8011/builders/clang-ppc64be-linux/builds/17374
http://lab.llvm.org:8011/builders/clang-ppc64le-linux/builds/15992
http://lab.llvm.org:8011/builders/clang-ppc64be-linux-lnt
http://lab.llvm.org:8011/builders/clang-ppc64le-linux-lnt/builds/11251
...

llvm-svn: 329610
2018-04-09 19:47:38 +00:00
Marek Olsak 52b033b827 AMDGPU: enable 128-bit for local addr space under an option
Author: Samuel Pitoiset

ds_read_b128 and ds_write_b128 have been recently enabled
under the amdgpu-ds128 option because the performance benefit
is unclear.

Though, using 128-bit loads/stores for the local address space
appears to introduce regressions in tessellation shaders. Not
sure what is broken, but as ds_read_b128/ds_write_b128 are not
enabled by default, just introduce a global option and enable
128-bit only if requested (until it's fixed/used correctly).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105464
llvm-svn: 329591
2018-04-09 16:56:32 +00:00
Konstantin Zhuravlyov c233ae8004 AMDGPU/Metadata: Always report a fixed number of hidden arguments
Currently it is 6. If the "feature" was not used, report dummy
hidden argument. Otherwise it does not match the kernarg size
reported in the kernel header.

Differential Revision: https://reviews.llvm.org/D45129

llvm-svn: 329341
2018-04-05 20:46:04 +00:00
Nicolai Haehnle 2f5a73820c AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.

This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.

Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.

v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI

Change-Id: I099f309e0a394082a5901ea196c3967afb867f04

Reviewers: arsenm, rampitec, b-sumner

Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye

Differential Revision: https://reviews.llvm.org/D44939

llvm-svn: 329166
2018-04-04 10:58:54 +00:00
Nicolai Haehnle 3ffd383a15 AMDGPU: Fix copying i1 value out of loop with non-uniform exit
Summary:
When an i1-value is defined inside of a loop and used outside of it, we
cannot simply use the SGPR bitmask from the loop's last iteration.

There are also useful and correct cases of an i1-value being copied between
basic blocks, e.g. when a condition is computed outside of a loop and used
inside it. The concept of dominators is not sufficient to capture what is
going on, so I propose the notion of "lane-dominators".

Fixes a bug encountered in Nier: Automata.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103743
Change-Id: If37b969ddc71d823ab3004aeafb9ea050e45bd9a

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, llvm-commits, t-tye

Differential Revision: https://reviews.llvm.org/D40547

llvm-svn: 329164
2018-04-04 10:57:58 +00:00
Farhana Aleen e80aeac0f2 [AMDGPU] performMinMaxCombine should not optimize patterns of vectors to min3/max3.
Summary: There are no packed instructions for min3 or max3. So, performMinMaxCombine should not optimize vectors of f16 to min3/max3.

Author: FarhanaAleen

Reviewed By: arsenm

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D45219

llvm-svn: 329131
2018-04-03 23:00:30 +00:00
Farhana Aleen 936947349a Revert "MSG"
This reverts commit 9a0ce889d1c39c74d69ecad5ce9c875155ae55de.

This was committed by mistake.

llvm-svn: 329119
2018-04-03 21:51:45 +00:00
Farhana Aleen 3ab409dc86 MSG
llvm-svn: 329114
2018-04-03 21:20:39 +00:00
Stanislav Mekhanoshin 74e2974ac6 [AMDGPU] Fixed some instructions latencies
Differential Revision: https://reviews.llvm.org/D45073

llvm-svn: 328874
2018-03-30 16:19:13 +00:00
Michael Bedy 59e5ef793c [AMDGPU] Fix the SDWA Peephole phase to handle src for dst:UNUSED_PRESERVE.
Summary:
The phase attempts to transform operations that extract a portion of a value
into an SDWA src operand in cases where that value is used only once. It
was not prepared for this use to be the preserved portion of a value for
dst:UNUSED_PRESERVE, resulting in a crash or assert.

This change either rejects the illegal SDWA attempt, or in the case where
dst:WORD_1 and the src_sel would be WORD_0, removes the unneeded
extract instruction.

Reviewers: arsenm, #amdgpu

Reviewed By: arsenm, #amdgpu

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D44364

llvm-svn: 328856
2018-03-30 05:03:36 +00:00
Matt Arsenault 03ae399d50 AMDGPU: Support realigning stack
While the stack access instructions don't care about
alignment > 4, some transformations on the pointer calculation
do make assumptions based on knowing the low bits of a pointer
are 0. If a stack object ends up being accessed through its
absolute address (relative to the kernel scratch wave offset),
the addressing expression may depend on the stack frame being
properly aligned. This was breaking in a testcase due to the
add->or combine.

I think some of the SP/FP handling logic is still backwards,
and overly simplistic to support all of the stack features.
Code which tries to modify the SP with inline asm for example
or variable sized objects will probably require redoing this.

llvm-svn: 328831
2018-03-29 21:30:06 +00:00
Matt Arsenault ffb132e74b AMDGPU: Increase default stack alignment
8 and 16-byte values are common, so increase the default
alignment to avoid realigning the stack in most functions.

llvm-svn: 328821
2018-03-29 20:22:04 +00:00
Matt Arsenault 6c041a3cab AMDGPU: Fix selection error on constant loads with < 4 byte alignment
llvm-svn: 328818
2018-03-29 19:59:28 +00:00
Tim Renouf cdac172e2a Revert "[AMDGPU] For OS type AMDPAL, fixed scratch on compute shader"
This reverts commit 0daf86291d3aa04d3cc280cd0ef24abdb0174981.

It was causing an assert in test/CodeGen/AMDGPU/amdpal.ll only on a
release-with-asserts build. I will resubmit the change when I have fixed
that.

Change-Id: If270594eba27a7dc4076bdeab3fa8e6bfda3288a
llvm-svn: 328695
2018-03-28 11:21:07 +00:00
Tim Renouf e4208bfa5b [AMDGPU] For OS type AMDPAL, fixed scratch on compute shader
Summary:
For OS type AMDPAL, the scratch descriptor is loaded from offset 0 of
the GIT, whose 32 bit pointer is in s0 (s8 for gfx9 merged shaders).

This commit fixes that to use offset 0x10 instead of offset 0 for a
compute shader, per the PAL ABI spec.

Reviewers: kzhuravl, nhaehnle, timcorringham

Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits, dstuttard, nhaehnle, arsenm

Differential Revision: https://reviews.llvm.org/D44468

Change-Id: I93dffa647758e37f613bb5e0dfca840d82e6d26f
llvm-svn: 328673
2018-03-27 21:35:00 +00:00
Tim Renouf 4db0960420 [CodeGen] Fixed unreachable with -print-machineinstrs and custom pseudo source value
Summary:
Rev 327580 "[CodeGen] Use MIR syntax for MachineMemOperand printing"
broke -print-machineinstrs for us on AMDGPU, because we have custom
pseudo source values, and MIR serialization does not implement that.

This commit at least restores the functionality of -print-machineinstrs,
even if it does not properly implement the missing MIR serialization
functionality.

Differential Revision: https://reviews.llvm.org/D44871

Change-Id: I44961c0b90bf6d48c01484ed7a4e466fd300db66
llvm-svn: 328668
2018-03-27 21:14:04 +00:00
Matt Arsenault 17f3338015 AMDGPU: Fix not preserving CSR VGPR if used for SGPR spills
Before this was not done if the function had no calls in it. This
is still a possible issue with any callable function, regardless
of calls present.

llvm-svn: 328659
2018-03-27 19:42:55 +00:00
Matt Arsenault 0a0c871f60 AMDGPU: Fix crash when MachinePointerInfo invalid
The combine on a select of a load only triggers for
addrspace 0, and discards the MachinePointerInfo. The
conservative default needs to be used for this.

llvm-svn: 328652
2018-03-27 18:39:45 +00:00
Matt Arsenault 126a874952 AMDGPU: Fix register name format in tests
These were changed to match the asm output name a long time ago,
although I think the old tablegenerated names still work.

llvm-svn: 328651
2018-03-27 18:39:42 +00:00
Matt Arsenault e9f3679031 AMDGPU: Fix FP restore from being reordered with stack ops
In a function, s5 is used as the frame base SGPR. If a function
is calling another function, during the call sequence
it is copied to a preserved SGPR and restored.

Before it was possible for the scheduler to move stack operations
before the restore of s5, since there's nothing to associate
a frame index access with the restore.

Add an implicit use of s5 to the adjcallstack pseudo which ends
the call sequence to preven this from happening. I'm not 100%
satisfied with this solution, but I'm not sure what else would be
better.

llvm-svn: 328650
2018-03-27 18:38:51 +00:00
Tony Tye 88441a3d1e [AMDGPU] Update OpenCL to use 48 bytes of implicit arguments for AMDGPU
Add two additional implicit arguments for OpenCL for the AMDGPU target using the AMDHSA runtime to support device enqueue.

Differential Revision: https://reviews.llvm.org/D44697

llvm-svn: 328351
2018-03-23 18:58:47 +00:00
Tony Tye 7a893d4e34 [AMDGPU] Remove use of OpenCL triple environment and replace with function attribute for AMDGPU
- Remove use of the opencl and amdopencl environment member of the target triple for the AMDGPU target.
- Use function attribute to communicate to the AMDGPU backend to add implicit arguments for OpenCL kernels for the AMDHSA OS.

Differential Revision: https://reviews.llvm.org/D43736

llvm-svn: 328349
2018-03-23 18:45:18 +00:00
Sanjay Patel e235942a1e [InstSimplify] fp_binop X, NaN --> NaN
We propagate the existing NaN value when possible.

Differential Revision: https://reviews.llvm.org/D44521

llvm-svn: 328140
2018-03-21 19:31:53 +00:00
Sanjay Patel f2d85e78df [AMDGPU] change test to avoid NaN math
llvm-svn: 327891
2018-03-19 19:26:22 +00:00
Sanjay Patel dad3d13b99 [AMDGPU] adjust tests to be nan-free
As suggested in D44521 - bitcast to integer for the math,
so we preserve the intent of these tests when NaN math
gets folded away.

llvm-svn: 327890
2018-03-19 19:23:53 +00:00
Matt Arsenault fed0a45036 AMDGPU/GlobalISel: RegBankSelect for basic int ops
llvm-svn: 327843
2018-03-19 14:07:23 +00:00
Matt Arsenault 69932e4d69 AMDGPU: Don't leave dead illegal VGPR->SGPR copies
Normally DCE kills these, but at -O0 these get left behind
leaving suspicious looking illegal copies.

Replace with IMPLICIT_DEF to avoid iterator issues.

llvm-svn: 327842
2018-03-19 14:07:15 +00:00
Matt Arsenault abdc4f2dc7 AMDGPU/GlobalISel: Cleanup constant legality
llvm-svn: 327774
2018-03-17 15:17:48 +00:00
Matt Arsenault 685d1e8157 AMDGPU/GlobalISel: Basic G_GEP legality
llvm-svn: 327773
2018-03-17 15:17:45 +00:00
Matt Arsenault 85803366d6 AMDGPU/GlobalISel: Basic legality for load/store
llvm-svn: 327772
2018-03-17 15:17:41 +00:00
Farhana Aleen c6c9dc8773 [AMDGPU] Supported ds_write_b128 generation.
Summary: This is a follow-on patch of https://reviews.llvm.org/D44210

Author: FarhanaAleen

Reviewed By: msearles

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D44319

llvm-svn: 327726
2018-03-16 18:12:00 +00:00
Dmitry Preobrazhensky 4c8f4234b6 [AMDGPU][MC][GFX8][GFX9][DISASSEMBLER] Added "_e32" suffix to 32-bit VINTRP opcodes
See bug 36751: https://bugs.llvm.org/show_bug.cgi?id=36751

Differential Revision: https://reviews.llvm.org/D44529

Reviewers: artem.tamazov, arsenm
llvm-svn: 327723
2018-03-16 16:38:04 +00:00
Mark Searles c3c02bde73 [AMDGPU] Waitcnt pass: Modify the waitcnt pass to propagate info in the case of a single basic block loop. mergeInputScoreBrackets() does this for us; update it so that it processes the single bb's score bracket when processing the single bb's preds. It is, after all, a pred of itself, so it's score bracket is needed.
Differential Revision: https://reviews.llvm.org/D44434

llvm-svn: 327583
2018-03-14 22:04:32 +00:00
Francis Visoiu Mistrih e85b06d65f [CodeGen] Use MIR syntax for MachineMemOperand printing
Get rid of the "; mem:" suffix and use the one we use in MIR: ":: (load 2)".

rdar://38163529

Differential Revision: https://reviews.llvm.org/D42377

llvm-svn: 327580
2018-03-14 21:52:13 +00:00
Yaxun Liu a99e7d8e44 [AMDGPU] Fix lowering enqueue kernel when kernel has no name
Since the enqueued kernels have internal linkage, their names may be dropped.
In this case, give them unique names __amdgpu_enqueued_kernel or
__amdgpu_enqueued_kernel.n where n is a sequential number starting from 1.

Differential Revision: https://reviews.llvm.org/D44322

llvm-svn: 327291
2018-03-12 16:34:06 +00:00
Dmitry Preobrazhensky da4a7c01bf [AMDGPU][MC] Corrected GATHER4 opcodes
See bug 36252: https://bugs.llvm.org/show_bug.cgi?id=36252

Differential Revision: https://reviews.llvm.org/D43874

Reviewers: artem.tamazov, arsenm
llvm-svn: 327278
2018-03-12 15:03:34 +00:00
Matt Arsenault 7b9ed89dcf AMDGPU/GlobalISel: Legality and RegBankInfo for G_{INSERT|EXTRACT}_VECTOR_ELT
llvm-svn: 327269
2018-03-12 13:35:53 +00:00
Matt Arsenault c0aefd561e AMDGPU/GlobalISel: InstrMapping for G_MERGE_VALUES
llvm-svn: 327268
2018-03-12 13:35:49 +00:00
Matt Arsenault 503afda95f AMDGPU/GlobalISel: Make some G_MERGE_VALUEs legal
llvm-svn: 327267
2018-03-12 13:35:43 +00:00
Sanjay Patel 3b36bb0362 [AMDGPU] fix tests to be independent of FP undef
llvm-svn: 327211
2018-03-10 16:39:59 +00:00
Matt Arsenault cbda7ff4ae AMDGPU: Fix crash when constant folding with physreg operand
llvm-svn: 327209
2018-03-10 16:05:35 +00:00
Farhana Aleen a7cb31123c [AMDGPU] Supported ds_read_b128 generation; Widened vector length for local address-space.
Summary: Starting from GCN 2nd generation, ISA supports ds_read_b128 on top of ds_read_b64.
         This patch supports ds_read_b128 instruction pattern and generation of this instruction.
         In the vectorizer, this patch also widen the vector length so that vectorizer generates
         128 bit loads for local address-space which gets translated to ds_read_b128.
         Since the performance benefit is not clear; compiler generates ds_read_b128 under -amdgpu-ds128.

Author: FarhanaAleen

Reviewed By: rampitec, arsenm

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D44210

llvm-svn: 327153
2018-03-09 17:41:39 +00:00
Sanjay Patel 56d59c1f0f [AMDGPU] fix test to be independent of FP undef
llvm-svn: 327147
2018-03-09 16:33:34 +00:00
Stanislav Mekhanoshin c8127fc674 [AMDGPU] Fixed V_DIV_FIXUP_F16 selection on GFX9
GFX9 should select opsel version.

Differential Revision: https://reviews.llvm.org/D44279

llvm-svn: 327106
2018-03-09 07:21:43 +00:00
Sanjay Patel 672ad3269b [AMDGPU] fix test to survive more FP undef constant folding
llvm-svn: 327066
2018-03-08 21:30:56 +00:00
Sanjay Patel 7325d12f58 [AMDGPU] fix test to survive the most basic undef constant folding
This will likely need to be changed again for anything more than:
fmul undef, undef -> undef

llvm-svn: 327034
2018-03-08 17:34:25 +00:00
Farhana Aleen 89196642f7 [AMDGPU] Increased vector length for global/constant loads.
Summary: GCN ISA supports instructions that can read 16 consecutive dwords from memory through the scalar data cache;
         loadstoreVectorizer should take advantage of the wider vector length and pack 16/8 elements of dwords/quadwords.

Author: FarhanaAleen

Reviewed By: rampitec

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D44179

llvm-svn: 326910
2018-03-07 17:09:18 +00:00
Farhana Aleen 347d12b4ce Revert "[AMDGPU] Widened vector length for global/constant address space."
This reverts commit ce988cc100dc65e7c6c727aff31ceb99231cab03.

llvm-svn: 326907
2018-03-07 16:55:27 +00:00
Farhana Aleen 0d03d0588d [AMDGPU] Widened vector length for global/constant address space.
llvm-svn: 326904
2018-03-07 16:29:05 +00:00
Yaxun Liu 46439e8d4a [AMDGPU] Fix lowering OpenCL enqueue_kernel
One addrspacecast disappeared in clang emitted IR for
block invoke function due to adoption of the new
addr space mapping.

Differential Revision: https://reviews.llvm.org/D43785

llvm-svn: 326806
2018-03-06 16:04:39 +00:00
Matt Arsenault e31ab94e97 AMDGPU/GlobalISel: Add InstrMapping for G_EXTRACT
llvm-svn: 326715
2018-03-05 16:25:18 +00:00
Matt Arsenault 71272e6d4e AMDGPU/GlobalISel: Make some G_EXTRACTs legal
As far as I can tell legalization of weird sizes for the
output type isn't implemented.

llvm-svn: 326714
2018-03-05 16:25:15 +00:00
Alexander Timofeev 2e5eeceeb7 Pass Divergence Analysis data to Selection DAG to drive divergence
dependent instruction selection.

Differential revision: https://reviews.llvm.org/D35267

llvm-svn: 326703
2018-03-05 15:12:21 +00:00
Matt Arsenault b9699c009d AMDGPU/GlobalISel: InstrMapping for G_ZEXT
llvm-svn: 326589
2018-03-02 16:55:37 +00:00
Matt Arsenault 1c1aab99ae AMDGPU/GlobalISel: InstrMapping for G_TRUNC
llvm-svn: 326588
2018-03-02 16:55:33 +00:00
Matt Arsenault ef8db767d7 AMDGPU/GlobalISel: Define InstrMappings for G_FCMP
Patch by Tom Stellard

llvm-svn: 326587
2018-03-02 16:53:15 +00:00
Matt Arsenault 2607dc60de AMDGPU/GlobalISel: Define instruction mapping for @llvm.minnum
Patch by Tom Stellard

llvm-svn: 326586
2018-03-02 16:40:17 +00:00
Matt Arsenault b46c191c49 AMDGPU/GlobalISel: Define instruction mapping for @llvm.maxnum
Patch by Tom Stellard

llvm-svn: 326567
2018-03-02 12:23:00 +00:00
Jan Vesely b283ea0f0f AMDGPU/GCN: Promote i16 ctpop
i16 capable ASICs do not support i16 operands for this instruction.
Add tablegen pattern to merge chained i16 additions.

Differential Revision: https://reviews.llvm.org/D43985

llvm-svn: 326535
2018-03-02 02:50:22 +00:00
Matt Arsenault 41d2e3d98e AMDGPU/GlobalISel: Define instruction mapping for G_FPTOSI
Patch by Tom Stellard

llvm-svn: 326534
2018-03-02 02:19:16 +00:00
Matt Arsenault b23041ad4d AMDGPU/GlobalISel: Define instruction mapping for G_FPTOUI
Patch by Tom Stellard

llvm-svn: 326533
2018-03-02 02:19:11 +00:00
Matt Arsenault 327d5fb2e5 AMDGPU/GlobalISel: Define instruction mapping for G_FMUL
llvm-svn: 326532
2018-03-02 02:17:01 +00:00
Matt Arsenault 5a9e834eac AMDGPU/GlobalISel: Define instruction mapping for G_FADD
Patch by Tom Stellard

llvm-svn: 326526
2018-03-02 01:22:13 +00:00
Matt Arsenault d99317f1b3 AMDGPU/GlobalISel: Define instruction mapping for G_SHL
Patch by Tom Stellard

llvm-svn: 326525
2018-03-02 01:22:10 +00:00
Matt Arsenault 3c7a123ccc AMDGPU/GlobalISel: Define instruction mapping for G_XOR
llvm-svn: 326524
2018-03-02 01:22:06 +00:00
Matt Arsenault c0f34c9e36 AMDGPU/GlobalISel: Define instruction mapping for G_AND
Patch by Tom Stellard

llvm-svn: 326523
2018-03-02 01:22:01 +00:00
Matt Arsenault 364f12e8f9 AMDGPU/GlobalISel: Define instruction mapping for @llvm.amdgcn.cvt.pkrtz
Patch by Tom Stellard

llvm-svn: 326490
2018-03-01 21:25:30 +00:00
Matt Arsenault 5320ee4a05 AMDGPU/GlobalISel: Define instruction mapping for G_OR
Patch by Tom Stellard

llvm-svn: 326489
2018-03-01 21:25:25 +00:00
Matt Arsenault 62669ede94 AMDGPU/GlobalISel: Define instruction mapping for G_BITCAST
Patch by Tom Stellard

llvm-svn: 326482
2018-03-01 20:59:44 +00:00
Matt Arsenault 0529a8e2de AMDGPU/GlobalISel: Mark i32->i64 zext as legal
llvm-svn: 326481
2018-03-01 20:56:21 +00:00
Matt Arsenault 36b99e1937 AMDGPU/GlobalISel: InstrMapping for llvm.amdgcn.exp.compr
Patch by Tom Stellard

llvm-svn: 326479
2018-03-01 20:40:55 +00:00
Matt Arsenault 8931bbf8df AMDGPU/GlobalISel: Define instruction mapping for @llvm.amdgcn.exp
Patch by Tom Stellard

llvm-svn: 326477
2018-03-01 20:24:37 +00:00
Matt Arsenault 50721ab325 AMDGPU/GlobalISel: Define InstrMappings for G_ICMP
Patch by Tom Stellard

llvm-svn: 326472
2018-03-01 19:27:10 +00:00
Matt Arsenault dc14ec05d4 AMDGPU/GlobalISel: Make i32 mul legal
llvm-svn: 326471
2018-03-01 19:22:05 +00:00
Matt Arsenault 06cbb27a79 AMDGPU/GlobalISel: Define instruction mapping for G_IMPLICIT_DEF
Patch by Tom Stellard

llvm-svn: 326470
2018-03-01 19:16:52 +00:00
Matt Arsenault e3d9ecf2b9 AMDGPU/GlobalISel: Define instruction mapping for G_FCONSTANT
Patch by Tom Stellard

llvm-svn: 326468
2018-03-01 19:13:30 +00:00