Commit Graph

3313 Commits

Author SHA1 Message Date
Matt Arsenault fbec8fe93b GlobalISel: Implement narrowScalar for shift main type
This is pretty much directly ported from SelectionDAG. Doesn't include
the shift by non-constant but known bits version, since there isn't a
globalisel version of computeKnownBits yet.

This shows a disadvantage of targets not specifically which type
should be used for the shift amount. If type 0 is legalized before
type 1, the operations on the shift amount type use the wider type
(which are also less likely to legalize). This can be avoided by
targets specifying legalization actions on type 1 earlier than for
type 0.

llvm-svn: 353455
2019-02-07 19:37:44 +00:00
Matt Arsenault d914189a2e AMDGPU/GlobalISel: Restrict g_implicit_def legality
llvm-svn: 353452
2019-02-07 19:10:15 +00:00
Matt Arsenault c0f7569aab AMDGPU/GlobalISel: Legalize fsqrt
llvm-svn: 353438
2019-02-07 18:14:39 +00:00
Matt Arsenault 93fdec739b AMDGPU/GlobalISel: Legalize some f16 operations
llvm-svn: 353436
2019-02-07 18:03:11 +00:00
Matt Arsenault c83b82363c GlobalISel: Implement fewerElementsVector for shifts
Introduce a new function which handles instructions with multiple type
indices, but have the same number of vector elements.

Also legalize v2s16 shifts when applicable.

llvm-svn: 353432
2019-02-07 17:38:00 +00:00
Matt Arsenault 91be65be65 GlobalISel: Try to make legalize rules more useful for vectors
Mostly keep the existing functions on scalars, but add versions which
also operate based on the vector element size.

llvm-svn: 353430
2019-02-07 17:25:51 +00:00
Scott Linder e2c5847414 [AMDGPU] Consider XOR in waterfall loop as a terminator
Ensure the XOR in the waterfall loop for indirect addressing is considered a terminator.

Differential Revision: https://reviews.llvm.org/D57703

llvm-svn: 353207
2019-02-05 19:50:32 +00:00
Matt Arsenault a3f9b71c09 AMDGPU: Fix assert on trunc from bitcast of build_vector
The v2i64 argument is lowered to a bitcast of v4i32 build_vector.
This would then attempt to use the i32-element as the source of the
vector truncate. This really would need to collect 2 elements from the
build_vector to produce the intended truncate.

llvm-svn: 353202
2019-02-05 19:23:57 +00:00
Matt Arsenault cba0c6d0c9 AMDGPU: Don't rematerialize mov with implicit operands
This was pulling the mov used for register indexing on gfx9 out of the
loop.

llvm-svn: 353101
2019-02-04 22:26:21 +00:00
Scott Linder d19d197221 [AMDGPU] Support emitting GOT relocations for function calls
Differential Revision: https://reviews.llvm.org/D57416

llvm-svn: 353083
2019-02-04 20:00:07 +00:00
Matt Arsenault 10547230f3 AMDGPU/GlobalISel: Legalize select for v4s16
Also add some more select tests to help show future legalization
changes.

llvm-svn: 353045
2019-02-04 14:04:52 +00:00
Fangrui Song b21ed3c57a [AMDGPU] Fix -Wunused-variable after rL352978
llvm-svn: 352982
2019-02-03 03:51:52 +00:00
Matt Arsenault 888aa5dedd GlobalISel: Implement widenScalar for G_UNMERGE_VALUES
For the scalar case only.

Also move the similar G_MERGE_VALUES handling to a separate function
and cleanup to make them look more similar.

llvm-svn: 352979
2019-02-03 00:07:33 +00:00
Matt Arsenault 0e5d856eb8 GlobalISel: Implement widenScalar for G_EXTRACT vector sources
Handle the basic element extract case.

llvm-svn: 352978
2019-02-02 23:56:00 +00:00
Matt Arsenault eb2603cfb2 AMDGPU/GlobalISel: Avoid reporting illegal extloads as legal
This avoids breaking a test in a future commit.

llvm-svn: 352977
2019-02-02 23:39:13 +00:00
Matt Arsenault 58f9d3df97 AMDGPU/GlobalISel: Legalize icmp for pointer types
llvm-svn: 352976
2019-02-02 23:35:15 +00:00
Matt Arsenault 2065c94dd3 AMDGPU/GlobalISel: Legalize constant for pointer types
llvm-svn: 352975
2019-02-02 23:33:49 +00:00
Matt Arsenault 2491f82679 AMDGPU/GlobalISel: Legalize select for pointer types
llvm-svn: 352974
2019-02-02 23:31:50 +00:00
Matt Arsenault cbaada6bc1 GlobalISel: Legalization for inttoptr/ptrtoint
llvm-svn: 352973
2019-02-02 23:29:55 +00:00
James Y Knight 7716075a17 [opaque pointer types] Pass value type to GetElementPtr creation.
This cleans up all GetElementPtr creation in LLVM to explicitly pass a
value type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57173

llvm-svn: 352913
2019-02-01 20:44:47 +00:00
James Y Knight 14359ef1b6 [opaque pointer types] Pass value type to LoadInst creation.
This cleans up all LoadInst creation in LLVM to explicitly pass the
value type rather than deriving it from the pointer's element-type.

Differential Revision: https://reviews.llvm.org/D57172

llvm-svn: 352911
2019-02-01 20:44:24 +00:00
Tim Corringham fa3e4e5b53 [AMDGPU] Fix for vector element insertion
Summary:
Incorrect code was generated when lowering insertelement operations
for vectors with 8 or 16 bit elements.  The value being inserted was
not adjusted for the position of the element within the 32 bit word
and so only the low element within each 32 bit word could receive
the intended value.

Fixed by simply replicating the value to each element of a
congruent vector before the mask and or operation used to
update the intended element.

A number of affected LIT tests have been updated appropriately.

before the mask & or into the intended

Reviewers: arsenm, nhaehnle

Reviewed By: arsenm

Subscribers: llvm-commits, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57588

llvm-svn: 352885
2019-02-01 16:51:09 +00:00
Yevgeny Rouban 15b17d0a7c Provide reason messages for unviable inlining
InlineCost's isInlineViable() is changed to return InlineResult
instead of bool. This provides messages for failure reasons and
allows to get more specific messages for cases where callsites
are not viable for inlining.

Reviewed By: xbolva00, anemet

Differential Revision: https://reviews.llvm.org/D57089

llvm-svn: 352849
2019-02-01 10:44:43 +00:00
James Y Knight 13680223b9 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
Recommit r352791 after tweaking DerivedTypes.h slightly, so that gcc
doesn't choke on it, hopefully.

Original Message:
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352827
2019-02-01 02:28:03 +00:00
James Y Knight fadf25068e Revert "[opaque pointer types] Add a FunctionCallee wrapper type, and use it."
This reverts commit f47d6b38c7 (r352791).

Seems to run into compilation failures with GCC (but not clang, where
I tested it). Reverting while I investigate.

llvm-svn: 352800
2019-01-31 21:51:58 +00:00
James Y Knight f47d6b38c7 [opaque pointer types] Add a FunctionCallee wrapper type, and use it.
The FunctionCallee type is effectively a {FunctionType*,Value*} pair,
and is a useful convenience to enable code to continue passing the
result of getOrInsertFunction() through to EmitCall, even once pointer
types lose their pointee-type.

Then:
- update the CallInst/InvokeInst instruction creation functions to
  take a Callee,
- modify getOrInsertFunction to return FunctionCallee, and
- update all callers appropriately.

One area of particular note is the change to the sanitizer
code. Previously, they had been casting the result of
`getOrInsertFunction` to a `Function*` via
`checkSanitizerInterfaceFunction`, and storing that. That would report
an error if someone had already inserted a function declaraction with
a mismatching signature.

However, in general, LLVM allows for such mismatches, as
`getOrInsertFunction` will automatically insert a bitcast if
needed. As part of this cleanup, cause the sanitizer code to do the
same. (It will call its functions using the expected signature,
however they may have been declared.)

Finally, in a small number of locations, callers of
`getOrInsertFunction` actually were expecting/requiring that a brand
new function was being created. In such cases, I've switched them to
Function::Create instead.

Differential Revision: https://reviews.llvm.org/D57315

llvm-svn: 352791
2019-01-31 20:35:56 +00:00
Matt Arsenault c7bce739ad GlobalISel: Handle odd splits in fewerElementsVector for load/store
llvm-svn: 352720
2019-01-31 02:46:05 +00:00
Matt Arsenault d1bfc8d0c3 GlobalISel: Implement narrowScalar for bswap
llvm-svn: 352719
2019-01-31 02:34:03 +00:00
Matt Arsenault d5684f76e0 GlobalISel: Allow bitcount ops to have different result type
For AMDGPU the result is always 32-bit for 64-bit inputs.

llvm-svn: 352717
2019-01-31 02:09:57 +00:00
Matt Arsenault 2a64598ef2 GlobalISel: Fix creating MMOs with align 0
llvm-svn: 352712
2019-01-31 01:38:47 +00:00
Erik Pilkington 600e9deacf Add a 'dynamic' parameter to the objectsize intrinsic
This is meant to be used with clang's __builtin_dynamic_object_size.
When 'true' is passed to this parameter, the intrinsic has the
potential to be folded into instructions that will be evaluated
at run time. When 'false', the objectsize intrinsic behaviour is
unchanged.

rdar://32212419

Differential revision: https://reviews.llvm.org/D56761

llvm-svn: 352664
2019-01-30 20:34:35 +00:00
Matt Arsenault 4c0409e9c7 AMDGPU: Stop generating unused intrinsic .inc files
llvm-svn: 352635
2019-01-30 17:25:37 +00:00
Matt Arsenault dc6c78596b GlobalISel: Implement fewerElementsVector for select
llvm-svn: 352601
2019-01-30 04:19:31 +00:00
Matt Arsenault f6cab16258 AMDGPU/GlobalISel: Fix clamping shifts with 16-bit insts
llvm-svn: 352599
2019-01-30 03:36:25 +00:00
Matt Arsenault 045bc9a4a6 GlobalISel: Support narrowScalar for uneven loads
llvm-svn: 352594
2019-01-30 02:35:38 +00:00
Matt Arsenault d8d193d5e2 GlobalISel: Partially implement widenScalar for MERGE_VALUES
llvm-svn: 352560
2019-01-29 23:17:35 +00:00
Matt Arsenault 18619afe1d GlobalISel: Fix narrowScalar for load/store with different mem size
This was ignoring the memory size, and producing multiple loads/stores
if the operand size was different from the memory size.

I assume this is the intent of not having an explicit G_ANYEXTLOAD
(although I think that would probably be better).

llvm-svn: 352523
2019-01-29 18:13:02 +00:00
James Y Knight 5d71fc5d7b Adjust documentation for git migration.
This fixes most references to the paths:
 llvm.org/svn/
 llvm.org/git/
 llvm.org/viewvc/
 github.com/llvm-mirror/
 github.com/llvm-project/
 reviews.llvm.org/diffusion/

to instead point to https://github.com/llvm/llvm-project.

This is *not* a trivial substitution, because additionally, all the
checkout instructions had to be migrated to instruct users on how to
use the monorepo layout, setting LLVM_ENABLE_PROJECTS instead of
checking out various projects into various subdirectories.

I've attempted to not change any scripts here, only documentation. The
scripts will have to be addressed separately.

Additionally, I've deleted one document which appeared to be outdated
and unneeded:
  lldb/docs/building-with-debug-llvm.txt

Differential Revision: https://reviews.llvm.org/D57330

llvm-svn: 352514
2019-01-29 16:37:27 +00:00
Neil Henning 0799352026 [AMDGPU] Fix a weird WWM intrinsic issue.
I found a really strange WWM issue through a very convoluted shader that
essentially boils down to a bug in SIInstrInfo where canReadVGPR did not
correctly identify that WWM is like a copy and can have a VGPR as its
source.

Differential Revision: https://reviews.llvm.org/D56002

llvm-svn: 352500
2019-01-29 14:28:17 +00:00
Matt Arsenault cdd191d9db AMDGPU: Add DS append/consume intrinsics
Since these pass the pointer in m0 unlike other DS instructions, these
need to worry about whether the address is uniform or not. This
assumes the address is dynamically uniform, and just uses
readfirstlane to get a copy into an SGPR.

I don't know if these have the same 16-bit add for the addressing mode
offset problem on SI or not, but I've just assumed they do.

Also includes some misc. changes to avoid test differences between the
LDS and GDS versions.

llvm-svn: 352422
2019-01-28 20:14:49 +00:00
Tim Corringham 824ca3f3dd [AMDGPU] Add intrinsics for 16 bit interpolation
Summary:
Added the intrinsics llvm.amdgcn.interp.p1.f16() and
llvm.amdgcn.interp.p2.f16() and related LIT test.

The p1 intrinsic generates code appropriate for both 16 and 32
bank LDS.

Reviewers: #amdgpu, dstuttard, arsenm, tpr

Reviewed By: #amdgpu, arsenm

Subscribers: jvesely, mgorny, arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D46754

llvm-svn: 352357
2019-01-28 13:48:59 +00:00
Matt Arsenault 211e89d4dd GlobalISel: Implement narrowScalar for mul
llvm-svn: 352300
2019-01-27 00:52:51 +00:00
Matt Arsenault 2e5f900849 GlobalISel: fewerElementsVector for intrinsic_trunc/intrinsic_round
llvm-svn: 352298
2019-01-27 00:12:21 +00:00
Matt Arsenault ded2f82662 AMDGPU/GlobalISel: Use scalarize instead of clampMaxNumElements
llvm-svn: 352297
2019-01-26 23:54:53 +00:00
Matt Arsenault 26a6c74fbe AMDGPU/GlobalISel: Legalize more bit ops
llvm-svn: 352295
2019-01-26 23:47:07 +00:00
Matt Arsenault 4d47594fc5 AMDGPU/GlobalISel: Widen small uaddo/usubo
llvm-svn: 352294
2019-01-26 23:44:51 +00:00
Matt Arsenault 3b9a82ff2c AMDGPU/GlobalISel: Remove leftover setAction
Also move G_GEP actions together.

llvm-svn: 352168
2019-01-25 04:54:00 +00:00
Matt Arsenault 3e08b772b3 AMDGPU/GlobalISel: Scalarize add/sub
llvm-svn: 352167
2019-01-25 04:53:57 +00:00
Matt Arsenault e6cebd0d69 GlobalISel: fewerElementsVector for more cast types
llvm-svn: 352166
2019-01-25 04:37:33 +00:00
Matt Arsenault 95fd95cfe0 GlobalISel: fewerElementsVector for a few more trivial ops
llvm-svn: 352165
2019-01-25 04:03:38 +00:00
Matt Arsenault 5d622fbcc1 AMDGPU/GlobalISel: Legalize smulh/umulh and scalarize mul
llvm-svn: 352162
2019-01-25 03:23:04 +00:00
Matt Arsenault 1b1e685f10 GlobalISel: Support fewerElementsVector for icmp/fcmp
Also legalize 64-bit compares for AMDGPU

llvm-svn: 352157
2019-01-25 02:59:34 +00:00
Matt Arsenault ca676343a9 GlobalISel: Implement fewerElementsVector for extensions
llvm-svn: 352155
2019-01-25 02:36:32 +00:00
Matt Arsenault 990f507704 GlobalISel: Add convenience mutatations to scalarize
llvm-svn: 352143
2019-01-25 00:51:00 +00:00
Matt Arsenault baa5d2e69c RegBankSelect: Support some more complex part mappings
llvm-svn: 352123
2019-01-24 22:47:04 +00:00
Tim Renouf f64f8efe13 [AMDGPU] With XNACK, cannot clause a load with result coalesced with operand
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.

The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.

Differential Revision: https://reviews.llvm.org/D57008

Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
2019-01-23 13:38:06 +00:00
Matt Arsenault 4c5e8f51e7 AMDGPU/GlobalISel: Start selectively legalizing 16-bit operations
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this was requiring more boilerplate and disables the coverage
assertions.

llvm-svn: 351886
2019-01-22 22:00:19 +00:00
Matt Arsenault 736cfa9ffb AMDGPU/GlobalISel: Handle legality/regbanks for 32/64-bit shifts
llvm-svn: 351884
2019-01-22 21:51:38 +00:00
Matt Arsenault 30989e492b GlobalISel: Allow shift amount to be a different type
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.

X86 uses i8, but seemed to be hacking around this before.

llvm-svn: 351882
2019-01-22 21:42:11 +00:00
Matt Arsenault 6378629609 GlobalISel: Implement widen for extract_vector_elt elt type
llvm-svn: 351871
2019-01-22 20:38:15 +00:00
Matt Arsenault aebb2ee036 GlobalISel: Implement fewerElementsVector for basic FP ops
llvm-svn: 351866
2019-01-22 20:14:29 +00:00
Matt Arsenault 41a8bee93b AMDGPU/GlobalISel: Remove vectors from legal constant types
llvm-svn: 351859
2019-01-22 19:04:51 +00:00
Matt Arsenault 6614f852b6 GlobalISel: Support narrowing zextload/sextload
llvm-svn: 351856
2019-01-22 19:02:10 +00:00
Matt Arsenault a5840c3c39 Codegen support for atomicrmw fadd/fsub
llvm-svn: 351851
2019-01-22 18:36:06 +00:00
Matt Arsenault fb67164ebc AMDGPU/GlobalISel: Legalize more fp<->int conversions
llvm-svn: 351767
2019-01-22 00:20:17 +00:00
Stanislav Mekhanoshin f92ed6966e [AMDGPU] Fixed hazard recognizer to walk predecessors
Fixes two problems with GCNHazardRecognizer:
1. It only scans up to 5 instructions emitted earlier.
2. It does not take control flow into account. An earlier instruction
from the previous basic block is not necessarily a predecessor.
At the same time a real predecessor block is not scanned.

The patch provides a way to distinguish between scheduler and
hazard recognizer mode. It is OK to work with emitted instructions
in the scheduler because we do not really know what will be emitted
later and its order. However, when pass works as a hazard recognizer
the schedule is already finalized, and we have full access to the
instructions for the whole function, so we can properly traverse
predecessors and their instructions.

Differential Revision: https://reviews.llvm.org/D56923

llvm-svn: 351759
2019-01-21 19:11:26 +00:00
Matt Arsenault 7ac79ed8f0 AMDGPU: Legalize more bitcasts
llvm-svn: 351700
2019-01-20 19:45:18 +00:00
Matt Arsenault 46ffe68d77 AMDGPU/GlobalISel: Really legalize exts from i1
There is a combine that was hiding these tests
not actually testing what they should be, although
they were producing the expected end result.

llvm-svn: 351698
2019-01-20 19:28:20 +00:00
Matt Arsenault 745fd9f547 GlobalISel: Implement widenScalar for basic FP ops
llvm-svn: 351696
2019-01-20 19:10:31 +00:00
Matt Arsenault cfd9e7f594 AMDGPU/GlobalISel: Legalize f32->f16 fptrunc
llvm-svn: 351695
2019-01-20 19:10:26 +00:00
Matt Arsenault ff6a9a275b AMDGPU/GlobalISel: Fix some crashs in g_unmerge_values/g_merge_values
This was crashing in the predicate function assuming the value
is a vector.

Copy more of what AArch64 uses. This probably needs more refinement
later, but I don't exactly understand what it means in some cases,
particularly since any legalization for these seems to be missing.

llvm-svn: 351693
2019-01-20 18:40:36 +00:00
Matt Arsenault 2a2086b830 AMDGPU/GlobalISel: Regbank select for fpext
llvm-svn: 351692
2019-01-20 18:35:41 +00:00
Matt Arsenault 24563ef628 AMDGPU/GlobalISel: Cleanup legality for extensions
llvm-svn: 351691
2019-01-20 18:34:24 +00:00
Chandler Carruth 2946cd7010 Update the file headers across all of the LLVM projects in the monorepo
to reflect the new license.

We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.

Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.

llvm-svn: 351636
2019-01-19 08:50:56 +00:00
Matt Arsenault 96e4701401 AMDGPU/GlobalISel: Legalize more types for select
llvm-svn: 351599
2019-01-18 21:42:55 +00:00
Matt Arsenault 4599159ac3 AMDGPU/GlobalISel: Legalize illegal g_constant
llvm-svn: 351596
2019-01-18 21:33:50 +00:00
Matt Arsenault 85af701e85 AMDGPU: Remove llvm.SI.load.const
It's taken 3 years, but now all of the old AMDGPU and SI intrinsics
are finally gone

llvm-svn: 351586
2019-01-18 20:27:02 +00:00
Neil Henning 3ed09f8e0c [AMDGPU] Add some missing always-uniform values.
This commit adds some missing intrinsics into the isAlwaysUniform list
for the AMDGPU backend.

Differential Revision: https://reviews.llvm.org/D56845

llvm-svn: 351562
2019-01-18 16:39:27 +00:00
Dmitry Preobrazhensky 6bc26aaada [AMDGPU][MC][GFX8+][DISASSEMBLER] Corrected 1/2pi value for 64-bit operands
See bug 39332: https://bugs.llvm.org/show_bug.cgi?id=39332

Reviewers: artem.tamazov, arsenm

Differential Revision: https://reviews.llvm.org/D56794

llvm-svn: 351555
2019-01-18 15:17:17 +00:00
Dmitry Preobrazhensky 61105bab29 [AMDGPU][MC] Disabled use of 2 different literals with SOP2/SOPC instructions
See bug 39319: https://bugs.llvm.org/show_bug.cgi?id=39319

Reviewers: artem.tamazov, arsenm, rampitec

Differential Revision: https://reviews.llvm.org/D56847

llvm-svn: 351549
2019-01-18 13:57:43 +00:00
Changpeng Fang fe9269f804 AMDGPU: Adjust the chain for loads writing to the HI part of a register.
Summary:
  For these loads that write to the HI part of a register, we should chain them to the op that writes to the LO part
of the register to maintain the appropriate order.

Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D56454

llvm-svn: 351379
2019-01-16 21:32:53 +00:00
Marek Olsak c5cec5e1fa AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap
Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52944

llvm-svn: 351351
2019-01-16 15:43:53 +00:00
Changpeng Fang 20fe3d2f35 AMDGPU: Raise the priority of MAD24 in instruction selection.
Summary:
  We have seen performance regression when v_add3 is generated. The major reason is that the v_mad pattern
is broken when v_add3 is generated. We also see the register pressure increased. While we could not properly
estimate register pressure during instruction selection, we can give mad a higher priority.

In this work, we raise the priority for mad24 in selection and resolve the performance regression.

Reviewers:
  rampitec

Differential Revision:
  https://reviews.llvm.org/D56745

llvm-svn: 351273
2019-01-15 23:12:36 +00:00
Marek Olsak 33eb4d947d AMDGPU: Add a fast path for icmp.i1(src, false, NE)
Summary:
This allows moving the condition from the intrinsic to the standard ICmp
opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic
is an identity for retrieving the SGPR mask.

And we can also get the mask from and i1, or i1, xor i1.

Reviewers: arsenm, nhaehnle

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52060

llvm-svn: 351150
2019-01-15 02:13:18 +00:00
David Stuttard f77079f892 [AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).

There's an additional fix now to avoid a dmask=0

For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.

Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.

The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:

%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
                                      i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1

This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.

Differential revision: https://reviews.llvm.org/D48826

Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda


Work around for ppcle compiler bug

Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 11:55:24 +00:00
Neil Henning e85d45a699 [AMDGPU] Fix dwordx3/southern-islands failures.
This commit fixes the dwordx3/southern-islands failures that were found
in bugzilla https://bugs.llvm.org/show_bug.cgi?id=40129, by not
generating the dwordx3 variants of load/store instructions that were
added to the ISA after southern islands.

Differential Revision: https://reviews.llvm.org/D56434

llvm-svn: 350838
2019-01-10 16:21:08 +00:00
James Y Knight 62df5eed16 [opaque pointer types] Remove some calls to generic Type subtype accessors.
That is, remove many of the calls to Type::getNumContainedTypes(),
Type::subtypes(), and Type::getContainedType(N).

I'm not intending to remove these accessors -- they are
useful/necessary in some cases. However, removing the pointee type
from pointers would potentially break some uses, and reducing the
number of calls makes it easier to audit.

llvm-svn: 350835
2019-01-10 16:07:20 +00:00
Stanislav Mekhanoshin d3757d3f3a [AMDGPU] Separate feature dot-insts
Differential Revision: https://reviews.llvm.org/D56524

llvm-svn: 350793
2019-01-10 03:25:20 +00:00
Valery Pykhtin b7a459547d Revert "[AMDGPU] Fix DPP combiner"
This reverts commit e3e2923a39cbec3b3bc3a7d3f0e9a77a4115080e, svn revision rL350721

llvm-svn: 350730
2019-01-09 15:21:53 +00:00
Valery Pykhtin 1e0b5c719b [AMDGPU] Fix DPP combiner
Fixed issue with identity values and other cases, f32/f16 identity values to be added later. fma/mac instructions is disabled for now.
Test is fully reworked, added comments. Other fixes:

1. dpp move with uses and old reg initializer should be in the same BB.
2. bound_ctrl:0 is only considered when bank_mask and row_mask are fully enabled (0xF). Othervise the old register value is checked for identity.
3. Added add, subrev, and, or instructions to the old folding function.
4. Kill flag is cleared for the src0 (DPP register) as it may be copied into more than one user.

Differential revision: https://reviews.llvm.org/D55444

llvm-svn: 350721
2019-01-09 13:43:32 +00:00
Stanislav Mekhanoshin ed0d6c60af Remove check for single use in ShrinkDemandedConstant
This removes check for single use from general ShrinkDemandedConstant
to the BE because of the AArch64 regression after D56289/rL350475.

After several hours of experiments I did not come up with a testcase
failing on any other targets if check is not performed.

Moreover, direct call to ShrinkDemandedConstant is not really needed
and superceed by SimplifyDemandedBits.

Differential Revision: https://reviews.llvm.org/D56406

llvm-svn: 350684
2019-01-09 02:24:22 +00:00
Matt Arsenault c765240060 AMDGPU/GlobalISel: Introduce vcc reg bank
I'm not entirely sure this is the correct thing
to do with the global isel philosophy, but I think
this is necessary to handle how differently SGPRs
are used normally vs. from a condition.

For example, it makes sense to allow a copy
from a VGPR to an SGPR, but it makes no sense
to allow a copy from VGPRs to SGPRs used as
select mask.

This avoids regbankselecting strange code with
a truncate feeding directly into a condition field.
Now a copy is forced from sgpr(s1) to vcc, which is
more sensible to handle.

Some of these issues could probably avoided with making enough
operations resulting in i1 illegal. I think we can't avoid
this register bank for legality.

For example, an i1 and where one source is from a truncate, and
one source is a compare needs some kind of copy inserted to
make sure both are in condition registers.

llvm-svn: 350611
2019-01-08 06:30:53 +00:00
Matt Arsenault a1515d2d33 AMDGPU/GlobalISel: Legalize concat_vectors
llvm-svn: 350598
2019-01-08 01:30:02 +00:00
Matt Arsenault adc40baa29 RegBankSelect: Fix copy insertion point for terminators
If a copy was needed to handle the condition of brcond, it was being
inserted before the defining instruction. Add tests for iterator edge
cases.

I find the existing code here suspect for the case where it's looking
for terminators that modify the register. It's going to insert a copy
in the middle of the terminators, which isn't allowed (it might be
necessary to have a COPY_terminator if anybody actually needs this).

Also legalize brcond for AMDGPU.

llvm-svn: 350595
2019-01-08 01:22:47 +00:00
Matt Arsenault ae6f1e07fc AMDGPU/GlobalISel: Disallow VGPR->SCC copies
This fixes using scalar adds when only the carry in is a VGPR
using greedy regbankselect.

llvm-svn: 350593
2019-01-08 01:13:20 +00:00
Matt Arsenault 68c668a5f3 AMDGPU/GlobalISel: RegBankSelect for carry-in
I'm not sure we should be allowing the truncate
to s1 for the inputs. It may be necessary to
create a new VCC reg bank.

llvm-svn: 350592
2019-01-08 01:09:09 +00:00
Matt Arsenault 2cc15b67b7 AMDGPU/GlobalISel: RegBankSelect for add/sub with carry out
llvm-svn: 350589
2019-01-08 01:03:58 +00:00
Matt Arsenault 299302fbe7 AMDGPU/GlobalISel: InstrMapping for G_UNMERGE_VALUES
llvm-svn: 350588
2019-01-08 00:46:19 +00:00
Craig Topper 826f44b550 [TargetLowering][AMDGPU] Remove the SimplifyDemandedBits function that takes a User and OpIdx. Stop using it in AMDGPU target for simplifyI24.
As we saw in D56057 when we tried to use this function on X86, it's unsafe. It allows the operand node to have multiple users, but doesn't prevent recursing past the first node when it does have multiple users. This can cause other simplifications earlier in the graph without regard to what bits are needed by the other users of the first node. Ideally all we should do to the first node if it has multiple uses is bypass it when its not needed by the user we started from. Doing any other transformation that SimplifyDemandedBits can do like turning ZEXT/SEXT into AEXT would result in an increase in instructions.

Fortunately, we already have a function that can do just that, GetDemandedBits. It will only make transformations that involve bypassing a node.

This patch changes AMDGPU's simplifyI24, to use a combination of GetDemandedBits to handle the multiple use simplifications. And then uses the regular SimplifyDemandedBits on each operand to handle simplifications allowed when the operand only has a single use. Unfortunately, GetDemandedBits simplifies constants more aggressively than SimplifyDemandedBits. This caused the -7 constant in the changed test to be simplified to remove the upper bits. I had to modify computeKnownBits to account for this by ignoring the upper 8 bits of the input.

Differential Revision: https://reviews.llvm.org/D56087

llvm-svn: 350560
2019-01-07 19:30:43 +00:00
Rhys Perry f77e2e8406 AMDGPU: test for uniformity of branch instruction, not its condition
Summary:
If a divergent branch instruction is marked as divergent by propagation
rule 2 in DivergencePropagator::exploreSyncDependency() and its condition
is uniform, that branch would incorrectly be assumed to be uniform.

Reviewers: arsenm, tstellar

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D56331

llvm-svn: 350532
2019-01-07 15:52:28 +00:00
Matt Arsenault 7a27c1886f AMDGPU: Remove v16i8 from register classes
llvm-svn: 350518
2019-01-07 13:31:55 +00:00
Matt Arsenault 369acb8470 AMDGPU: Remove VS/SV mappings from select
These would violate the constant bus restriction

llvm-svn: 350517
2019-01-07 13:21:36 +00:00
Alexander Timofeev 993e2798fd [AMDGPU] Fix scalar operand folding bug that causes SHOC performance regression.
Detailed description: SIFoldOperands::foldInstOperand iterates over the
operand uses calling the function that changes def-use iteratorson the
way. As a result loop exits immediately when def-use iterator is
changed. Hence, the operand is folded to the very first use instruction
only. This makes VGPR live along the whole basic block and increases
register pressure significantly. The performance drop observed in SHOC
DeviceMemory test is caused by this bug.

Proposed fix: collect uses to separate container for further processing
in another loop.

Testing: make check-llvm
SHOC performance test.

Reviewers: rampitec, ronlieb

Differential Revision: https://reviews.llvm.org/D56161

llvm-svn: 350350
2019-01-03 19:55:32 +00:00
Piotr Sobczak 3abef8f9ea [AMDGPU] Change section name with metadata access
Summary:
The commit rL348922 introduced a means to set Metadata
section kind for a global variable, if its explicit section
name was prefixed with ".AMDGPU.metadata.".

This patch changes that prefix to ".AMDGPU.comment.",
as "metadata" in the section name might lead to
ambiguity with metadata used by AMD PAL runtime.

Change-Id: Idd4748800d6fe801441d91595fc21e5a4171e668

Reviewers: kzhuravl

Reviewed By: kzhuravl

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D56197

llvm-svn: 350292
2019-01-03 11:22:58 +00:00
Piotr Sobczak 378131bae0 [AMDGPU] Handle OR as operand of raw load/store
Summary:
Use isBaseWithConstantOffset() which handles OR as an operand
to llvm.amdgcn.raw.buffer.load and llvm.amdgcn.raw.buffer.store.

Change-Id: Ifefb9dc5ded8710d333df07ab1900b230e33539a

Reviewers: nhaehnle, mareko, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55999

llvm-svn: 350208
2019-01-02 09:47:41 +00:00
Changpeng Fang 6f539294b5 AMDGPU: Don't peel of the offset if the resulting base could possibly be negative in Indirect addressing.
Summary:
  Don't peel of the offset if the resulting base could possibly be negative in Indirect addressing.
This is because the M0 field is of unsigned.

This patch achieves the similar goal as https://reviews.llvm.org/D55241, but keeps the optimization
if the base is known unsigned.

Reviewers:
  arsemn

Differential Revision:
  https://reviews.llvm.org/D55568

llvm-svn: 349951
2018-12-21 20:57:34 +00:00
Simon Pilgrim 3c157d3fa3 [AMDGPU] Always use the version of computeKnownBits that returns a value. NFCI.
Continues the work started by @bogner in rL340594 to remove uses of the KnownBits output paramater version.

llvm-svn: 349912
2018-12-21 15:29:47 +00:00
Matt Arsenault 3eae3c4590 AMDGPU/GlobalISel: RegBankSelect for amdgcn.wqm.vote
llvm-svn: 349882
2018-12-21 03:20:54 +00:00
Matt Arsenault f4c21c575a AMDGPU/GlobalISel: RegBankSelect for some fp ops
llvm-svn: 349880
2018-12-21 03:14:45 +00:00
Matt Arsenault bee2ad7185 AMDGPU/GlobalISel: Redo legality for build_vector
It seems better to avoid using the callback if possible since
there are coverage assertions which are disabled if this is used.

Also fix missing tests. Only test the legal cases since it seems
legalization for build_vector is quite lacking.

llvm-svn: 349878
2018-12-21 03:03:11 +00:00
Matt Arsenault 4339883710 AMDGPU: Make i1/i64/v2i32 and/or/xor legal
The 64-bit types do depend on the register bank,
but that's another issue to deal with later.

llvm-svn: 349716
2018-12-20 01:35:49 +00:00
Matt Arsenault 8cc98bee8a AMDGPU/GlobalISel: Fix ValueMapping tables for i1
This was incorrectly selecting SGPR for any i1 values,
e.g. G_TRUNC to i1 from a VGPR was still an SGPR.

llvm-svn: 349715
2018-12-20 01:33:43 +00:00
Matt Arsenault dff33c38e1 AMDGPU/GlobalISel: RegBankSelect for fp conversions
llvm-svn: 349709
2018-12-20 00:37:02 +00:00
Matt Arsenault 36d4092173 AMDGPU/GlobalISel: Legality/regbankselect for atomicrmw/atomic_cmpxchg
llvm-svn: 349708
2018-12-20 00:33:49 +00:00
Rhys Perry 3931ad38b9 AMDGPU: Add patterns for v4i16/v4f16 -> v4i16/v4f16 bitcasts
Reviewers: arsenm, tstellar

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55058

llvm-svn: 349694
2018-12-19 22:53:33 +00:00
Nicolai Haehnle 8d5e974076 AMDGPU: Use an ABS32_LO relocation for SCRATCH_RSRC_DWORD1
Summary:
Using HI here makes no logical sense, since the dword is only
32 bits to begin with.

Current Mesa master does not look at the relocation type at all,
so this change is fine. Future Mesa will rely on this, however.

Change-Id: I91085707834c4ac0370926602b93c94b90e44cb1

Reviewers: arsenm, rampitec, mareko

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55369

llvm-svn: 349620
2018-12-19 11:55:03 +00:00
Carl Ritson c521ac3a44 AMDGPU/InsertWaitcnts: Update VGPR/SGPR bounds when brackets are merged
Summary:
Fix an issue where VGPR/SGPR bounds are not properly extended when brackets are merged.
This manifests as missing waitcnt insertions when multiple brackets are forwarded to a successor block and the first forward has lower VGPR/SGPR bounds.

Irreducible loop test has been extended based on a CTS failure detected for GFX9.

Reviewers: nhaehnle

Reviewed By: nhaehnle

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D55602

llvm-svn: 349611
2018-12-19 10:17:49 +00:00
Matt Arsenault b110e2277c AMDGPU/GlobalISel: Regbankselect for fsub
llvm-svn: 349608
2018-12-19 09:07:58 +00:00
Farhana Aleen 59ee2c5362 [AMDGPU] Removed the unnecessary operand size-check-assert from processBaseWithConstOffset().
Summary: 32bit operand sizes are guaranteed by the opcode check AMDGPU::V_ADD_I32_e64 and
         AMDGPU::V_ADDC_U32_e64. Therefore, we don't any additional operand size-check-assert.

Author: FarhanaAleen
llvm-svn: 349529
2018-12-18 19:58:39 +00:00
Matt Arsenault c94e26c71d AMDGPU: Legalize/regbankselect frame_index
llvm-svn: 349468
2018-12-18 09:46:13 +00:00
Matt Arsenault c0ea221068 AMDGPU: Legalize/regbankselect fma
llvm-svn: 349467
2018-12-18 09:39:56 +00:00
Matt Arsenault e01e7c81f2 AMDGPU/GlobalISel: Legalize/regbankselect fneg/fabs/fsub
llvm-svn: 349463
2018-12-18 09:19:03 +00:00
Simon Pilgrim 9831d4058c Fix -Wunused-variable warning. NFCI.
llvm-svn: 349265
2018-12-15 12:25:22 +00:00
Florian Hahn abe32c9125 [SILoadStoreOptimizer] Use std::abs to avoid truncation.
Using regular abs() causes the following warning

error: absolute value function 'abs' given an argument of type 'int64_t' (aka 'long') but has parameter of type 'int' which may cause truncation of value [-Werror,-Wabsolute-value]
        (uint32_t)abs(Dist) > MaxDist) {
                  ^
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp:1369:19: note: use function 'std::abs' instead

which causes a bot to fail:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/18284/steps/bootstrap%20clang/logs/stdio

llvm-svn: 349224
2018-12-15 01:32:58 +00:00
Farhana Aleen ce095c564a [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions.
Summary: Promote constant offset to immediate by recomputing the relative 13bit offset from nearby instructions.
 E.g.
  s_movk_i32 s0, 0x1800
  v_add_co_u32_e32 v0, vcc, s0, v2
  v_addc_co_u32_e32 v1, vcc, 0, v6, vcc

  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[0:1], off
  =>
  s_movk_i32 s0, 0x1000
  v_add_co_u32_e32 v5, vcc, s0, v2
  v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
  global_load_dwordx2 v[5:6], v[5:6], off
  global_load_dwordx2 v[0:1], v[5:6], off offset:2048

Author: FarhanaAleen

Reviewed By: arsenm, rampitec

Subscribers: llvm-commits, AMDGPU

Differential Revision: https://reviews.llvm.org/D55539

llvm-svn: 349196
2018-12-14 21:13:14 +00:00
Aakanksha Patil bc568766b2 Revert r348971: [AMDGPU] Support for "uniform-work-group-size" attribute
This patch breaks RADV (and probably RadeonSI as well)

llvm-svn: 349084
2018-12-13 21:23:12 +00:00
Matt Arsenault 934e534c47 AMDGPU/GlobalISel: Legalize/regbankselect block_addr
llvm-svn: 349081
2018-12-13 20:34:15 +00:00
Matt Arsenault 577b9fc543 AMDGPU/GlobalISel: Legalize f64 fadd/fmul
llvm-svn: 349014
2018-12-13 08:27:48 +00:00
Matt Arsenault f38f483bef AMDGPU/GlobalISel: RegBankSelect some simple operations
llvm-svn: 349012
2018-12-13 08:23:51 +00:00
Stanislav Mekhanoshin d933c2ced7 [AMDGPU] Fix build failure, second attempt
Some compilers complain that variable is captured and some
complain when it is not. Switch to [&].

llvm-svn: 349006
2018-12-13 05:52:11 +00:00
Stanislav Mekhanoshin 5225746e03 [AMDGPU] Fix build failure
Fixed error 'lambda capture 'CondReg' is not required to be captured
for this use'.

llvm-svn: 349005
2018-12-13 05:21:25 +00:00
Stanislav Mekhanoshin 6071e1aa58 [AMDGPU] Simplify negated condition
Optimize sequence:

  %sel = V_CNDMASK_B32_e64 0, 1, %cc
  %cmp = V_CMP_NE_U32 1, %1
  $vcc = S_AND_B64 $exec, %cmp
  S_CBRANCH_VCC[N]Z
=>
  $vcc = S_ANDN2_B64 $exec, %cc
  S_CBRANCH_VCC[N]Z

It is the negation pattern inserted by DAGCombiner::visitBRCOND() in the
rebuildSetCC().

Differential Revision: https://reviews.llvm.org/D55402

llvm-svn: 349003
2018-12-13 03:17:40 +00:00
Aakanksha Patil 729309cc89 [AMDGPU] Support for "uniform-work-group-size" attribute
Updated the annotate-kernel-features pass to support the propagation of uniform-work-group attribute from the kernel to the called functions. Once this pass is run, all kernels, even the ones which initially did not have the attribute, will be able to indicate weather or not they have uniform work group size depending on the value of the attribute. 

Differential Revision: https://reviews.llvm.org/D50200

llvm-svn: 348971
2018-12-12 20:49:17 +00:00
Scott Linder f5b36e56fb [AMDGPU] Emit MessagePack HSA Metadata for v3 code object
Continue to present HSA metadata as YAML in ASM and when output by tools
(e.g. llvm-readobj), but encode it in Messagepack in the code object.

Differential Revision: https://reviews.llvm.org/D48179

llvm-svn: 348963
2018-12-12 19:39:27 +00:00
Neil Henning 76504a4c5e [AMDGPU] Extend the SI Load/Store optimizer to combine more things.
I've extended the load/store optimizer to be able to produce dwordx3
loads and stores, This change allows many more load/stores to be combined,
and results in much more optimal code for our hardware.

Differential Revision: https://reviews.llvm.org/D54042

llvm-svn: 348937
2018-12-12 16:15:21 +00:00
Piotr Sobczak 3732b4ce25 [AMDGPU] Set metadata access for explicit section
Summary:
This patch provides a means to set Metadata section kind
for a global variable, if its explicit section name is
prefixed with ".AMDGPU.metadata."
This could be useful to make the global variable go to
an ELF section without any section flags set.

Reviewers: dstuttard, tpr, kzhuravl, nhaehnle, t-tye

Reviewed By: dstuttard, kzhuravl

Subscribers: llvm-commits, arsenm, jvesely, wdng, yaxunl, t-tye

Differential Revision: https://reviews.llvm.org/D55267

llvm-svn: 348922
2018-12-12 11:20:04 +00:00
Amara Emerson 5ec146046c [GlobalISel] Restrict G_MERGE_VALUES capability and replace with new opcodes.
This patch restricts the capability of G_MERGE_VALUES, and uses the new
G_BUILD_VECTOR and G_CONCAT_VECTORS opcodes instead in the appropriate places.

This patch also includes AArch64 support for selecting G_BUILD_VECTOR of <4 x s32>
and <2 x s64> vectors.

Differential Revisions: https://reviews.llvm.org/D53629

llvm-svn: 348788
2018-12-10 18:44:58 +00:00
Neil Henning e448351b77 [AMDGPU] Change the l1 flush instruction for AMDPAL/MESA3D.
This commit changes which l1 flush instruction is used for AMDPAL and
MESA3d workloads to flush the entire l1 cache instead of just the
volatile lines.

Differential Revision: https://reviews.llvm.org/D55367

llvm-svn: 348771
2018-12-10 16:35:53 +00:00
Tim Corringham 2faadb15f4 [AMDGPU] Add new Mode Register pass - minor fix
Trivial change to add parentheses to an expression to avoid a
sanitizer error in SIModeRegister.cpp, which was committed earlier.

llvm-svn: 348767
2018-12-10 16:23:30 +00:00
Tim Corringham 4c4d2fe280 [AMDGPU] Add new Mode Register pass
A new pass to manage the Mode register.

Currently this just manages the floating point double precision
rounding requirements, but is intended to be easily extended to
encompass all Mode register settings.

The immediate motivation comes from the requirement to use the
round-to-zero rounding mode for the 16 bit interpolation
instructions, where the rounding mode setting is shared between
16 and 64 bit operations.

llvm-svn: 348754
2018-12-10 12:06:10 +00:00
Brian Gesiak b963c5150d [AMDGPU] Fix discarded result of addAttribute
Summary:
`llvm::AttributeList` and `llvm::AttributeSet` are immutable, and so methods
defined on these classes, such as `addAttribute`, return a new immutable
object with the attribute added. In https://reviews.llvm.org/D55217 I attempted
to annotate methods such as `addAttribute` with `LLVM_NODISCARD`, since
calling these methods has no side-effects, and so ignoring the result
that is returned is almost certainly a programmer error.

However, committing the change resulted in new warnings in the AMDGPU target.
The AMDGPU simplify libcalls pass added in https://reviews.llvm.org/D36436
attempts to add the readonly and nounwind attributes to simplified
library functions, but instead calls the `addAttribute` methods and
ignores the result.

Modify the simplify libcalls pass to actually add the nounwind and
readonly attributes. Also update the simplify libcalls test to assert
that these attributes are actually being set.

Reviewers: rampitec, vpykhtin, rnk

Reviewed By: rampitec

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D55435

llvm-svn: 348732
2018-12-09 21:56:50 +00:00
Matt Arsenault b5613ecf17 AMDGPU: Fix offsets for < 4-byte aggregate kernel arguments
We were still using the rounded down offset and alignment even though
they aren't handled because you can't trivially bitcast the loaded
value.

llvm-svn: 348658
2018-12-07 22:12:17 +00:00
Simon Pilgrim 44dfd81d01 Fix unused variable warning. NFCI.
llvm-svn: 348649
2018-12-07 21:44:25 +00:00
Matt Arsenault ce2e053134 AMDGPU: Allow f32 types for llvm.amdgcn.s.buffer.load
llvm-svn: 348625
2018-12-07 18:41:39 +00:00
Matt Arsenault ca8eb0b672 AMDGPU: Remove llvm.SI.tbuffer.store
llvm-svn: 348619
2018-12-07 18:03:47 +00:00
Matt Arsenault 3ff764a944 AMDGPU: Remove llvm.SI.buffer.load.dword
llvm-svn: 348616
2018-12-07 17:46:20 +00:00
Matt Arsenault aa9bcd56b1 AMDGPU: Remove llvm.AMDGPU.kill
This is the last of the old AMDGPU intrinsics.

llvm-svn: 348615
2018-12-07 17:46:16 +00:00
Graham Sellers b297379ef0 [AMDGPU] Shrink scalar AND, OR, XOR instructions
This change attempts to shrink scalar AND, OR and XOR instructions which take an immediate that isn't inlineable.

It performs:
AND s0, s0, ~(1 << n) -> BITSET0 s0, n
OR s0, s0, (1 << n) -> BITSET1 s0, n
AND s0, s1, x -> ANDN2 s0, s1, ~x
OR s0, s1, x -> ORN2 s0, s1, ~x
XOR s0, s1, x -> XNOR s0, s1, ~x

In particular, this catches setting and clearing the sign bit for fabs (and x, 0x7ffffffff -> bitset0 x, 31 and or x, 0x80000000 -> bitset1 x, 31).

llvm-svn: 348601
2018-12-07 15:33:21 +00:00
David Green ca29c271d2 [Targets] Add errors for tiny and kernel codemodel on targets that don't support them
Adds fatal errors for any target that does not support the Tiny or Kernel
codemodels by rejigging the getEffectiveCodeModel calls.

Differential Revision: https://reviews.llvm.org/D50141

llvm-svn: 348585
2018-12-07 12:10:23 +00:00
Nicolai Haehnle ca4a32945f AMDGPU: Generate VALU ThreeOp Integer instructions
Summary:
Original patch by: Fabian Wahlster <razor@singul4rity.com>

Change-Id: I148f692a88432541fad468963f58da9ddf79fac5

Reviewers: arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, b-sumner, llvm-commits

Differential Revision: https://reviews.llvm.org/D51995

llvm-svn: 348488
2018-12-06 14:33:40 +00:00
Valery Pykhtin f479fbba5f [AMDGPU] Partial revert of rL348371: Turn on the DPP combiner by default
Turn the combiner back off as there're failures until the issue is fixed.

Differential revision: https://reviews.llvm.org/D55314

llvm-svn: 348487
2018-12-06 14:20:02 +00:00
Valery Pykhtin 5b4db77b13 [AMDGPU]: Turn on the DPP combiner by default
Differential revision: https://reviews.llvm.org/D55314

llvm-svn: 348371
2018-12-05 15:21:17 +00:00
Matt Arsenault 0f414b81cc AMDGPU: Add f32 vectors to SGPR register classes
llvm-svn: 348286
2018-12-04 17:51:36 +00:00
Ron Lieberman 16de4fd2eb [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
The introduction of S_{ADD|SUB}_U64_PSEUDO instructions which are decomposed
into VOP3 instruction pairs for S_ADD_U64_PSEUDO:
  V_ADD_I32_e64
  V_ADDC_U32_e64
and for S_SUB_U64_PSEUDO
  V_SUB_I32_e64
  V_SUBB_U32_e64
preclude the use of SDWA to encode a constant.
SDWA: Sub-Dword addressing is supported on VOP1 and VOP2 instructions,
but not on VOP3 instructions.

We desire to fold the bit-and operand into the instruction encoding
for the V_ADD_I32 instruction. This requires that we transform the
VOP3 into a VOP2 form of the instruction (_e32).
  %19:vgpr_32 = V_AND_B32_e32 255,
      killed %16:vgpr_32, implicit $exec
  %47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
      %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
 %48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
      %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec

which then allows the SDWA encoding and becomes
  %47:vgpr_32 = V_ADD_I32_sdwa
      0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
      implicit-def $vcc, implicit $exec
  %48:vgpr_32 = V_ADDC_U32_e32
      0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec


Differential Revision: https://reviews.llvm.org/D54882

llvm-svn: 348132
2018-12-03 13:04:54 +00:00
Graham Sellers ba559ac058 [AMDGPU] Split 64-Bit XNOR to 64-Bit NOT/XOR
The identity ~(x ^ y) == (~x ^ y) == (x ^ ~y) allows XNOR (XOR/NOT) to turn into NOT/XOR. Handling this case with its own split means we can make the NOT remain in the scalar unit. Previously, we split 64-bit XNOR into two 32-bit XNOR, then lowered. Now, we get three instructions (s_not, v_xor, v_xor) rather than four in the case where either of the sources is a scalar 64-bit.

Add test cases to xnor.ll to attempt XNOR Vx, Sy and XNOR Sx, Vy. Also adding test that uses the opposite identity such that (~x ^ y) on the scalar unit (or vector for gfx906) can generate XNOR. This already worked, but I didn't see a test for it.

Differential: https://reviews.llvm.org/D55071
llvm-svn: 348075
2018-12-01 12:27:53 +00:00
Nicolai Haehnle a7b00058e0 AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19

Reviewers: arsenm, alex-t, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53283

llvm-svn: 348050
2018-11-30 22:55:38 +00:00
Nicolai Haehnle a9cc92c247 AMDGPU: Fix various issues around the VirtReg2Value mapping
Summary:
The VirtReg2Value mapping is crucial for getting consistently
reliable divergence information into the SelectionDAG. This
patch fixes a bunch of issues that lead to incorrect divergence
info and introduces tight assertions to ensure we don't regress:

1. VirtReg2Value is generated lazily; there were some cases where
   a lookup was performed before all relevant virtual registers were
   created, leading to an out-of-sync mapping. Those cases were:

  - Complex code to lower formal arguments that generated CopyFromReg
    nodes from live-in registers (fixed by never querying the mapping
    for live-in registers).

  - Code that generates CopyToReg for formal arguments that are used
    outside the entry basic block (fixed by never querying the
    mapping for Register nodes, which don't need the divergence info
    anyway).

2. For complex values that are lowered to a sequence of registers,
   all registers must be reflected in the VirtReg2Value mapping.

I am not adding any new tests, since I'm not actually aware of any
bugs that these problems are causing with trunk as-is. However,
I recently added a test case (in r346423) which fails when D53283 is
applied without this change. Also, the new assertions should provide
most of the effective test coverage.

There is one test change in sdwa-peephole.ll. The underlying issue
is that since the divergence info is now correct, the DAGISel will
select V_OR_B32 directly instead of S_OR_B32. This leads to an extra
COPY which affects the behavior of MachineLICM in a way that ends up
with the S_MOV_B32 with the constant in a different basic block than
the V_OR_B32, which is presumably what defeats the peephole.

Reviewers: alex-t, arsenm, rampitec

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54340

llvm-svn: 348049
2018-11-30 22:55:29 +00:00
Ron Lieberman f48e43bbf7 [AMDGPU] Disable SReg Global LD/ST, perf regression
Differential Revision: https://reviews.llvm.org/D55093

llvm-svn: 348014
2018-11-30 18:29:17 +00:00
Valery Pykhtin 3d9afa273f [AMDGPU] Combine DPP mov with use instructions (VOP1/2/3)
Introduces DPP pseudo instructions and the pass that combines DPP mov with subsequent uses.

Differential revision: https://reviews.llvm.org/D53762

llvm-svn: 347993
2018-11-30 14:21:56 +00:00
David Stuttard c6603861d8 Revert r347871 "Fix: Add support for TFE/LWE in image intrinsic"
Also revert fix r347876

One of the buildbots was reporting a failure in some relevant tests that I can't
repro or explain at present, so reverting until I can isolate.

llvm-svn: 347911
2018-11-29 20:14:17 +00:00
Graham Sellers 04f7a4d2d2 [AMDGPU] Add and update scalar instructions
This patch adds support for S_ANDN2, S_ORN2 32-bit and 64-bit instructions and adds splits to move them to the vector unit (for which there is no equivalent instruction). It modifies the way that the more complex scalar instructions are lowered to vector instructions by first breaking them down to sequences of simpler scalar instructions which are then lowered through the existing code paths. The pattern for S_XNOR has also been updated to apply inversion to one input rather than the output of the XOR as the result is equivalent and may allow leaving the NOT instruction on the scalar unit.

A new tests for NAND, NOR, ANDN2 and ORN2 have been added, and existing tests now hit the new instructions (and have been modified accordingly).

Differential: https://reviews.llvm.org/D54714
llvm-svn: 347877
2018-11-29 16:05:38 +00:00
David Stuttard 535c1af0bf Fix: Add support for TFE/LWE in image intrinsic
My change svn-id: 347871 caused a buildbot failure due to an unused
variable def (used in an assert).

Change-Id: Ia882d18bb6fa79b4d7bbfda422b9ea5d23eab336
llvm-svn: 347876
2018-11-29 15:56:36 +00:00
David Stuttard de02e4b1cc Add support for TFE/LWE in image intrinsics
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).

There's an additional fix now to avoid a dmask=0

For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.

Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.

The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:

%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
                                      i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1

Differential revision: https://reviews.llvm.org/D48826

Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
llvm-svn: 347871
2018-11-29 15:21:13 +00:00
Nicolai Haehnle 7bed696915 AMDGPU/InsertWaitcnts: Remove the dependence on MachineLoopInfo
Summary:
MachineLoopInfo cannot be relied on for correctness, because it cannot
properly recognize loops in irreducible control flow which can be
introduced by late machine basic block optimization passes. See the new
test case for the reduced form of an example that occurred in practice.

Use a simple fixpoint iteration instead.

In order to facilitate this change, refactor WaitcntBrackets so that it
only tracks pending events and registers, rather than also maintaining
state that is relevant for the high-level algorithm. Various accessor
methods can be removed or made private as a consequence.

Affects (in radv):
- dEQP-VK.glsl.loops.special.{for,while}_uniform_iterations.select_iteration_count_{fragment,vertex}

Fixes: r345719 ("AMDGPU: Rewrite SILowerI1Copies to always stay on SALU")

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54231

llvm-svn: 347853
2018-11-29 11:06:26 +00:00
Nicolai Haehnle ab43bf60fe AMDGPU/InsertWaitcnt: Consistently use uint32_t for scores / time points
Summary:
There is one obsolete reference to using -1 as an indication of "unknown",
but this isn't actually used anywhere.

Using unsigned makes robust wrapping checks easier.

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, llvm-commits, tpr, t-tye, hakzsam

Differential Revision: https://reviews.llvm.org/D54230

llvm-svn: 347852
2018-11-29 11:06:21 +00:00
Nicolai Haehnle f96456c611 AMDGPU/InsertWaitcnt: Remove unused WaitAtBeginning
Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54229

llvm-svn: 347851
2018-11-29 11:06:18 +00:00
Nicolai Haehnle d1f45dad84 AMDGPU/InsertWaitcnts: Simplify pending events tracking
Summary:
Instead of storing the "score" (last time point) of the various relevant
events, only store whether an event is pending or not.

This is sufficient, because whenever only one event of a count type is
pending, its last time point is naturally the upper bound of all time
points of this count type, and when multiple event types are pending,
the count type has gone out of order and an s_waitcnt to 0 is required
to clear any pending event type (and will then clear all pending event
types for that count type).

This also removes the special handling of GDS_GPR_LOCK and EXP_GPR_LOCK.
I do not understand what this special handling ever attempted to achieve.
It has existed ever since the original port from an internal code base,
so my best guess is that it solved a problem related to EXEC handling in
that internal code base.

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54228

llvm-svn: 347850
2018-11-29 11:06:14 +00:00
Nicolai Haehnle ae369d70c3 AMDGPU/InsertWaitcnts: Use foreach loops for inst and wait event types
Summary:
It hides the type casting ugliness, and I happened to have to add a new
such loop (in a later patch).

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54227

llvm-svn: 347849
2018-11-29 11:06:11 +00:00
Nicolai Haehnle 1a94cbb3f5 AMDGPU/InsertWaitcnts: Untangle some semi-global state
Summary:
Reduce the statefulness of the algorithm in two ways:

1. More clearly split generateWaitcntInstBefore into two phases: the
   first one which determines the required wait, if any, without changing
   the ScoreBrackets, and the second one which actually inserts the wait
   and updates the brackets.

2. Communicate pre-existing s_waitcnt instructions using an argument to
   generateWaitcntInstBefore instead of through the ScoreBrackets.

To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.

There are some functional changes:

1. The FIXME for the VCCZ bug workaround was implemented: we only wait for
   SMEM instructions as required instead of waiting on all counters.

2. We now properly track pre-existing waitcnt's in all cases, which leads
   to less conservative waitcnts being emitted in some cases.

     s_load_dword ...
     s_waitcnt lgkmcnt(0)    <-- pre-existing wait count
     ds_read_b32 v0, ...
     ds_read_b32 v1, ...
     s_waitcnt lgkmcnt(0)    <-- this is too conservative
     use(v0)
     more code
     use(v1)

   This increases code size a bit, but the reduced latency should still be a
   win in basically all cases. The worst code size regressions in my shader-db
   are:

 WORST REGRESSIONS - Code Size
 Before After     Delta Percentage
   1724  1736        12    0.70 %   shaders/private/f1-2015/1334.shader_test [0]
   2276  2284         8    0.35 %   shaders/private/f1-2015/1306.shader_test [0]
   4632  4640         8    0.17 %   shaders/private/ue4_elemental/62.shader_test [0]
   2376  2384         8    0.34 %   shaders/private/f1-2015/1308.shader_test [0]
   3284  3292         8    0.24 %   shaders/private/talos_principle/1955.shader_test [0]

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54226

llvm-svn: 347848
2018-11-29 11:06:06 +00:00
Francis Visoiu Mistrih d7eebd6d83 [CodeGen][NFC] Make `TII::getMemOpBaseImmOfs` return a base operand
Currently, instructions doing memory accesses through a base operand that is
not a register can not be analyzed using `TII::getMemOpBaseRegImmOfs`.

This means that functions such as `TII::shouldClusterMemOps` will bail
out on instructions using an FI as a base instead of a register.

The goal of this patch is to refactor all this to return a base
operand instead of a base register.

Then in a separate patch, I will add FI support to the mem op clustering
in the MachineScheduler.

Differential Revision: https://reviews.llvm.org/D54846

llvm-svn: 347746
2018-11-28 12:00:20 +00:00
Stanislav Mekhanoshin 443a7f9788 [AMDGPU] Disable DAG combine at -O0
Differential Revision: https://reviews.llvm.org/D54358

llvm-svn: 347659
2018-11-27 15:13:37 +00:00
Matt Arsenault 88ce3dcbc8 AMDGPU: Record SGPR spills when restoring too
It's possible in some cases to have a restore present
without a corresponding spill. Due to an apparent bug
in D54366 <https://reviews.llvm.org/D54366>, only the
restore for a register was emitted. It's probably
always a bug for this to happen, but due to how SGPR
spilling is implemented, this makes the issues appear
worse than it is.

llvm-svn: 347595
2018-11-26 21:28:40 +00:00
Matt Arsenault 105fc1a5f3 AMDGPU: Don't optimize exec masks at -O0
llvm-svn: 347573
2018-11-26 17:02:02 +00:00
Matt Arsenault 6384d9ea31 AMDGPU: Only add implicit super-reg def for first subreg
llvm-svn: 347572
2018-11-26 17:02:01 +00:00
Konstantin Zhuravlyov 700b1ef54d AMDGPU: Fix V_FMA_F16 selection on GFX9
GFX9 should select opsel version.

Differential Revision: https://reviews.llvm.org/D54545

llvm-svn: 347265
2018-11-19 21:10:16 +00:00
Stanislav Mekhanoshin 8bafbae889 [AMDGPU] Restored selection of scalar_to_vector (v2x16)
This works if DAG combiner is enabled, but without combining
we cannot select scalar_to_vector of <2 x half> and <2 x i16>.

Differential Revision: https://reviews.llvm.org/D54718

llvm-svn: 347259
2018-11-19 19:58:13 +00:00
Fangrui Song d83a5526d5 [AMDGPU] Fix -Wunused-variable
llvm-svn: 347234
2018-11-19 17:54:27 +00:00
Stanislav Mekhanoshin 054f8101f1 [AMDGPU] Convert insert_vector_elt into set of selects
This allows to avoid scratch use or indirect VGPR addressing for
small vectors.

Differential Revision: https://reviews.llvm.org/D54606

llvm-svn: 347231
2018-11-19 17:39:20 +00:00
David Stuttard be3d7ba9fb [AMDGPU] Derive GCNSubtarget from MF to get overridden target features
Summary:
AMDGPUAsmPrinter has a getSTI function that derives a GCNSubtarget from the
TM. However, this means that overridden target features are not detected and can
result in incorrect behaviour.

Switch to using STM which is a GCNSubtarget derived from the MF (used elsewhere
in the same function).

Change-Id: Ib6328ad667b7fcdc87e9c06344e59859207db9b0

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54301

llvm-svn: 347221
2018-11-19 15:44:20 +00:00
Nicolai Haehnle c548d91419 AMDGPU/InsertWaitcnts: Some more const-correctness
Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam

Differential Revision: https://reviews.llvm.org/D54225

llvm-svn: 347192
2018-11-19 12:03:11 +00:00
Matt Arsenault eabb8dd015 AMDGPU: Fix analyzeBranch failing with pseudoterminators
If a block had one of the _term instructions used for gluing
exec modifying instructions to the end of the block,
analyzeBranch would fail, preventing the verifier from catching
a broken successor list.

llvm-svn: 347027
2018-11-16 05:03:02 +00:00
Ron Lieberman cac749ac88 [AMDGPU] Add FixupVectorISel pass, currently Supports SREGs in GLOBAL LD/ST
Add a pass to fixup various vector ISel issues.
Currently we handle converting GLOBAL_{LOAD|STORE}_*
and GLOBAL_Atomic_* instructions into their _SADDR variants.
This involves feeding the sreg into the saddr field of the new instruction.

llvm-svn: 347008
2018-11-16 01:13:34 +00:00
Ron Lieberman 2f5683e6b0 [AMDGPU] NFC Test commit
llvm-svn: 347002
2018-11-16 00:46:51 +00:00
Konstantin Zhuravlyov af7b5d7092 AMDHSA: More code object v3 fixes:
- Make sure IsaInfo::hasCodeObjectV3 returns true only
    for AMDHSA
  - Update assembler metadata tests to use v2 by default

llvm-svn: 347001
2018-11-15 23:14:23 +00:00
Konstantin Zhuravlyov a25e0524c0 AMDGPU: Enable code object v3 for AMDHSA only
Differential Revision: https://reviews.llvm.org/D54186

llvm-svn: 346923
2018-11-15 02:32:43 +00:00
Aakanksha Patil 1a60116b5c AMDGPU: Additional pattern for i16 median3 matching
min(max(a, b), max(min(a, b), c))

Differential Revision: https://reviews.llvm.org/D54494

llvm-svn: 346886
2018-11-14 20:10:41 +00:00
Stanislav Mekhanoshin bcb34ac2ea [AMDGPU] combine extractelement into several selects
An extractelement with non-constant index will be lowered either to
scratch or movrel loop in most cases. This patch converts such
instruction into a set of selects if vector size is not too big.

Differential Revision: https://reviews.llvm.org/D54351

llvm-svn: 346800
2018-11-13 21:18:21 +00:00
Aakanksha Patil a992c694c6 AMDGPU: Adding more median3 patterns
min(max(a, b), max(min(a, b), c)) -> med3 a, b, c

Differential Revision: https://reviews.llvm.org/D54331

llvm-svn: 346704
2018-11-12 21:04:06 +00:00
Stanislav Mekhanoshin e86c8d33b1 [AMDGPU] Optimize S_CBRANCH_VCC[N]Z -> S_CBRANCH_EXEC[N]Z
Sometimes after basic block placement we end up with a code like:

  sreg = s_mov_b64 -1
  vcc = s_and_b64 exec, sreg
  s_cbranch_vccz

This happens as a join of a block assigning -1 to a saved mask and
another block which consumes that saved mask with s_and_b64 and a
branch.

This is essentially a single s_cbranch_execz instruction when moved
into a single new basic block.

Differential Revision: https://reviews.llvm.org/D54164

llvm-svn: 346690
2018-11-12 18:48:17 +00:00
Sanjay Patel 0a515595a7 [x86] allow vector load narrowing with multi-use values
This is a long-awaited follow-up suggested in D33578. Since then, we've picked up even more
opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs.
Apart from 2-3 strange cases, these are all wins.

I've structured this to be no-functional-change-intended for any target except for x86
because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those
targets have existing regression tests (4, 4, 10 files respectively) that would be
affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show
any regression test diffs. The trade-off is deciding if an extra vector load is better
than a single wide load + extract_subvector.

For x86, this is almost always better (on paper at least) because we often can fold
loads into subsequent ops and not increase the official instruction count. There's also
some unknown -- but potentially large -- benefit from using narrower vector ops if wide
ops are implemented with multiple uops and/or frequency throttling is avoided.

Differential Revision: https://reviews.llvm.org/D54073

llvm-svn: 346595
2018-11-10 20:05:31 +00:00
Stanislav Mekhanoshin 13d3371e68 [AMDGPU] Always pass TRI into findRegister[Use/Def]OperandIdx
This only covers AMDGPU BE, hopefully all occurrences.

Differential Revision: https://reviews.llvm.org/D54235

llvm-svn: 346528
2018-11-09 17:58:59 +00:00
Stanislav Mekhanoshin 6cc8b2fc65 [AMDGPU] Extend promote alloca vectorization
Promote alloca can vectorize a small array by bitcasting it to a
vector type. Extend vectorization for the case when alloca is
already a vector type. We still want to replace GEPs with an
insert/extract element instructions in this case.

Differential Revision: https://reviews.llvm.org/D54219

llvm-svn: 346376
2018-11-08 00:16:23 +00:00
Nicolai Haehnle bc233f5523 Revert "AMDGPU: Divergence-driven selection of scalar buffer load intrinsics"
This reverts commit r344696 for now (except for some test additions).

See https://bugs.freedesktop.org/show_bug.cgi?id=108611.

llvm-svn: 346364
2018-11-07 21:53:43 +00:00
Nicolai Haehnle 61396ff67c AMDGPU/InsertWaitcnts: Cleanup some old cruft (NFCI)
Summary: Remove redundant logic and simplify control flow.

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54086

llvm-svn: 346363
2018-11-07 21:53:36 +00:00
Nicolai Haehnle 0ab31c9c44 AMDGPU/InsertWaitcnts: Remove kill-related logic
Summary:
This is not needed, because we don't actually insert relevant branches
for KILLs that late in the compilation flow.

Besides, this was always checking for the wrong kill opcode anyway...

Reviewers: msearles, rampitec, scott.linder, kanarayan

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D54085

llvm-svn: 346362
2018-11-07 21:53:29 +00:00
Konstantin Zhuravlyov 15e90e331c AMDGPU/NFC: Split FLAT_Global_Atomic_Pseudo into RTN/NO_RTN multiclasses
llvm-svn: 346361
2018-11-07 21:42:13 +00:00
Konstantin Zhuravlyov 7f1959ebb3 AMDGPU/NFC: Split MUBUF_Pseudo_Atomics into RTN/NO_RTN multiclasses
llvm-svn: 346357
2018-11-07 21:21:32 +00:00
Matt Arsenault 8ba740a5a8 Allow subclassing ExternalAA
This allows testing AMDGPU alias analysis like any
other alias analysis pass. This fixes the existing
test pointlessly running opt -O3 when it really
just wants to run the one analysis.

Before there was no way to test this using -aa-eval
with opt, since the default constructed pass
is run. The wrapper subclass allows the
default constructor to pass the necessary callback.

llvm-svn: 346353
2018-11-07 20:26:42 +00:00
Sanjay Patel de58e93666 fix typos aggressively; NFC
llvm-svn: 346316
2018-11-07 14:35:36 +00:00
Yaxun Liu 73bf0af32f AMDGPU: Add an option -disable-promote-alloca-to-lds
Add this option for debugging and providing workaround.

By default it is off so no behavior change in backend.

Differential Revision: https://reviews.llvm.org/D54158

llvm-svn: 346267
2018-11-06 21:28:17 +00:00
Craig Topper 0b5f8169b0 [TargetLowering] Change TargetLoweringBase::getPreferredVectorAction to take an MVT instead of an EVT. NFC
The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit.

llvm-svn: 346180
2018-11-05 23:26:13 +00:00
Konstantin Zhuravlyov 108927b944 AMDGPU: Add sram-ecc feature
Differential Revision: https://reviews.llvm.org/D53222

llvm-svn: 346177
2018-11-05 22:44:19 +00:00
Neil Henning 233a02d0ed [AMDGPU] Fix the new atomic optimizer in pixel shaders.
The new atomic optimizer I previously added in D51969 did not work
correctly when a pixel shader was using derivatives, and had helper
lanes active.

To fix this we add an llvm.amdgcn.ps.live call that guards a branch
around the entire atomic operation - ensuring that all helper lanes are
inactive within the wavefront when we compute our atomic results.

I've added a test case that can cause derivatives, and exposes the
problem.

Differential Revision: https://reviews.llvm.org/D53930

llvm-svn: 346128
2018-11-05 12:04:48 +00:00
Sylvestre Ledru df92dabaef Fixed inclusion of M_PI fow MinGW-w64
Patch by KOLANICH

llvm-svn: 346000
2018-11-02 17:25:40 +00:00
Neil Henning 7d1b77df57 [AMDGPU] UBSan bug fix for r345710
UBSan detected an error in our ISelLowering that is exposed only when
you have a dmask == 0x1. Fix this by adding in an explicit check to
ensure we don't do the UBSan detected shl << 32.

llvm-svn: 345962
2018-11-02 10:24:57 +00:00
Matt Arsenault 8e0269ba0b AMDGPU: Fix assertion with bitcast from i64 constant to v4i16
llvm-svn: 345922
2018-11-02 02:43:55 +00:00
Farhana Aleen 5853762e5a [AMDGPU] Handle the idot8 pattern generated by FE.
Summary: Different variants of idot8 codegen dag patterns are not generated by llvm-tablegen due to a huge
         increase in the compile time. Support the pattern that clang FE generates after reordering the
         additions in integer-dot8 source language pattern.

Author: FarhanaAleen

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D53937

llvm-svn: 345902
2018-11-01 22:48:19 +00:00
Reid Kleckner 4dc0b1ac60 Fix clang -Wimplicit-fallthrough warnings across llvm, NFC
This patch should not introduce any behavior changes. It consists of
mostly one of two changes:
1. Replacing fall through comments with the LLVM_FALLTHROUGH macro
2. Inserting 'break' before falling through into a case block consisting
   of only 'break'.

We were already using this warning with GCC, but its warning behaves
slightly differently. In this patch, the following differences are
relevant:
1. GCC recognizes comments that say "fall through" as annotations, clang
   doesn't
2. GCC doesn't warn on "case N: foo(); default: break;", clang does
3. GCC doesn't warn when the case contains a switch, but falls through
   the outer case.

I will enable the warning separately in a follow-up patch so that it can
be cleanly reverted if necessary.

Reviewers: alexfh, rsmith, lattner, rtrieu, EricWF, bollu

Differential Revision: https://reviews.llvm.org/D53950

llvm-svn: 345882
2018-11-01 19:54:45 +00:00
Stanislav Mekhanoshin 222e9c11f7 Check shouldReduceLoadWidth from SimplifySetCC
SimplifySetCC could shrink a load without checking for
profitability or legality of such shink with a target.

Added checks to prevent shrinking of aligned scalar loads
in AMDGPU below dword as scalar engine does not support it.

Differential Revision: https://reviews.llvm.org/D53846

llvm-svn: 345778
2018-10-31 21:24:30 +00:00
Scott Linder c6c627253d [AMDGPU] Remove FeatureVGPRSpilling
This feature is only relevant to shaders, and is no longer used. When disabled,
lowering of reserved registers for shaders causes a compiler crash.

Remove the feature and add a test for compilation of shaders at OptNone.

Differential Revision: https://reviews.llvm.org/D53829

llvm-svn: 345763
2018-10-31 18:54:06 +00:00
Nicolai Haehnle 814abb59df AMDGPU: Rewrite SILowerI1Copies to always stay on SALU
Summary:
Instead of writing boolean values temporarily into 32-bit VGPRs
if they are involved in PHIs or are observed from outside a loop,
we use bitwise masking operations to combine lane masks in a way
that is consistent with wave control flow.

Move SIFixSGPRCopies to before this pass, since that pass
incorrectly attempts to move SGPR phis to VGPRs.

This should recover most of the code quality that was lost with
the bug fix in "AMDGPU: Remove PHI loop condition optimization".

There are still some relevant cases where code quality could be
improved, in particular:

- We often introduce redundant masks with EXEC. Ideally, we'd
  have a generic computeKnownBits-like analysis to determine
  whether masks are already masked by EXEC, so we can avoid this
  masking both here and when lowering uniform control flow.

- The criterion we use to determine whether a def is observed
  from outside a loop is conservative: it doesn't check whether
  (loop) branch conditions are uniform.

Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b

Reviewers: arsenm, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, mgorny, yaxunl, dstuttard, t-tye, eraman, llvm-commits

Differential Revision: https://reviews.llvm.org/D53496

llvm-svn: 345719
2018-10-31 13:27:08 +00:00
Nicolai Haehnle 28212cc689 AMDGPU: Remove PHI loop condition optimization
Summary:
The optimization to early break out of loops if all threads are dead was
never fully implemented.

But the PHI node analyzing is actually causing a number of problems, so
remove all the extra code for it.

(This does actually regress code quality in a few places because it
 ends up relying more heavily on phi's of i1, which we don't do a
 great job with. However, since it fixes real bugs in the wild, we
 should take this change. I have some prototype changes to improve
 i1 lowering in general -- not just for control flow -- which should
 help recover the code quality, I just need to make those changes
 fit for general consumption. -- Nicolai)

Change-Id: I6fc6c6c8961857ac6009fcfb9f7e5e48dc23fbb1
Patch-by: Christian König <christian.koenig@amd.com>

Reviewers: arsenm, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53359

llvm-svn: 345718
2018-10-31 13:26:48 +00:00
Neil Henning 63718b214a [AMDGPU] support image load/store a16
Our a16 support was only enabled for sample/gather and buffer
load/store, but not for image load/store operations (which take an i16
as the pixel index rather than a half).

Fix our isel lowering and add test cases to prove it out.

Differential Revision: https://reviews.llvm.org/D53750

llvm-svn: 345710
2018-10-31 10:34:48 +00:00
Konstantin Zhuravlyov 2d22d24ac4 Revert r345542: AMDGPU: Enable code object v3 by default
It breaks mesa.

llvm-svn: 345662
2018-10-30 22:02:40 +00:00
Simon Pilgrim 858303b827 [SelectionDAG] Add FoldBUILD_VECTOR to simplify new BUILD_VECTOR nodes
Similar to FoldCONCAT_VECTORS, this patch adds FoldBUILD_VECTOR to simplify cases that can avoid the creation of the BUILD_VECTOR - if all the operands are UNDEF or if the BUILD_VECTOR simplifies to a copy.

This exposed an assumption in some AMDGPU code that getBuildVector was guaranteed to be a BUILD_VECTOR node that I've tried to handle.	
	
Differential Revision: https://reviews.llvm.org/D53760

llvm-svn: 345578
2018-10-30 10:32:11 +00:00
Matt Arsenault abc4f29f9c AMDGPU: Remove custom BUILD_VECTOR combine
This was looping in a testcase and removing it
now slightly improves a test.

llvm-svn: 345560
2018-10-30 01:37:59 +00:00
Matt Arsenault b0b741efb8 AMDGPU: Use scavengeRegisterBackwards
llvm-svn: 345559
2018-10-30 01:33:14 +00:00
Konstantin Zhuravlyov 5cb950200c AMDGPU: Enable code object v3 by default
Differential Revision: https://reviews.llvm.org/D53525

llvm-svn: 345542
2018-10-29 21:07:27 +00:00
Stanislav Mekhanoshin 6b1c6548bd [AMDGPU] Fixed return value causing warning and regression
llvm-svn: 345518
2018-10-29 17:53:23 +00:00
Stanislav Mekhanoshin 79080ecd82 [AMDGPU] Match v_swap_b32
Differential Revision: https://reviews.llvm.org/D52677

llvm-svn: 345514
2018-10-29 17:26:01 +00:00
Scott Linder 11ef7984b0 [AMDGPU] Add a pass to promote bitcast calls
AMDGPU currently only supports direct calls, but at lower optimisation levels it
fails to lower statically direct calls which appear indirect due to a bitcast.

Add a pass to visit all CallSites and use CallPromotionUtils to "devirtualize"
calls.

Differential Revision: https://reviews.llvm.org/D52741

llvm-svn: 345382
2018-10-26 13:18:36 +00:00
Tim Renouf 2a1b1d94b6 [AMDGPU] Defined gfx909 Raven Ridge 2
Differential Revision: https://reviews.llvm.org/D53418

Change-Id: Ie3d054f2e956c2768988c0f4c0ffd29a47294eef
llvm-svn: 345120
2018-10-24 08:14:07 +00:00
Matt Arsenault 687ec75d10 DAG: Change behavior of fminnum/fmaxnum nodes
Introduce new versions that follow the IEEE semantics
to help with legalization that may need quieted inputs.

There are some regressions from inserting unnecessary
canonicalizes when these are matched from fast math
fcmp + select which should be fixed in a future commit.

llvm-svn: 344914
2018-10-22 16:27:27 +00:00
Changpeng Fang f95f763ea5 AMDGPU: Add support pattern for SUB of one bit
Summary:
  Add selection patterns to support one bit Sub.

Reviewers:
  rampitec, arsenm

Differential Revision:
  https://reviews.llvm.org/D52946

llvm-svn: 344815
2018-10-19 21:09:21 +00:00
Nicolai Haehnle 4821937d2e AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI
Summary:
To workaround a hardware issue in the (base + offset) calculation
when base is negative. The impact on code quality should be limited
since SILoadStoreOptimizer still runs afterwards and is able to
combine loads/stores based on known sign information.

This fixes visible corruption in Hitman on SI (easily reproducible
by running benchmark mode).

Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923

Reviewers: arsenm, mareko

Subscribers: jholewinski, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53160

llvm-svn: 344698
2018-10-17 15:37:48 +00:00
Nicolai Haehnle c4a2ff0950 AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
Summary:
Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19

Reviewers: arsenm, alex-t, rampitec, tpr

Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53283

llvm-svn: 344696
2018-10-17 15:37:30 +00:00
Nicolai Haehnle e9b134aa31 AMDGPU: Remove dead TableGen code
Summary: Change-Id: Ic1f2c1d0cf9e90a0baa9fc6bacd0d3c386069fb0

Reviewers: tpr

Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53318

Change-Id: Ib4d143c898801e5cf6cb9999a495d62c91ae77fb
llvm-svn: 344691
2018-10-17 12:14:26 +00:00
Konstantin Zhuravlyov 94dfcc2eb2 AMDGPU: Generate .amdgcn_target for object code v3
Differential Revision: https://reviews.llvm.org/D53221

llvm-svn: 344552
2018-10-15 20:37:47 +00:00
Chandler Carruth edb12a838a [TI removal] Make variables declared as `TerminatorInst` and initialized
by `getTerminator()` calls instead be declared as `Instruction`.

This is the biggest remaining chunk of the usage of `getTerminator()`
that insists on the narrow type and so is an easy batch of updates.
Several files saw more extensive updates where this would cascade to
requiring API updates within the file to use `Instruction` instead of
`TerminatorInst`. All of these were trivial in nature (pervasively using
`Instruction` instead just worked).

llvm-svn: 344502
2018-10-15 10:04:59 +00:00
Tom Stellard a894043910 Revert "AMDGPU/GlobalISel: Implement select for G_INSERT"
This reverts commit r344310.

The test case was failing on some bots.

llvm-svn: 344317
2018-10-11 23:36:46 +00:00
Tom Stellard 4733be6e7b AMDGPU/GlobalISel: Implement select for G_INSERT
Reviewers: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D53116

llvm-svn: 344310
2018-10-11 22:49:54 +00:00
Scott Linder 823549a6ec [AMDGPU] Legalize VGPR Rsrc operands for MUBUF instructions
Emit a waterfall loop in the general case for a potentially-divergent Rsrc
operand. When practical, avoid this by using Addr64 instructions.

Recommits r341413 with changes to update the MachineDominatorTree when present.

Differential Revision: https://reviews.llvm.org/D51742

llvm-svn: 343992
2018-10-08 18:47:01 +00:00
Tom Stellard 14d8807d9a AMDGPU/GlobalISel: Select amdgcn.cvt.pkrtz to 64-bit instructions
Summary: The 32-bit variants do not exist on VI+.

Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52958

llvm-svn: 343985
2018-10-08 17:49:29 +00:00
Neil Henning 6641657453 [AMDGPU] Add an AMDGPU specific atomic optimizer.
This commit adds a new IR level pass to the AMDGPU backend to perform
atomic optimizations. It works by:

- Running through a function and finding atomicrmw add/sub or uses of
  the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform,
  record the value to be optimized.
- Run through the atomic operations we can optimize and, depending on
  whether the value is uniform/divergent use wavefront wide operations
  (DPP in the divergent case) to calculate the total amount to be
  atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic
  operation, reducing the total number of atomic operations in flight.
- Lastly we recombine the result from the single lane to each lane of
  the wavefront, and calculate our individual lanes offset into the
  final result.

Differential Revision: https://reviews.llvm.org/D51969

llvm-svn: 343973
2018-10-08 15:49:19 +00:00
Neil Henning 57f5d0a885 [IRBuilder] Fixup CreateIntrinsic to allow specifying Types to Mangle.
The IRBuilder CreateIntrinsic method wouldn't allow you to specify the
types that you wanted the intrinsic to be mangled with. To fix this
I've:

- Added an ArrayRef<Type *> member to both CreateIntrinsic overloads.
- Used that array to pass into the Intrinsic::getDeclaration call.
- Added a CreateUnaryIntrinsic to replace the most common use of
  CreateIntrinsic where the type was auto-deduced from operand 0.
- Added a bunch more unit tests to test Create*Intrinsic calls that
  weren't being tested (including the FMF flag that wasn't checked).

This was suggested as part of the AMDGPU specific atomic optimizer
review (https://reviews.llvm.org/D51969).

Differential Revision: https://reviews.llvm.org/D52087

llvm-svn: 343962
2018-10-08 10:32:33 +00:00
Tom Stellard 251ee083a3 AMDGPU: Consolidate SMRD TableGen patterns
Summary:
Merge the SMRD patterns for CI into the same multiclass as the
patterns for other sub-targets.

This removes some duplicate code and will make it easier for some
future GlobalISel changes I would like to do.

Reviewers: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52557

llvm-svn: 343909
2018-10-06 03:32:43 +00:00
Jonas Paulsson faad1b3056 [TargetRegisterInfo] Remove temporary hook enableMultipleCopyHints()
Finally all targets are enabling multiple regalloc hints, so the hook to
disable this can now be removed.

NFC.

Review: Simon Pilgrim
https://reviews.llvm.org/D52316

llvm-svn: 343851
2018-10-05 14:23:11 +00:00
Tom Stellard 7c65078f04 AMDGPU/GlobalISel: Add support for G_INTTOPTR
Summary: This is a no-op.

Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D52916

llvm-svn: 343839
2018-10-05 04:34:09 +00:00
Konstantin Zhuravlyov aa067cb9fb AMDGPU: Rename isAmdCodeObjectV2 -> isAmdHsaOrMesa
The isAmdCodeObjectV2 is a misleading name which actually checks whether the os
is amdhsa or mesa.

Also add a test to make sure we do not generate old kernel header for code
object v3.

Differential Revision: https://reviews.llvm.org/D52897

llvm-svn: 343813
2018-10-04 21:02:16 +00:00
Farhana Aleen 4bc597bff5 [AMDGPU] Match signed dot4/8 pattern.
Summary: This patch matches signed dot4 and dot8 pattern.

Author: FarhanaAleen

Reviewed By: msearles

Differential Revision: https://reviews.llvm.org/D52520

llvm-svn: 343798
2018-10-04 16:57:37 +00:00
Tim Renouf a37679d67b [AMDGPU] Fix for negative offsets in buffer/tbuffer intrinsics
Summary:
The new buffer/tbuffer intrinsics handle an out-of-range immediate
offset by moving/adding offset&-4096 to a vgpr, leaving an in-range
immediate offset, with a chance of the move/add being CSEd for similar
loads/stores.

However it turns out that a negative offset in a vgpr is illegal, even
if adding the immediate offset makes it legal again.

Therefore, this commit disables the offset&-4096 thing if the offset is
negative.

Differential Revision: https://reviews.llvm.org/D52683

Change-Id: Ie02f0a74f240a138dc2a29d17cfbd9e350e4ed13
llvm-svn: 343672
2018-10-03 10:29:43 +00:00
Fangrui Song 3d76d36059 [AMDGPU] Rename pass "isel" to "amdgpu-isel"
Summary: The AMDGPU target specific pass "isel" is a misleading name.

Reviewers: tstellar, echristo, javed.absar, arsenm

Reviewed By: arsenm

Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, kristof.beyls, llvm-commits

Differential Revision: https://reviews.llvm.org/D52759

llvm-svn: 343659
2018-10-03 03:38:22 +00:00
Matt Arsenault 635d479322 AMDGPU: Always run AMDGPUAlwaysInline
Even if calls are enabled, it still needs to be run
for forcing inline of functions that use LDS.

llvm-svn: 343657
2018-10-03 02:47:25 +00:00
Stanislav Mekhanoshin 1821513e2f [AMDGPU] Assert in getOpSize() there are no sub-dword subregs
Differential Revision: https://reviews.llvm.org/D52769

llvm-svn: 343648
2018-10-03 00:00:41 +00:00
Matt Arsenault ab41193312 AMDGPU: Expand atomicrmw nand in IR
llvm-svn: 343559
2018-10-02 03:50:56 +00:00
Stanislav Mekhanoshin ae8bd6d9b5 [AMDGPU] Fixed SIInstrInfo::getOpSize to handle subregs
Currently it returns incorrect operand size for a target independet
node such as COPY if operand is a register with subreg. Instead of
correct subreg size it returns a size of the whole superreg.

Differential Revision: https://reviews.llvm.org/D52736

llvm-svn: 343508
2018-10-01 18:00:02 +00:00
Alexander Timofeev b048fa3344 [AMDGPU] Divergence driven instruction selection. Shift operations.
Summary: This change enables VOP3 shifts to be explicitly selected
         dependent on the divergence.

Differential Revision: https://reviews.llvm.org/D52559

Reviewers: rampitec
llvm-svn: 343455
2018-10-01 11:06:35 +00:00
Vitaly Buka 0509070811 [cxx2a] Fix warning triggered by r343285
llvm-svn: 343369
2018-09-29 02:17:12 +00:00
Konstantin Zhuravlyov 5f1b8181ad AMDGPU: Split HasExt into HasExtDPP/SDWA/SDWA9
llvm-svn: 343264
2018-09-27 20:49:00 +00:00
Konstantin Zhuravlyov 9da26b20da AMDGPU: Split VOP2Inst into VOP2Inst_e32/e64/sdwa
llvm-svn: 343259
2018-09-27 19:46:41 +00:00