Commit Graph

67 Commits

Author SHA1 Message Date
Benjamin Kramer f9ab3ddb8f [AMDGPU] Clean up symbols in the global namespace.
llvm-svn: 317051
2017-10-31 23:21:30 +00:00
Yaxun Liu de4b88d9a1 [AMDGPU] Lower enqueued blocks and generate runtime metadata
This patch adds a post-linking pass which replaces the function pointer of enqueued
block kernel with a global variable (runtime handle) and adds
runtime-handle attribute to the enqueued block kernel.

In LLVM CodeGen the runtime-handle metadata will be translated to
RuntimeHandle metadata in code object. Runtime allocates a global buffer
for each kernel with RuntimeHandel metadata and saves the kernel address
required for the AQL packet into the buffer. __enqueue_kernel function
in device library knows that the invoke function pointer in the block
literal is actually runtime handle and loads the kernel address from it
and puts it into AQL packet for dispatching.

This cannot be done in FE since FE cannot create a unique global variable
with external linkage across LLVM modules. The global variable with internal
linkage does not work since optimization passes will try to replace loads
of the global variable with its initialization value.

Differential Revision: https://reviews.llvm.org/D38610

llvm-svn: 315352
2017-10-10 19:39:48 +00:00
Stanislav Mekhanoshin 1d8cf2be89 [AMDGPU] Set fast-math flags on functions given the options
We have a single library build without relaxation options.
When inlined library functions remove fast math attributes
from the functions they are integrated into.

This patch sets relaxation attributes on the functions after
linking provided corresponding relaxation options are given.
Math instructions inside the inlined functions remain to have
no fast flags, but inlining does not prevent fast math
transformations of a surrounding caller code anymore.

Differential Revision: https://reviews.llvm.org/D38325

llvm-svn: 314568
2017-09-29 23:40:19 +00:00
Stanislav Mekhanoshin 5670e6d482 [AMDGPU] Port of HSAIL inliner
Differential Revision: https://reviews.llvm.org/D36849

llvm-svn: 313714
2017-09-20 04:25:58 +00:00
Stanislav Mekhanoshin 7f37794ebd [AMDGPU] Ported and adopted AMDLibCalls pass
The pass does simplifications of well known AMD library calls.
If given -amdgpu-prelink option it works in a pre-link mode which
allows to reference new library functions which will be linked in
later.

In addition it also used to process traditional AMD option
-fuse-native which allows to replace some of the functions with
their fast native implementations from the library.

The necessary glue to pass the prelink option and translate
-fuse-native is to be added to the driver.

Differential Revision: https://reviews.llvm.org/D36436

llvm-svn: 310731
2017-08-11 16:42:09 +00:00
Tom Stellard 20287697f8 AMDGPU: Move R600 parts of AMDGPUISelDAGToDAG into their own class
Summary: This refactoring is required in order to split the R600 and GCN tablegen files.

Reviewers: arsenm

Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, llvm-commits, t-tye

Differential Revision: https://reviews.llvm.org/D36286

llvm-svn: 310336
2017-08-08 04:57:55 +00:00
Matt Arsenault 3db456820d AMDGPU: Remove FixControlFlowLiveIntervals pass
This hasn't done anything in a long time. This was
running after the the control flow pseudos were expanded,
so this would never find them. The control flow pseudo
expansion was moved to solve the problem this pass was
supposed to solve in the first place, except handling
it earlier also fixes it for fast regalloc which doesn't
use LiveIntervals.

Noticed by checking LCOV reports.

llvm-svn: 310274
2017-08-07 18:12:47 +00:00
Connor Abbott 92638ab625 [AMDGPU] Add support for Whole Wavefront Mode
Summary:
Whole Wavefront Wode (WWM) is similar to WQM, except that all of the
lanes are always enabled, regardless of control flow. This is required
for implementing wavefront reductions in non-uniform control flow, where
we need to use the inactive lanes to propagate intermediate results, so
they need to be enabled. We need to propagate WWM to uses (unless
they're explicitly marked as exact) so that they also propagate
intermediate results correctly. We do the analysis and exec mask munging
during the WQM pass, since there are interactions with WQM for things
that require both WQM and WWM. For simplicity, WWM is entirely
block-local -- blocks are never WWM on entry or exit of a block, and WWM
is not propagated to the block level.  This means that computations
involving WWM cannot involve control flow, but we only ever plan to use
WWM for a few limited purposes (none of which involve control flow)
anyways.

Shaders can ask for WWM using the @llvm.amdgcn.wwm intrinsic. There
isn't yet a way to turn WWM off -- that will be added in a future
change.

Finally, it turns out that turning on inactive lanes causes a number of
problems with register allocation. While the best long-term solution
seems like teaching LLVM's register allocator about predication, for now
we need to add some hacks to prevent ourselves from getting into trouble
due to constraints that aren't currently expressed in LLVM. For the gory
details, see the comments at the top of SIFixWWMLiveness.cpp.

Reviewers: arsenm, nhaehnle, tpr

Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D35524

llvm-svn: 310087
2017-08-04 18:36:52 +00:00
Matt Arsenault 7016f13450 AMDGPU: Add analysis pass for function argument info
This will allow only adding necessary inputs to callee functions
that need special inputs forwarded from the kernel.

llvm-svn: 309996
2017-08-03 22:30:46 +00:00
Tom Stellard a2f57be260 AMDGPU/R600: Initialize more passes
Reviewers: arsenm

Reviewed By: arsenm

Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits

Differential Revision: https://reviews.llvm.org/D36128

llvm-svn: 309893
2017-08-02 22:19:45 +00:00
Stanislav Mekhanoshin 37e7f959c0 [AMDGPU] Collapse adjacent SI_END_CF
Add a pass to remove redundant S_OR_B64 instructions enabling lanes in
the exec. If two SI_END_CF (lowered as S_OR_B64) come together without any
vector instructions between them we can only keep outer SI_END_CF, given
that CFG is structured and exec bits of the outer end statement are always
not less than exec bit of the inner one.

This needs to be done before the RA to eliminate saved exec bits registers
but after register coalescer to have no vector registers copies in between
of different end cf statements.

Differential Revision: https://reviews.llvm.org/D35967

llvm-svn: 309762
2017-08-01 23:14:32 +00:00
Matt Arsenault c06574ffc0 AMDGPU: Add pass to replace out arguments
It is better to return arguments directly in registers
if we are making a call rather than introducing expensive
stack usage. In one of sample compile from one of
Blender's many kernel variants, this fires on about
~20 different functions. Future improvements may be to
recognize simple cases where the pointer is indexing a small
array. This also fails when the store to the out argument
is in a separate block from the return, which happens in
a few of the Blender functions. This should also probably
be using MemorySSA which might help with that.

I'm not sure this is correct as a FunctionPass, but
MemoryDependenceAnalysis seems to not work with
a ModulePass.

I'm also not sure where it should run.I think it should
run  before DeadArgumentElimination, so maybe either
EP_CGSCCOptimizerLate or EP_ScalarOptimizerLate.

llvm-svn: 309416
2017-07-28 18:40:05 +00:00
Konstantin Zhuravlyov e9a5a77ee3 AMDGPU: Implement memory model
llvm-svn: 308781
2017-07-21 21:19:23 +00:00
Matt Arsenault 6b93046f29 AMDGPU: Annotate call graph with used features
Previously this wouldn't detect used features indirectly
used in callee functions.

llvm-svn: 307967
2017-07-13 21:43:42 +00:00
Matt Arsenault 7c525903ef AMDGPU: Remove SITypeRewriter
This was an old workaround for using v16i8 in some old intrinsics
for resource descriptors.

llvm-svn: 306603
2017-06-28 21:38:50 +00:00
Matt Arsenault 746e065716 AMDGPU: Register AMDGPUAlwaysInline
llvm-svn: 304574
2017-06-02 18:02:42 +00:00
Francis Visoiu Mistrih 8b61764cbb [LegacyPassManager] Remove TargetMachine constructors
This provides a new way to access the TargetMachine through
TargetPassConfig, as a dependency.

The patterns replaced here are:

* Passes handling a null TargetMachine call
  `getAnalysisIfAvailable<TargetPassConfig>`.

* Passes not handling a null TargetMachine
  `addRequired<TargetPassConfig>` and call
  `getAnalysis<TargetPassConfig>`.

* MachineFunctionPasses now use MF.getTarget().

* Remove all the TargetMachine constructors.
* Remove INITIALIZE_TM_PASS.

This fixes a crash when running `llc -start-before prologepilog`.

PEI needs StackProtector, which gets constructed without a TargetMachine
by the pass manager. The StackProtector pass doesn't handle the case
where there is no TargetMachine, so it segfaults.

Related to PR30324.

Differential Revision: https://reviews.llvm.org/D33222

llvm-svn: 303360
2017-05-18 17:21:13 +00:00
Jan Sjodin a06bfe054e Re-submit AMDGPUMachineCFGStructurizer.
Differential Revision: https://reviews.llvm.org/D23209

llvm-svn: 303111
2017-05-15 20:18:37 +00:00
Jan Sjodin 0e289822fa Revert 303091.
llvm-svn: 303098
2017-05-15 18:39:47 +00:00
Jan Sjodin e9d2ddc9dd Add AMDGPUMachineCFGStructurizer.
Differential Revision: https://reviews.llvm.org/D23209

llvm-svn: 303091
2017-05-15 18:13:56 +00:00
Stanislav Mekhanoshin c90347d760 [AMDGPU] Generate range metadata for workitem id
If workgroup size is known inform llvm about range returned by local
id  and local size queries.

Differential Revision: https://reviews.llvm.org/D31804

llvm-svn: 300102
2017-04-12 20:48:56 +00:00
Kannan Narayanan acb089e12a [AMDGPU] Add a new pass to insert waitcnts. Leave under an option for testing.
Based on comments in https://reviews.llvm.org/D31161.

llvm-svn: 300023
2017-04-12 03:25:12 +00:00
Matt Arsenault 678e111e11 AMDGPU: Fix crash when disassembling VOP3 mac
The unused dummy src2_modifiers is missing, so it crashes
when trying to print it.

I tried to fully remove src2_modifiers, but there are some
irritations in the places where it is converted to mad since
it starts to require modifying use lists while iterating over
them.

llvm-svn: 299861
2017-04-10 17:58:06 +00:00
Yaxun Liu 76ae47cb35 [AMDGPU] Temporarily change constant address space from 4 to 2
Our final address space mapping is to let constant address space to be 4 to match nvptx.
However for now we will make it 2 to avoid unnecessary work in FE/BE/devlib
about intrinsics returning constant pointers.

Differential Revision: https://reviews.llvm.org/D31770

llvm-svn: 299690
2017-04-06 19:17:32 +00:00
Stanislav Mekhanoshin 89653dfd2a [AMDGPU] Add GlobalOpt parameter to Always Inliner pass
If set to false it does not remove global aliases. With this parameter
set to false it should be safe to run the pass before link.

Differential Revision: https://reviews.llvm.org/D31489

llvm-svn: 299108
2017-03-30 20:16:02 +00:00
Yaxun Liu 1a14bfa022 [AMDGPU] Get address space mapping by target triple environment
As we introduced target triple environment amdgiz and amdgizcl, the address
space values are no longer enums. We have to decide the value by target triple.

The basic idea is to use struct AMDGPUAS to represent address space values.
For address space values which are not depend on target triple, use static
const members, so that they don't occupy extra memory space and is equivalent
to a compile time constant.

Since the struct is lightweight and cheap, it can be created on the fly at
the point of usage. Or it can be added as member to a pass and created at
the beginning of the run* function.

Differential Revision: https://reviews.llvm.org/D31284

llvm-svn: 298846
2017-03-27 14:04:01 +00:00
Matt Arsenault b8f8dbc227 AMDGPU: Unify divergent function exits.
StructurizeCFG can't handle cases with multiple
returns creating regions with multiple exits.
Create a copy of UnifyFunctionExitNodes that only
unifies exit nodes that skips exit nodes
with uniform branch sources.

llvm-svn: 298729
2017-03-24 19:52:05 +00:00
Sam Kolton f60ad58dad [ADMGPU] SDWA peephole optimization pass.
Summary:
First iteration of SDWA peephole.

This pass tries to combine several instruction into one SDWA instruction. E.g. it converts:
'''
    V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
    V_ADD_I32_e32 %vreg2, %vreg0, %vreg3
    V_LSHLREV_B32_e32 %vreg4, 16, %vreg2
'''
Into:
'''
   V_ADD_I32_sdwa %vreg4, %vreg1, %vreg3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
'''

Pass structure:
    1. Iterate over machine instruction in basic block and try to apply "SDWA patterns" to each of them. SDWA patterns match machine instruction into either source or destination SDWA operand. E.g. ''' V_LSHRREV_B32_e32 %vreg0, 16, %vreg1''' is matched to source SDWA operand '''%vreg1 src_sel:WORD_1'''.
    2. Iterate over found SDWA operands and find instruction that could be potentially coverted into SDWA. E.g. for source SDWA operand potential instruction are all instruction in this basic block that uses '''%vreg0'''
    3. Iterate over all potential instructions and check if they can be converted into SDWA.
    4. Convert instructions to SDWA.

This review contains basic implementation of SDWA peephole pass. This pass requires additional testing fot both correctness and performance (no performance testing done).
There are several ways this pass can be improved:
    1. Make this pass work on whole function not only basic block. As I can see this can be done right now without changes to pass.
    2. Introduce more SDWA patterns
    3. Introduce mnemonics to limit when SDWA patterns should apply

Reviewers: vpykhtin, alex-t, arsenm, rampitec

Subscribers: wdng, nhaehnle, mgorny

Differential Revision: https://reviews.llvm.org/D30038

llvm-svn: 298365
2017-03-21 12:51:34 +00:00
Stanislav Mekhanoshin 8e45acfc38 [AMDGPU] Add address space based alias analysis pass
This is direct port of HSAILAliasAnalysis pass, just cleaned for
style and renamed.

Differential Revision: https://reviews.llvm.org/D31103

llvm-svn: 298172
2017-03-17 23:56:58 +00:00
Matt Arsenault e823d92f7f AMDGPU: Merge initial gfx9 support
llvm-svn: 295554
2017-02-18 18:29:53 +00:00
Matt Arsenault 0699ef39ce AMDGPU: Add pass to expand memcpy/memmove/memset
llvm-svn: 294635
2017-02-09 22:00:42 +00:00
Stanislav Mekhanoshin f6c1feb8c3 [AMDGPU] Turn AMDGPUUnifyMetadata back into module pass
With the adjustPassManager interface that is now possible to use
custom early module passes.

Differential Revision: https://reviews.llvm.org/D29189

llvm-svn: 293300
2017-01-27 16:38:10 +00:00
Stanislav Mekhanoshin 22a56f2f5a [AMDGPU] Add VGPR copies post regalloc fix pass
Regalloc creates COPY instructions which do not formally use VALU.
That results in v_mov instructions displaced after exec mask modification.
One pass which do it is SIOptimizeExecMasking, but potentially it can be
done by other passes too.

This patch adds a pass immediately after regalloc to add implicit exec
use operand to all VGPR copy instructions.

Differential Revision: https://reviews.llvm.org/D28874

llvm-svn: 292956
2017-01-24 17:46:17 +00:00
Stanislav Mekhanoshin 50ea93a2bd [AMDGPU] Add amdgpu-unify-metadata pass
Multiple metadata values for records such as opencl.ocl.version, llvm.ident
and similar are created after linking several modules. For some of them, notably
opencl.ocl.version, this creates semantic problem because we cannot tell which
version of OpenCL the composite module conforms.

Moreover, such repetitions of identical values often create a huge list of
unneeded metadata, which grows bitcode size both in memory and stored on disk.
It can go up to several Mb when linked against our OpenCL library. Lastly, such
long lists obscure reading of dumped IR.

The pass unifies metadata after linking.

Differential Revision: https://reviews.llvm.org/D25381

llvm-svn: 289092
2016-12-08 19:46:04 +00:00
Mehdi Amini f42454b94b Move the global variables representing each Target behind accessor function
This avoids "static initialization order fiasco"

Differential Revision: https://reviews.llvm.org/D25412

llvm-svn: 283702
2016-10-09 23:00:34 +00:00
Konstantin Zhuravlyov 60a8373778 [AMDGPU] Pass optimization level to SelectionDAGISel
llvm-svn: 283133
2016-10-03 18:47:26 +00:00
Matt Arsenault e6740754f0 AMDGPU: Partially fix control flow at -O0
Fixes to allow spilling all registers at the end of the block
work with exec modifications. Don't emit s_and_saveexec_b64 for
if lowering, and instead emit copies. Mark control flow mask
instructions as terminators to get correct spill code placement
with fast regalloc, and then have a separate optimization pass
form the saveexec.

This should work if SGPRs are spilled to VGPRs, but
will likely fail in the case that an SGPR spills to memory
and no workitem takes a divergent branch.

llvm-svn: 282667
2016-09-29 01:44:16 +00:00
Matt Arsenault 78fc9daf8d AMDGPU: Split SILowerControlFlow into two pieces
Do most of the lowering in a pre-RA pass. Keep the skip jump
insertion late, plus a few other things that require more
work to move out.

One concern I have is now there may be COPY instructions
which do not have the necessary implicit exec uses
if they will be lowered to v_mov_b32.

This has a positive effect on SGPR usage in shader-db.

llvm-svn: 279464
2016-08-22 19:33:16 +00:00
Matt Arsenault 2ffe8fd2ce AMDGPU: Prune includes
llvm-svn: 278391
2016-08-11 19:18:50 +00:00
Matt Arsenault a1fe17c9ad AMDGPU: Change fdiv lowering based on !fpmath metadata
If 2.5 ulp is acceptable, denormals are not required, and
isn't a reciprocal which will already be handled, replace
with a faster fdiv.

Simplify the lowering tests by using per function
subtarget features.

llvm-svn: 276051
2016-07-19 23:16:53 +00:00
Matt Arsenault ca7f5701f8 AMDGPU/R600: Delete/rename intrinsics no longer used by mesa
Use the replacement pass to update the tests, and delete old names.

llvm-svn: 275375
2016-07-14 05:47:17 +00:00
Matt Arsenault 86de486d31 AMDGPU: Add stub custom CodeGenPrepare pass
This will do various things including ones
CodeGenPrepare does, but with knowledge of uniform
values.

llvm-svn: 273657
2016-06-24 07:07:55 +00:00
Matt Arsenault c3a01ec9db AMDGPU: Properly initialize SIShrinkInstructions
llvm-svn: 272336
2016-06-09 23:18:47 +00:00
Matt Arsenault ec30eb507e AMDGPU: Remove unused address space
Also return a single StringRef instead of building a string.

llvm-svn: 271296
2016-05-31 16:57:45 +00:00
Jan Vesely 81f1b30035 AMDGPU/EG,CM: Add instruction to read from constant AS (VTX2)
Reviewers: tstellard

Subscribers: arsenm

Differential Revision: http://reviews.llvm.org/D19785

llvm-svn: 269473
2016-05-13 20:39:16 +00:00
Konstantin Zhuravlyov a791932145 [AMDGPU][NFC] Rename SIInsertNops -> SIDebuggerInsertNops
Differential Revision: http://reviews.llvm.org/D20117

llvm-svn: 269098
2016-05-10 18:33:41 +00:00
Nicolai Haehnle 723b73b4eb AMDGPU: Remove SIFixSGPRLiveRanges pass
Summary:
This pass is unnecessary and overly conservative. It was motivated by
situations like

  def %vreg0:SGPR_32
  ...
if-block:
  ..
  def %vreg1:SGPR_32
  ...
else-block:
  ...
  use %vreg0:SGPR_32
  ...

and similar situations with uses after the non-uniform control flow, where
we are not allowed to assign %vreg0 and %vreg1 to the same physical register,
even though in the original, thread/workitem-based CFG, it looks like the
live ranges of these registers do not overlap.

However, by the time register allocation runs, we have moved to a wave-based
CFG that accurately represents the fact that the wave may run through both
the if- and the else-block. So the live ranges of %vreg0 and %vreg1 already
overlap even without the SIFixSGPRLiveRanges pass.

In addition to proving this change correct, I have tested it with Piglit
and a small number of other tests.

Reviewers: arsenm, tstellarAMD

Subscribers: MatzeB, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D19041

llvm-svn: 266345
2016-04-14 17:42:29 +00:00
Nicolai Haehnle df3a20cd80 AMDGPU: Add a shader calling convention
This makes it possible to distinguish between mesa shaders
and other kernels even in the presence of compute shaders.

Patch By: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>

Differential Revision: http://reviews.llvm.org/D18559

llvm-svn: 265589
2016-04-06 19:40:20 +00:00
Nicolai Haehnle 213e87f2ee AMDGPU: Add SIWholeQuadMode pass
Summary:
Whole quad mode is already enabled for pixel shaders that compute
derivatives, but it must be suspended for instructions that cause a
shader to have side effects (i.e. stores and atomics).

This pass addresses the issue by storing the real (initial) live mask
in a register, masking EXEC before instructions that require exact
execution and (re-)enabling WQM where required.

This pass is run before register coalescing so that we can use
machine SSA for analysis.

The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.

This patch fixes the problem by adding use-implicit operands to
target independent instructions. Some general codegen passes are
relaxed to work with such implicit use operands.

Reviewers: arsenm, tstellarAMD, mareko

Subscribers: MatzeB, arsenm, llvm-commits

Differential Revision: http://reviews.llvm.org/D18162

llvm-svn: 263982
2016-03-21 20:28:33 +00:00
Matt Arsenault 6b6a2c37bc AMDGPU: R600 code splitting cleanup
Move a few functions only used by R600 to R600 specific code,
fix header macros to stop using R600, mark classes as final.

llvm-svn: 263204
2016-03-11 08:00:27 +00:00