=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

.. toctree::
   :hidden:

   AMDGPU/AMDGPUAsmGFX7
   AMDGPU/AMDGPUAsmGFX8
   AMDGPU/AMDGPUAsmGFX9
   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX10
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPUModifierSyntax
   AMDGPUOperandSyntax
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family up to the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.
|
|
|
|
LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:
|
|
|
|
.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================
|
|
|
|
.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================
|
|
|
|
.. table:: AMDGPU Operating Systems
   :name: amdgpu-os

   ============== ============================================================
   OS             Description
   ============== ============================================================
   *<empty>*      Defaults to the *unknown* OS.
   ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                  such as:

                  - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
                    loader on Linux. See *AMD ROCm Platform Release Notes*
                    [AMD-ROCm-Release-Notes]_ for supported hardware and
                    software.
                  - AMD's PAL runtime using the *pal-amdhsa* loader on
                    Windows.

   ``amdpal``     Graphic shaders and compute kernels executed on AMD's PAL
                  runtime using the *pal-amdpal* loader on Windows and Linux
                  Pro.
   ``mesa3d``     Graphic shaders and compute kernels executed on AMD's Mesa
                  3D runtime using the *mesa-mesa3d* loader on Linux.
   ============== ============================================================
|
|
|
|
.. table:: AMDGPU Environments
   :name: amdgpu-environment-table

   ============ ==============================================================
   Environment  Description
   ============ ==============================================================
   *<empty>*    Default.
   ============ ==============================================================
|
|
|
|
.. _amdgpu-processors:
|
|
|
|
Processors
|
|
----------
|
|
|
|
Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
|
|
specify the AMDGPU processor together with optional target features. See
|
|
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
|
|
specific information.
|
|
|
|
Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
|
|
|
|
* ``amdhsa`` is not supported in ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
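
The triple tables and the ``r600`` exception above can be checked mechanically.
The following is a minimal sketch, not part of LLVM or Clang: the helper name
``make_triple`` and the set names are hypothetical, the accepted values are
copied from the tables in this section, and an empty OS is passed through
unchanged on the assumption that the driver then applies the *unknown* default.

```python
# Values copied from the architecture, vendor and OS tables in this section.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


def make_triple(arch, vendor, os_name=""):
    """Build '<arch>-<vendor>-<os>' (hypothetical helper, illustration only)."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os_name!r}")
    # The one exception listed above: amdhsa is not supported on r600.
    if arch == "r600" and os_name == "amdhsa":
        raise ValueError("amdhsa is not supported in the r600 architecture")
    # The environment component is omitted; only the empty default is listed.
    return f"{arch}-{vendor}-{os_name}"


print(make_triple("amdgcn", "amd", "amdhsa"))
```

The resulting string is what would be passed to ``-target``.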
|
|
|
|
|
|
.. table:: AMDGPU Processors
|
|
:name: amdgpu-processor-table
|
|
|
|
=========== =============== ============ ===== ================= =============== =============== ======================
|
|
Processor Alternative Target dGPU/ Target Target OS Support Example
|
|
Processor Triple APU Features Properties *(see* Products
|
|
Architecture Supported `amdgpu-os`_
|
|
*and
|
|
corresponding
|
|
runtime release
|
|
notes for
|
|
current
|
|
information and
|
|
level of
|
|
support)*
|
|
=========== =============== ============ ===== ================= =============== =============== ======================
|
|
**Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``r600`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``r630`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``rs880`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``rv670`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
**Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``rv710`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``rv730`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``rv770`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
**Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``cedar`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``cypress`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``juniper`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``redwood`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``sumo`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
**Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``barts`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``caicos`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``cayman`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``turks`` ``r600`` dGPU - Does not
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
**GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
|
|
support
|
|
generic
|
|
address
|
|
space
|
|
``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
|
|
- ``verde`` support
|
|
generic
|
|
address
|
|
space
|
|
``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
|
|
- ``oland`` support
|
|
generic
|
|
address
|
|
space
|
|
**GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
|
|
flat - *pal-amdhsa* - A6 Pro-7050B
|
|
scratch - *pal-amdpal* - A8-7100
|
|
- A8 Pro-7150B
|
|
- A10-7300
|
|
- A10 Pro-7350B
|
|
- FX-7500
|
|
- A8-7200P
|
|
- A10-7400P
|
|
- FX-7600P
|
|
``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
|
|
flat - *pal-amdhsa* - FirePro W9100
|
|
scratch - *pal-amdpal* - FirePro S9150
|
|
- FirePro S9170
|
|
``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
|
|
flat - *pal-amdhsa* - Radeon R9 290x
|
|
scratch - *pal-amdpal* - Radeon R390
|
|
- Radeon R390x
|
|
``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
|
|
- ``mullins`` flat - *pal-amdpal* - E1-2200
|
|
scratch - E1-2500
|
|
- E2-3000
|
|
- E2-3800
|
|
- A4-5000
|
|
- A4-5100
|
|
- A6-5200
|
|
- A4 Pro-3340B
|
|
``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
|
|
flat - *pal-amdpal* - Radeon HD 8770
|
|
scratch - R7 260
|
|
- R7 260X
|
|
``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
|
|
flat - *pal-amdpal*
|
|
scratch .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
**GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
|
|
flat - *pal-amdhsa* - Pro A6-8500B
|
|
scratch - *pal-amdpal* - A8-8600P
|
|
- Pro A8-8600B
|
|
- FX-8800P
|
|
- Pro A12-8800B
|
|
- A10-8700P
|
|
- Pro A10-8700B
|
|
- A10-8780P
|
|
- A10-9600P
|
|
- A10-9630P
|
|
- A12-9700P
|
|
- A12-9730P
|
|
- FX-9800P
|
|
- FX-9830P
|
|
- E2-9010
|
|
- A6-9210
|
|
- A9-9410
|
|
``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
|
|
- ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
|
|
scratch - *pal-amdpal* - Radeon R9 385
|
|
``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
|
|
- *pal-amdhsa* - Radeon R9 Fury
|
|
- *pal-amdpal* - Radeon R9 FuryX
|
|
- Radeon Pro Duo
|
|
- FirePro S9300x2
|
|
- Radeon Instinct MI8
|
|
\ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
|
|
flat - *pal-amdhsa* - Radeon RX 480
|
|
scratch - *pal-amdpal* - Radeon Instinct MI6
|
|
\ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
|
|
flat - *pal-amdhsa*
|
|
scratch - *pal-amdpal*
|
|
``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
|
|
flat - *pal-amdhsa* - FirePro S7100
|
|
scratch - *pal-amdpal* - FirePro W7100
|
|
- Mobile FirePro
|
|
M7170
|
|
``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
|
|
flat - *pal-amdhsa*
|
|
scratch - *pal-amdpal* .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
**GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
|
|
flat - *pal-amdhsa* Frontier Edition
|
|
scratch - *pal-amdpal* - Radeon RX Vega 56
|
|
- Radeon RX Vega 64
|
|
- Radeon RX Vega 64
|
|
Liquid
|
|
- Radeon Instinct MI25
|
|
``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
|
|
flat - *pal-amdhsa* - Ryzen 5 2400G
|
|
scratch - *pal-amdpal*
|
|
``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
|
|
- *pal-amdhsa*
|
|
- *pal-amdpal* .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
|
|
- xnack flat - *pal-amdhsa* - Radeon Instinct MI60
|
|
scratch - *pal-amdpal* - Radeon VII
|
|
- Radeon Pro VII
|
|
``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
|
|
- xnack - Absolute
|
|
flat
|
|
scratch
|
|
``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
|
|
flat
|
|
scratch .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
|
|
- tgsplit flat
|
|
- xnack scratch .. TODO::
|
|
- Packed
|
|
work-item Add product
|
|
IDs names.
|
|
|
|
``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
|
|
flat - Ryzen 7 4700GE
|
|
scratch - Ryzen 5 4600G
|
|
- Ryzen 5 4600GE
|
|
- Ryzen 3 4300G
|
|
- Ryzen 3 4300GE
|
|
- Ryzen Pro 4000G
|
|
- Ryzen 7 Pro 4700G
|
|
- Ryzen 7 Pro 4750GE
|
|
- Ryzen 5 Pro 4650G
|
|
- Ryzen 5 Pro 4650GE
|
|
- Ryzen 3 Pro 4350G
|
|
- Ryzen 3 Pro 4350GE
|
|
|
|
**GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
|
|
- wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
|
|
- xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
|
|
- Radeon Pro 5600M
|
|
``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
|
|
- wavefrontsize64 - Absolute - *pal-amdhsa*
|
|
- xnack flat - *pal-amdpal*
|
|
scratch
|
|
``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
|
|
- wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
|
|
- xnack scratch - *pal-amdpal*
|
|
``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
|
|
- wavefrontsize64 flat - *pal-amdhsa*
|
|
- xnack scratch - *pal-amdpal* .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
**GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
|
|
-----------------------------------------------------------------------------------------------------------------------
|
|
``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
|
|
- wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
|
|
scratch - *pal-amdpal* - Radeon RX 6900 XT
|
|
``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
|
|
- wavefrontsize64 flat - *pal-amdhsa*
|
|
scratch - *pal-amdpal*
|
|
``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
|
|
- wavefrontsize64 flat - *pal-amdhsa*
|
|
scratch - *pal-amdpal* .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
|
|
- wavefrontsize64 flat
|
|
scratch .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
|
|
- wavefrontsize64 flat
|
|
scratch .. TODO::
|
|
|
|
Add product
|
|
names.
|
|
|
|
``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
|
|
- wavefrontsize64 flat
|
|
scratch .. TODO::
|
|
Add product
|
|
names.
|
|
|
|
=========== =============== ============ ===== ================= =============== =============== ======================
|
|
|
|
.. _amdgpu-target-features:
|
|
|
|
Target Features
|
|
---------------
|
|
|
|
Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``
  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.
|
|
|
|
.. table:: AMDGPU Target Features
|
|
:name: amdgpu-target-features-table
|
|
|
|
=============== ============================ ==================================================
|
|
Target Feature Clang Option to Control Description
|
|
Name
|
|
=============== ============================ ==================================================
|
|
cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
|
|
when generating code for kernels. When disabled
|
|
native WGP wavefront execution mode is used,
|
|
when enabled CU wavefront execution mode is used
|
|
(see :ref:`amdgpu-amdhsa-memory-model`).
|
|
|
|
sramecc - ``-mcpu`` If specified, generate code that can only be
|
|
- ``--offload-arch`` loaded and executed in a process that has a
|
|
matching setting for SRAMECC.
|
|
|
|
If not specified for code object V2 to V3, generate
|
|
code that can be loaded and executed in a process
|
|
with SRAMECC enabled.
|
|
|
|
If not specified for code object V4 or above, generate
|
|
code that can be loaded and executed in a process
|
|
with either setting of SRAMECC.
|
|
|
|
tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
|
|
work-groups are launched in threadgroup split mode.
|
|
When enabled the waves of a work-group may be
|
|
launched in different CUs.
|
|
|
|
wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
|
|
generating code for kernels. When disabled
|
|
native wavefront size 32 is used, when enabled
|
|
wavefront size 64 is used.
|
|
|
|
xnack - ``-mcpu`` If specified, generate code that can only be
|
|
- ``--offload-arch`` loaded and executed in a process that has a
|
|
matching setting for XNACK replay.
|
|
|
|
If not specified for code object V2 to V3, generate
|
|
code that can be loaded and executed in a process
|
|
with XNACK replay enabled.
|
|
|
|
If not specified for code object V4 or above, generate
|
|
code that can be loaded and executed in a process
|
|
with either setting of XNACK replay.
|
|
|
|
XNACK replay can be used for demand paging and
|
|
page migration. If enabled in the device, then if
|
|
a page fault occurs the code may execute
|
|
incorrectly unless generated with XNACK replay
|
|
enabled, or generated for code object V4 or above without
|
|
specifying XNACK replay. Executing code that was
|
|
generated with XNACK replay enabled, or generated
|
|
for code object V4 or above without specifying XNACK replay,
|
|
on a device that does not have XNACK replay
|
|
enabled will execute correctly but may be less
|
|
performant than code generated for XNACK replay
|
|
disabled.
|
|
=============== ============================ ==================================================
|
|
|
|
.. _amdgpu-target-id:
|
|
|
|
Target ID
|
|
---------
|
|
|
|
AMDGPU supports target IDs. See `Clang Offload Bundler
|
|
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
|
|
description. The AMDGPU target specific information is:
|
|
|
|
**processor**
|
|
Is an AMDGPU processor or alternative processor name specified in
|
|
:ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
|
|
the primary processor and alternative processor names. The canonical form
|
|
target ID only allows the primary processor name.
|
|
|
|
**target-feature**
|
|
Is a target feature name specified in :ref:`amdgpu-target-features-table` that
|
|
is supported by the processor. The target features supported by each processor
|
|
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
|
|
a target ID are marked as being controlled by ``-mcpu`` and
|
|
``--offload-arch``. Each target feature must appear at most once in a target
|
|
ID. The non-canonical form target ID allows the target features to be
|
|
specified in any order. The canonical form target ID requires the target
|
|
features to be specified in alphabetic order.
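
The canonical form rules above (primary processor name, each feature at most
once, features in alphabetic order) can be sketched as a small normalizer. This
is an illustration only, not the Clang implementation: the function name
``canonical_target_id`` is hypothetical, the ``processor:feature+`` /
``processor:feature-`` spelling is taken from the ``-mcpu=gfx908:xnack+``
examples in this guide, and the alternative-name mapping must be supplied by
the caller from :ref:`amdgpu-processor-table`.

```python
def canonical_target_id(target_id, alternative_names=None):
    """Return the canonical form of a target ID (hypothetical helper)."""
    alternative_names = alternative_names or {}
    processor, *features = target_id.split(":")
    # The canonical form only allows the primary processor name.
    processor = alternative_names.get(processor, processor)

    settings = {}
    for feature in features:
        name, setting = feature[:-1], feature[-1:]
        if not name or setting not in ("+", "-"):
            raise ValueError(f"malformed target feature: {feature!r}")
        # Each target feature must appear at most once.
        if name in settings:
            raise ValueError(f"duplicate target feature: {name!r}")
        settings[name] = setting

    # The canonical form requires alphabetic feature order.
    ordered = [f"{name}{settings[name]}" for name in sorted(settings)]
    return ":".join([processor] + ordered)


print(canonical_target_id("gfx908:xnack+:sramecc-"))
print(canonical_target_id("hawaii", {"hawaii": "gfx701"}))
```

The sketch does not check that the processor actually supports each named
feature; that lookup would also come from :ref:`amdgpu-processor-table`.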
|
|
|
|
.. _amdgpu-target-id-v2-v3:
|
|
|
|
Code Object V2 to V3 Target ID
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The target ID syntax for code object V2 to V3 is the same as defined in `Clang
|
|
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
|
|
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
|
|
directive and the bundle entry ID. In those cases it has the following BNF
|
|
syntax:
|
|
|
|
.. code::
|
|
|
|
<target-id> ::== <processor> ( "+" <target-feature> )*
|
|
|
|
Where a target feature is omitted if *Off* and present if *On* or *Any*.
|
|
|
|
.. note::
|
|
|
|
The code object V2 to V3 cannot represent *Any* and treats it the same as
|
|
*On*.
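
As a rough illustration of the rule above, the following sketch renders the
code object V2 to V3 spelling from per-feature settings. The function name
``v2_v3_target_id`` and the ``"on"``/``"off"``/``"any"`` strings are
assumptions made for this example. The BNF does not state a feature order, so
the sketch simply keeps the order supplied by the caller.

```python
def v2_v3_target_id(processor, features):
    """Render '<processor>(+<feature>)*' (hypothetical helper).

    `features` maps a target feature name to "on", "off" or "any".
    """
    parts = [processor]
    for name, setting in features.items():
        if setting not in ("on", "off", "any"):
            raise ValueError(f"unknown setting for {name!r}: {setting!r}")
        # A feature is omitted if Off and present if On or Any, so Any is
        # indistinguishable from On in this form.
        if setting != "off":
            parts.append(name)
    return "+".join(parts)


print(v2_v3_target_id("gfx908", {"sramecc": "off", "xnack": "any"}))
```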
|
|
|
|
.. _amdgpu-embedding-bundled-objects:
|
|
|
|
Embedding Bundled Code Objects
|
|
------------------------------
|
|
|
|
AMDGPU supports the HIP and OpenMP languages that perform code object embedding
|
|
as described in `Clang Offload Bundler
|
|
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
|
|
|
|
.. note::
|
|
|
|
The target ID syntax used for code object V2 to V3 for a bundle entry ID
|
|
differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
|
|
|
|
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.
|
|
|
|
.. table:: AMDGPU Address Spaces
   :name: amdgpu-address-spaces-table

   ================================= =============== =========== ================ ======= ============================
   ..                                                                             64-Bit Process Address Space
   --------------------------------- --------------- ----------- ---------------- ------------------------------------
   Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                     Space Number    Name        Name             Size
   ================================= =============== =========== ================ ======= ============================
   Generic                           0               flat        flat             64      0x0000000000000000
   Global                            1               global      global           64      0x0000000000000000
   Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
   Local                             3               group       LDS              32      0xFFFFFFFF
   Constant                          4               constant    *same as global* 64      0x0000000000000000
   Private                           5               private     scratch          32      0xFFFFFFFF
   Constant 32-bit                   6               *TODO*                               0x00000000
   Buffer Fat Pointer (experimental) 7               *TODO*
   ================================= =============== =========== ================ ======= ============================
|
|
|
|
**Generic**
|
|
The generic address space is supported unless the *Target Properties* column
|
|
of :ref:`amdgpu-processor-table` specifies *Does not support generic address
|
|
space*.
|
|
|
|
The generic address space uses the hardware flat address support for two fixed
|
|
ranges of virtual addresses (the private and local apertures), that are
|
|
outside the range of addressable global memory, to map from a flat address to
|
|
a private or local address. This uses FLAT instructions that can take a flat
|
|
address and access global, private (scratch), and group (LDS) memory depending
|
|
on whether the address is within one of the aperture ranges.
|
|
|
|
Flat access to scratch requires hardware aperture setup and setup in the
|
|
kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
|
|
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
|
|
setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
|
|
|
|
To convert between a private or group address space address (termed a segment
|
|
address) and a flat address the base address of the corresponding aperture
|
|
can be used. For GFX7-GFX8 these are available in the
|
|
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
|
|
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
|
|
GFX9-GFX10 the aperture base addresses are directly available as inline
|
|
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
|
|
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
|
|
aligned to 2^32 which makes it easier to convert from flat to segment or
|
|
segment to flat.
|
|
|
|
A global address space address has the same value when used as a flat address
|
|
so no conversion is needed.
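
Because each aperture is 2^32 bytes and its base is aligned to 2^32, the
conversion between a segment address and a flat address reduces to combining or
masking the low 32 bits. The following is a minimal sketch of that arithmetic
only: the function names are hypothetical, the aperture base in the usage line
is a made-up value, and NULL handling (the flat and segment NULL values differ,
see :ref:`amdgpu-address-spaces-table`) is deliberately left out.

```python
APERTURE_SIZE = 1 << 32  # 2^32 bytes per aperture in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Map a 32-bit private or local address into its flat aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR acts as addition.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    if (flat_address & ~(APERTURE_SIZE - 1)) != aperture_base:
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


example_base = 0x1000_0000_0000  # hypothetical aperture base, aligned to 2^32
flat = segment_to_flat(example_base, 0x1234)
print(hex(flat), hex(flat_to_segment(example_base, flat)))
```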
|
|
|
|
**Global and Constant**
|
|
The global and constant address spaces both use global virtual addresses,
|
|
which are the same virtual address space used by the CPU. However, some
|
|
virtual addresses may only be accessible to the CPU, some only accessible
|
|
by the GPU, and some by both.
|
|
|
|
Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified from the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only device global memory is visible to
and managed by the host. Volatile data in the vector and scalar L1 caches is
invalidated before each kernel dispatch so that constant memory can change
values between kernel dispatches.
|
|
|
|
**Region**
|
|
The region address space uses the hardware Global Data Store (GDS). All
|
|
wavefronts executing on the same device will access the same memory for any
|
|
given region address. However, the same region address accessed by wavefronts
|
|
executing on different devices will access different memory. It is higher
|
|
performance than global memory. It is allocated by the runtime. The data
|
|
store (DS) instructions can be used to access it.
|
|
|
|
**Local**
|
|
The local address space uses the hardware Local Data Store (LDS) which is
|
|
automatically allocated when the hardware creates the wavefronts of a
|
|
work-group, and freed when all the wavefronts of a work-group have
|
|
terminated. All wavefronts belonging to the same work-group will access the
|
|
same memory for any given local address. However, the same local address
|
|
accessed by wavefronts belonging to different work-groups will access
|
|
different memory. It is higher performance than global memory. The data store
|
|
(DS) instructions can be used to access it.
|
|
|
|
**Private**
|
|
The private address space uses the hardware scratch memory support which
|
|
automatically allocates memory when it creates a wavefront and frees it when
|
|
a wavefront terminates. The memory accessed by a lane of a wavefront for any
|
|
given private address will be different to the memory accessed by another lane
|
|
of the same or different wavefront for the same private address.
|
|
|
|
If a kernel dispatch uses scratch, then the hardware allocates memory from a
|
|
pool of backing memory allocated by the runtime for each wavefront. The lanes
|
|
of the wavefront access this using dword (4 byte) interleaving. The mapping
|
|
used from private address to backing memory address is:
|
|
|
|
``wavefront-scratch-base +
((private-address / 4) * wavefront-size * 4) +
(wavefront-lane-id * 4) + (private-address % 4)``
|
|
|
|
If each lane of a wavefront accesses the same private address, the
|
|
interleaving results in adjacent dwords being accessed and hence requires
|
|
fewer cache lines to be fetched.
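
The mapping above can be written out directly. This is a minimal sketch of the
formula only, with a hypothetical function name and a default wavefront size of
64 chosen for the example. It shows that lanes reading the same private address
land on consecutive dwords of backing memory.

```python
def backing_address(wavefront_scratch_base, private_address, lane_id,
                    wavefront_size=64):
    """Apply the dword-interleaved private-to-backing-memory mapping."""
    # Split the private address into a dword index and a byte within it.
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # one dword row per wavefront
            + lane_id * 4                       # this lane's slot in the row
            + byte_in_dword)


# Lanes 0 to 3 all access private address 8.
print([backing_address(0, 8, lane) for lane in range(4)])
```

With a base of 0 the four lanes map to byte offsets 512, 516, 520 and 524,
which are adjacent dwords.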
|
|
|
|
There are different ways that the wavefront scratch base address is
|
|
determined by a wavefront (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Scratch memory can be accessed in an interleaved manner using buffer
|
|
instructions with the scratch buffer descriptor and per wavefront scratch
|
|
offset, by the scratch instructions, or by flat instructions. Multi-dword
|
|
access is not supported except by flat and scratch instructions in
|
|
GFX9-GFX10.
|
|
|
|
**Constant 32-bit**
|
|
*TODO*
|
|
|
|
**Buffer Fat Pointer**
|
|
The buffer fat pointer is an experimental address space that is currently
|
|
unsupported in the backend. It exposes a non-integral pointer that is in
|
|
the future intended to support the modelling of 128-bit buffer descriptors
|
|
plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
|
|
*pointer*), allowing normal LLVM load/store/atomic operations to be used to
|
|
model the buffer descriptors used heavily in graphics workloads targeting
|
|
the backend.
|
|
|
|
.. _amdgpu-memory-scopes:
|
|
|
|
Memory Scopes
|
|
-------------
|
|
|
|
This section provides LLVM memory synchronization scopes supported by the AMDGPU
|
|
backend memory model when the target triple OS is ``amdhsa`` (see
|
|
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
|
|
|
|
The memory model supported is based on the HSA memory model [HSA]_ which is
|
|
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
|
|
relation is transitive over the synchronizes-with relation independent of scope
|
|
and synchronizes-with allows the memory scope instances to be inclusive (see
|
|
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
|
|
|
|
This is different to the OpenCL [OpenCL]_ memory model which does not have scope
|
|
inclusion and requires the memory scopes to exactly match. However, this
|
|
is conservatively correct for OpenCL.
|
|
|
|
.. table:: AMDHSA LLVM Sync Scopes
|
|
:name: amdgpu-amdhsa-llvm-sync-scopes-table
|
|
|
|
======================= ===================================================
|
|
LLVM Sync Scope Description
|
|
======================= ===================================================
|
|
*none* The default: ``system``.
|
|
|
|
Synchronizes with, and participates in modification
|
|
and seq_cst total orderings with, other operations
|
|
(except image operations) for all address spaces
|
|
(except private, or generic that accesses private)
|
|
provided the other operation's sync scope is:
|
|
|
|
- ``system``.
|
|
- ``agent`` and executed by a thread on the same
|
|
agent.
|
|
- ``workgroup`` and executed by a thread in the
|
|
same work-group.
|
|
- ``wavefront`` and executed by a thread in the
|
|
same wavefront.
|
|
|
|
``agent`` Synchronizes with, and participates in modification
|
|
and seq_cst total orderings with, other operations
|
|
(except image operations) for all address spaces
|
|
(except private, or generic that accesses private)
|
|
provided the other operation's sync scope is:
|
|
|
|
- ``system`` or ``agent`` and executed by a thread
|
|
on the same agent.
|
|
- ``workgroup`` and executed by a thread in the
|
|
same work-group.
|
|
- ``wavefront`` and executed by a thread in the
|
|
same wavefront.
|
|
|
|
``workgroup`` Synchronizes with, and participates in modification
|
|
and seq_cst total orderings with, other operations
|
|
(except image operations) for all address spaces
|
|
(except private, or generic that accesses private)
|
|
provided the other operation's sync scope is:
|
|
|
|
- ``system``, ``agent`` or ``workgroup`` and
|
|
executed by a thread in the same work-group.
|
|
- ``wavefront`` and executed by a thread in the
|
|
same wavefront.
|
|
|
|
``wavefront`` Synchronizes with, and participates in modification
|
|
and seq_cst total orderings with, other operations
|
|
(except image operations) for all address spaces
|
|
(except private, or generic that accesses private)
|
|
provided the other operation's sync scope is:
|
|
|
|
- ``system``, ``agent``, ``workgroup`` or
|
|
``wavefront`` and executed by a thread in the
|
|
same wavefront.
|
|
|
|
``singlethread`` Only synchronizes with and participates in
|
|
modification and seq_cst total orderings with,
|
|
other operations (except image operations) running
|
|
in the same thread for all address spaces (for
|
|
example, in signal handlers).
|
|
|
|
``one-as`` Same as ``system`` but only synchronizes with other
|
|
operations within the same address space.
|
|
|
|
``agent-one-as`` Same as ``agent`` but only synchronizes with other
|
|
operations within the same address space.
|
|
|
|
``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
|
|
other operations within the same address space.
|
|
|
|
``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
|
|
other operations within the same address space.
|
|
|
|
``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
|
|
other operations within the same address space.
|
|
======================= ===================================================
|
|
|
|
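One informal way to read the sync scope table above is that two operations can synchronize only when their threads share a single instance of the narrower of the two sync scopes. The C sketch below encodes that reading. The enum, the ``thread_location`` structure and both helper functions are hypothetical names introduced for illustration only; they are not part of LLVM. The sketch also ignores the ``-one-as`` variants, image operations and private memory.

```c
#include <stdbool.h>

/* Sync scopes ordered from narrowest to widest (illustrative names). */
enum sync_scope {
  SCOPE_SINGLETHREAD,
  SCOPE_WAVEFRONT,
  SCOPE_WORKGROUP,
  SCOPE_AGENT,
  SCOPE_SYSTEM
};

/* Hypothetical description of where a thread executes. */
struct thread_location {
  unsigned agent;
  unsigned workgroup; /* unique within an agent */
  unsigned wavefront; /* unique within a work-group */
  unsigned lane;      /* unique within a wavefront */
};

/* Return the narrower of two sync scopes. */
static enum sync_scope narrower_scope(enum sync_scope a, enum sync_scope b) {
  return a < b ? a : b;
}

/* True if an operation with scope `a` executed by thread `ta` can
   synchronize with an operation with scope `b` executed by thread `tb`,
   under the assumption stated in the text above. */
static bool scopes_can_synchronize(enum sync_scope a, struct thread_location ta,
                                   enum sync_scope b, struct thread_location tb) {
  bool same_agent = ta.agent == tb.agent;
  bool same_workgroup = same_agent && ta.workgroup == tb.workgroup;
  bool same_wavefront = same_workgroup && ta.wavefront == tb.wavefront;
  bool same_thread = same_wavefront && ta.lane == tb.lane;

  switch (narrower_scope(a, b)) {
  case SCOPE_SYSTEM:       return true;
  case SCOPE_AGENT:        return same_agent;
  case SCOPE_WORKGROUP:    return same_workgroup;
  case SCOPE_WAVEFRONT:    return same_wavefront;
  case SCOPE_SINGLETHREAD: return same_thread;
  }
  return false;
}
```

For example, an ``agent`` operation and a ``workgroup`` operation issued from different wavefronts of the same work-group satisfy this check, whereas an ``agent`` operation and a ``wavefront`` operation from different wavefronts do not.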
LLVM IR Intrinsics
|
|
------------------
|
|
|
|
The AMDGPU backend implements the following LLVM IR intrinsics.
|
|
|
|
*This section is WIP.*
|
|
|
|
.. TODO::
|
|
|
|
List AMDGPU intrinsics.
|
|
|
|
LLVM IR Attributes
|
|
------------------
|
|
|
|
The AMDGPU backend supports the following LLVM IR attributes.
|
|
|
|
.. table:: AMDGPU LLVM IR Attributes
|
|
:name: amdgpu-llvm-ir-attributes-table
|
|
|
|
======================================= ==========================================================
|
|
LLVM Attribute Description
|
|
======================================= ==========================================================
|
|
"amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
|
|
will be specified when the kernel is dispatched. Generated
|
|
by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
|
|
The implied default value is 1,1024.
|
|
|
|
"amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
|
|
argument block size for the implicit arguments. This
|
|
varies by OS and language (for OpenCL see
|
|
:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
|
|
"amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
|
|
the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
|
|
"amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
|
|
``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
|
|
"amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
|
|
execution unit. Generated by the ``amdgpu_waves_per_eu``
|
|
CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
|
|
and the backend may not be able to satisfy the request. If
|
|
the specified range is incompatible with the function's
|
|
"amdgpu-flat-work-group-size" value, the occupancy bounds
|
|
implied by the workgroup size take precedence.
|
|
|
|
"amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
|
|
mode register to be set on entry. Overrides the default for
|
|
the calling convention.
|
|
"amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
|
|
the mode register to be set on entry. Overrides the default
|
|
for the calling convention.
|
|
|
|
"amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
|
|
llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
|
|
attribute, or reached through a call site marked with this attribute,
|
|
the value returned by the intrinsic is undefined. The backend can
|
|
generally infer this during code generation, so typically there is no
|
|
benefit to frontends marking functions with this.
|
|
|
|
"amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.workitem.id.y intrinsic.
|
|
|
|
"amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.workitem.id.z intrinsic.
|
|
|
|
"amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.workgroup.id.x intrinsic.
|
|
|
|
"amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.workgroup.id.y intrinsic.
|
|
|
|
"amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.workgroup.id.z intrinsic.
|
|
|
|
"amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.dispatch.ptr intrinsic.
|
|
|
|
"amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.implicitarg.ptr intrinsic.
|
|
|
|
"amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.dispatch.id intrinsic.
|
|
|
|
"amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
|
|
llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
|
|
attributes, the queue pointer may be required in situations where the
|
|
intrinsic call does not directly appear in the program. Some subtargets
|
|
require the queue pointer to handle some addrspacecasts, as well
|
|
as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
|
|
llvm.debug intrinsics.
|
|
|
|
======================================= ==========================================================
|
|
|
|
.. _amdgpu-elf-code-object:
|
|
|
|
ELF Code Object
|
|
===============
|
|
|
|
The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
|
|
can be linked by ``lld`` to produce a standard ELF shared code object which can
|
|
be loaded and executed on an AMDGPU target.
|
|
|
|
.. _amdgpu-elf-header:
|
|
|
|
Header
|
|
------
|
|
|
|
The AMDGPU backend uses the following ELF header:
|
|
|
|
.. table:: AMDGPU ELF Header
|
|
:name: amdgpu-elf-header-table
|
|
|
|
========================== ===============================
|
|
Field Value
|
|
========================== ===============================
|
|
``e_ident[EI_CLASS]`` ``ELFCLASS64``
|
|
``e_ident[EI_DATA]`` ``ELFDATA2LSB``
|
|
``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
|
|
- ``ELFOSABI_AMDGPU_HSA``
|
|
- ``ELFOSABI_AMDGPU_PAL``
|
|
- ``ELFOSABI_AMDGPU_MESA3D``
|
|
``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
|
|
- ``ELFABIVERSION_AMDGPU_HSA_V3``
|
|
- ``ELFABIVERSION_AMDGPU_HSA_V4``
|
|
- ``ELFABIVERSION_AMDGPU_HSA_V5``
|
|
- ``ELFABIVERSION_AMDGPU_PAL``
|
|
- ``ELFABIVERSION_AMDGPU_MESA3D``
|
|
``e_type`` - ``ET_REL``
|
|
- ``ET_DYN``
|
|
``e_machine`` ``EM_AMDGPU``
|
|
``e_entry`` 0
|
|
``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
|
|
:ref:`amdgpu-elf-header-e_flags-table-v3`,
|
|
and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
|
|
========================== ===============================
|
|
|
|
..
|
|
|
|
.. table:: AMDGPU ELF Header Enumeration Values
|
|
:name: amdgpu-elf-header-enumeration-values-table
|
|
|
|
=============================== =====
|
|
Name Value
|
|
=============================== =====
|
|
``EM_AMDGPU`` 224
|
|
``ELFOSABI_NONE`` 0
|
|
``ELFOSABI_AMDGPU_HSA`` 64
|
|
``ELFOSABI_AMDGPU_PAL`` 65
|
|
``ELFOSABI_AMDGPU_MESA3D`` 66
|
|
``ELFABIVERSION_AMDGPU_HSA_V2`` 0
|
|
``ELFABIVERSION_AMDGPU_HSA_V3`` 1
|
|
``ELFABIVERSION_AMDGPU_HSA_V4`` 2
|
|
``ELFABIVERSION_AMDGPU_HSA_V5`` 3
|
|
``ELFABIVERSION_AMDGPU_PAL`` 0
|
|
``ELFABIVERSION_AMDGPU_MESA3D`` 0
|
|
=============================== =====
|
|
|
|
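The header fields and enumeration values in the two tables above are enough to recognize an ``amdgcn`` code object for the ``amdhsa`` OS and to recover its code object version. The following C sketch does this directly on the raw header bytes. The function name and the constant names are hypothetical; the byte offsets are those of the standard ELF64 header, and the numeric values are taken from the enumeration values table.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets into e_ident and values from the tables above. The identifiers
   are local to this sketch and deliberately differ from the <elf.h> names. */
enum {
  IDENT_CLASS = 4,       /* e_ident[EI_CLASS] */
  IDENT_DATA = 5,        /* e_ident[EI_DATA] */
  IDENT_OSABI = 7,       /* e_ident[EI_OSABI] */
  IDENT_ABIVERSION = 8,  /* e_ident[EI_ABIVERSION] */
  CLASS64 = 2,           /* ELFCLASS64 */
  DATA2LSB = 1,          /* ELFDATA2LSB */
  MACHINE_AMDGPU = 224,  /* EM_AMDGPU */
  OSABI_AMDGPU_HSA = 64  /* ELFOSABI_AMDGPU_HSA */
};

/* Return the code object version (2 to 5) of an amdgcn/amdhsa ELF image,
   or 0 if the header does not describe one. */
static int amdhsa_code_object_version(const uint8_t *elf, size_t size) {
  if (size < 64) /* an ELF64 header occupies 64 bytes */
    return 0;
  if (elf[0] != 0x7f || elf[1] != 'E' || elf[2] != 'L' || elf[3] != 'F')
    return 0;
  if (elf[IDENT_CLASS] != CLASS64 || elf[IDENT_DATA] != DATA2LSB)
    return 0;

  /* e_machine is a little-endian 16-bit field at byte offset 18. */
  unsigned e_machine = (unsigned)elf[18] | ((unsigned)elf[19] << 8);
  if (e_machine != MACHINE_AMDGPU || elf[IDENT_OSABI] != OSABI_AMDGPU_HSA)
    return 0;

  /* ELFABIVERSION_AMDGPU_HSA_V2..V5 are encoded as 0..3. */
  unsigned abi = elf[IDENT_ABIVERSION];
  return abi <= 3 ? (int)abi + 2 : 0;
}
```

The ABI version can only be interpreted after the OS ABI has been checked, because ``ELFABIVERSION_AMDGPU_PAL`` and ``ELFABIVERSION_AMDGPU_MESA3D`` share the value 0 with ``ELFABIVERSION_AMDGPU_HSA_V2``.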
``e_ident[EI_CLASS]``
|
|
The ELF class is:
|
|
|
|
* ``ELFCLASS32`` for ``r600`` architecture.
|
|
|
|
* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
|
|
process address space applications.
|
|
|
|
``e_ident[EI_DATA]``
|
|
All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
|
|
|
|
``e_ident[EI_OSABI]``
|
|
One of the following AMDGPU target architecture specific OS ABIs
|
|
(see :ref:`amdgpu-os`):
|
|
|
|
* ``ELFOSABI_NONE`` for *unknown* OS.
|
|
|
|
* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
|
|
|
|
* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
|
|
|
|
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
|
|
|
|
``e_ident[EI_ABIVERSION]``
|
|
The ABI version of the AMDGPU target architecture specific OS ABI to which the code
|
|
object conforms:
|
|
|
|
* ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
|
|
runtime ABI for code object V2. Specify using the Clang option
|
|
``-mcode-object-version=2``.
|
|
|
|
* ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
|
|
runtime ABI for code object V3. Specify using the Clang option
|
|
``-mcode-object-version=3``.
|
|
|
|
* ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
|
|
runtime ABI for code object V4. Specify using the Clang option
|
|
``-mcode-object-version=4``. This is the default code object
|
|
version if not specified.
|
|
|
|
* ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
|
|
runtime ABI for code object V5. Specify using the Clang option
|
|
``-mcode-object-version=5``.
|
|
|
|
* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
|
|
runtime ABI.
|
|
|
|
* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
|
|
3D runtime ABI.
|
|
|
|
``e_type``
|
|
Can be one of the following values:
|
|
|
|
|
|
``ET_REL``
|
|
The type produced by the AMDGPU backend compiler as it is a relocatable code
|
|
object.
|
|
|
|
``ET_DYN``
|
|
The type produced by the linker as it is a shared code object.
|
|
|
|
The AMD HSA runtime loader requires an ``ET_DYN`` code object.
|
|
|
|
``e_machine``
|
|
The value ``EM_AMDGPU`` is used for the machine for all processors supported
|
|
by the ``r600`` and ``amdgcn`` architectures (see
|
|
:ref:`amdgpu-processor-table`). The specific processor is specified in the
|
|
``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
|
|
:ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
|
|
``e_flags`` for code object V3 and above (see
|
|
:ref:`amdgpu-elf-header-e_flags-table-v3` and
|
|
:ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
|
|
|
|
``e_entry``
|
|
The entry point is 0 as the entry points for individual kernels must be
|
|
selected in order to invoke them through AQL packets.
|
|
|
|
``e_flags``
|
|
The AMDGPU backend uses the following ELF header flags:
|
|
|
|
.. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
|
|
:name: amdgpu-elf-header-e_flags-v2-table
|
|
|
|
===================================== ===== =============================
|
|
Name Value Description
|
|
===================================== ===== =============================
|
|
``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
|
|
target feature is
|
|
enabled for all code
|
|
contained in the code object.
|
|
If the processor
|
|
does not support the
|
|
``xnack`` target
|
|
feature then must
|
|
be 0.
|
|
See
|
|
:ref:`amdgpu-target-features`.
|
|
``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
|
|
handler is enabled for all
|
|
code contained in the code
|
|
object. If the processor
|
|
does not support a trap
|
|
handler then must be 0.
|
|
See
|
|
:ref:`amdgpu-target-features`.
|
|
===================================== ===== =============================
|
|
|
|
.. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
|
|
:name: amdgpu-elf-header-e_flags-table-v3
|
|
|
|
================================= ===== =============================
|
|
Name Value Description
|
|
================================= ===== =============================
|
|
``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
|
|
mask for
|
|
``EF_AMDGPU_MACH_xxx`` values
|
|
defined in
|
|
:ref:`amdgpu-ef-amdgpu-mach-table`.
|
|
``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
|
|
target feature is
|
|
enabled for all code
|
|
contained in the code object.
|
|
If the processor
|
|
does not support the
|
|
``xnack`` target
|
|
feature then must
|
|
be 0.
|
|
See
|
|
:ref:`amdgpu-target-features`.
|
|
``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
|
|
target feature is
|
|
enabled for all code
|
|
contained in the code object.
|
|
If the processor
|
|
does not support the
|
|
``sramecc`` target
|
|
feature then must
|
|
be 0.
|
|
See
|
|
:ref:`amdgpu-target-features`.
|
|
================================= ===== =============================
|
|
|
|
.. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
  :name: amdgpu-elf-header-e_flags-table-v4-onwards

  ============================================ ===== ===================================
  Name                                         Value Description
  ============================================ ===== ===================================
  ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
                                                     mask for
                                                     ``EF_AMDGPU_MACH_xxx`` values
                                                     defined in
                                                     :ref:`amdgpu-ef-amdgpu-mach-table`.
  ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
                                                     ``EF_AMDGPU_FEATURE_XNACK_*_V4``
                                                     values.
  ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
  ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
  ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
  ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
  ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
                                                     ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
                                                     values.
  ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
  ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
  ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
  ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
  ============================================ ===== ===================================
|
|
|
|
.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
|
|
:name: amdgpu-ef-amdgpu-mach-table
|
|
|
|
==================================== ========== =============================
|
|
Name Value Description (see
|
|
:ref:`amdgpu-processor-table`)
|
|
==================================== ========== =============================
|
|
``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
|
|
``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
|
|
``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
|
|
``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
|
|
``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
|
|
``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
|
|
``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
|
|
``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
|
|
``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
|
|
``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
|
|
``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
|
|
``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
|
|
``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
|
|
``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
|
|
``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
|
|
``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
|
|
``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
|
|
*reserved* 0x011 - Reserved for ``r600``
|
|
0x01f architecture processors.
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
|
|
*reserved* 0x027 Reserved.
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
|
|
*reserved* 0x040 Reserved.
|
|
*reserved* 0x041 Reserved.
|
|
``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
|
|
*reserved* 0x043 Reserved.
|
|
*reserved* 0x044 Reserved.
|
|
*reserved* 0x045 Reserved.
|
|
==================================== ========== =============================
|
|
|
|
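The ``e_flags`` layout for code object V4 and after packs three independent fields into one word: the processor in the low 8 bits, followed by two 2-bit feature selectors. The following C sketch splits an ``e_flags`` value into those fields using the masks from the tables above. The ``amdgpu_feature_setting`` enum and the three accessor functions are hypothetical helpers, and the sketch assumes the value really comes from a V4 or later code object.

```c
#include <stdint.h>

/* Masks from the e_flags table for code object V4 and after. */
enum {
  EF_AMDGPU_MACH = 0x0ff,
  EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
  EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00
};

/* The four states shared by the XNACK and SRAMECC selectors, in the
   order in which the table encodes them (illustrative names). */
enum amdgpu_feature_setting {
  FEATURE_UNSUPPORTED = 0,
  FEATURE_ANY = 1,
  FEATURE_OFF = 2,
  FEATURE_ON = 3
};

/* Processor selector: one of the EF_AMDGPU_MACH_xxx values. */
static unsigned e_flags_mach(uint32_t e_flags) {
  return e_flags & EF_AMDGPU_MACH;
}

/* XNACK selector: bits 8-9 of e_flags. */
static enum amdgpu_feature_setting e_flags_xnack(uint32_t e_flags) {
  return (enum amdgpu_feature_setting)((e_flags & EF_AMDGPU_FEATURE_XNACK_V4) >> 8);
}

/* SRAMECC selector: bits 10-11 of e_flags. */
static enum amdgpu_feature_setting e_flags_sramecc(uint32_t e_flags) {
  return (enum amdgpu_feature_setting)((e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) >> 10);
}
```

For instance, the value ``0xb3f`` decodes to processor ``0x03f`` (``gfx90a``), XNACK enabled and SRAMECC disabled.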
Sections
|
|
--------
|
|
|
|
An AMDGPU target ELF code object has the standard ELF sections which include:
|
|
|
|
.. table:: AMDGPU ELF Sections
|
|
:name: amdgpu-elf-sections-table
|
|
|
|
================== ================ =================================
|
|
Name Type Attributes
|
|
================== ================ =================================
|
|
``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
|
|
``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
|
|
``.debug_``\ *\** ``SHT_PROGBITS`` *none*
|
|
``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
|
|
``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
|
|
``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
|
|
``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
|
|
``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
|
|
``.note`` ``SHT_NOTE`` *none*
|
|
``.rela``\ *name* ``SHT_RELA`` *none*
|
|
``.rela.dyn`` ``SHT_RELA`` *none*
|
|
``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
|
|
``.shstrtab`` ``SHT_STRTAB`` *none*
|
|
``.strtab`` ``SHT_STRTAB`` *none*
|
|
``.symtab`` ``SHT_SYMTAB`` *none*
|
|
``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
|
|
================== ================ =================================
|
|
|
|
These sections have their standard meanings (see [ELF]_) and are only generated
|
|
if needed.
|
|
|
|
``.debug_``\ *\**
|
|
The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
|
|
information on the DWARF produced by the AMDGPU backend.
|
|
|
|
``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
|
|
The standard sections used by a dynamic loader.
|
|
|
|
``.note``
|
|
See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
|
|
backend.
|
|
|
|
``.rela``\ *name*, ``.rela.dyn``
|
|
For relocatable code objects, *name* is the name of the section to which the
|
|
relocation records apply. For example, ``.rela.text`` is the section name for
|
|
relocation records associated with the ``.text`` section.
|
|
|
|
For linked shared code objects, ``.rela.dyn`` contains all the relocation
|
|
records from each of the relocatable code object's ``.rela``\ *name* sections.
|
|
|
|
See :ref:`amdgpu-relocation-records` for the relocation records supported by
|
|
the AMDGPU backend.
|
|
|
|
``.text``
|
|
The executable machine code for the kernels and functions they call. Generated
|
|
as position independent code. See :ref:`amdgpu-code-conventions` for
|
|
information on conventions used in the ISA generation.
|
|
|
|
.. _amdgpu-note-records:
|
|
|
|
Note Records
|
|
------------
|
|
|
|
The AMDGPU backend code object contains ELF note records in the ``.note``
|
|
section. The set of generated notes and their semantics depend on the code
|
|
object version; see :ref:`amdgpu-note-records-v2` and
|
|
:ref:`amdgpu-note-records-v3-onwards`.
|
|
|
|
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
|
|
must be generated after the ``name`` field to ensure the ``desc`` field is 4
|
|
byte aligned. In addition, minimal zero-byte padding must be generated to
|
|
ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
|
|
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
|
|
alignment.
|
|
|
|
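The padding rules above amount to rounding both the ``name`` and ``desc`` sizes up to the next multiple of 4. The following C sketch computes where the ``desc`` field starts and how many bytes a whole note record occupies. The helper names are hypothetical; the 12-byte header consists of the standard ELF note words ``namesz``, ``descsz`` and ``type``, each 4 bytes wide.

```c
#include <stdint.h>

/* Round n up to the next multiple of 4. */
static uint32_t note_align4(uint32_t n) {
  return (n + 3u) & ~3u;
}

/* Offset of the desc field from the start of a note record. namesz counts
   the bytes of the name including its NUL terminator. */
static uint32_t note_desc_offset(uint32_t namesz) {
  return 12u + note_align4(namesz);
}

/* Total size of a note record, including padding after name and desc. */
static uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
  return note_desc_offset(namesz) + note_align4(descsz);
}
```

With the vendor name "AMDGPU", ``namesz`` is 7, one byte of padding follows the name, and the ``desc`` field starts at offset 20.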
.. _amdgpu-note-records-v2:
|
|
|
|
Code Object V2 Note Records
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. warning::
|
|
Code object V2 is not the default code object version emitted by
|
|
this version of LLVM.
|
|
|
|
The AMDGPU backend code object uses the following ELF note record in the
|
|
``.note`` section when compiling for code object V2.
|
|
|
|
The note record vendor field is "AMD".
|
|
|
|
Additional note records may be present, but any which are not documented here
|
|
are deprecated and should not be used.
|
|
|
|
.. table:: AMDGPU Code Object V2 ELF Note Records
|
|
:name: amdgpu-elf-note-records-v2-table
|
|
|
|
===== ===================================== ======================================
|
|
Name Type Description
|
|
===== ===================================== ======================================
|
|
"AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
|
|
"AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
|
|
Finalizer and not the LLVM compiler.
|
|
"AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
|
|
"AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
|
|
YAML [YAML]_ textual format.
|
|
"AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
|
|
===== ===================================== ======================================
|
|
|
|
..
|
|
|
|
.. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
|
|
:name: amdgpu-elf-note-record-enumeration-values-v2-table
|
|
|
|
===================================== =====
|
|
Name Value
|
|
===================================== =====
|
|
``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
|
|
``NT_AMD_HSA_HSAIL`` 2
|
|
``NT_AMD_HSA_ISA_VERSION`` 3
|
|
*reserved* 4-9
|
|
``NT_AMD_HSA_METADATA`` 10
|
|
``NT_AMD_HSA_ISA_NAME`` 11
|
|
===================================== =====
|
|
|
|
``NT_AMD_HSA_CODE_OBJECT_VERSION``
|
|
Specifies the code object version number. The description field has the
|
|
following layout:
|
|
|
|
.. code:: c

  struct amdgpu_hsa_note_code_object_version_s {
    uint32_t major_version;
    uint32_t minor_version;
  };
|
|
|
|
The ``major_version`` has a value less than or equal to 2.
|
|
|
|
``NT_AMD_HSA_HSAIL``
|
|
Specifies the HSAIL properties used by the HSAIL Finalizer. The description
|
|
field has the following layout:
|
|
|
|
.. code:: c

  struct amdgpu_hsa_note_hsail_s {
    uint32_t hsail_major_version;
    uint32_t hsail_minor_version;
    uint8_t profile;
    uint8_t machine_model;
    uint8_t default_float_round;
  };
|
|
|
|
``NT_AMD_HSA_ISA_VERSION``
|
|
Specifies the target ISA version. The description field has the following layout:
|
|
|
|
.. code:: c

  struct amdgpu_hsa_note_isa_s {
    uint16_t vendor_name_size;
    uint16_t architecture_name_size;
    uint32_t major;
    uint32_t minor;
    uint32_t stepping;
    char vendor_and_architecture_name[1];
  };
|
|
|
|
``vendor_name_size`` and ``architecture_name_size`` are the length of the
|
|
vendor and architecture names respectively, including the NUL character.
|
|
|
|
``vendor_and_architecture_name`` contains the NUL terminated string for the
|
|
vendor, immediately followed by the NUL terminated string for the
|
|
architecture.
|
|
|
|
This note record is used by the HSA runtime loader.
|
|
|
|
Code object V2 only supports a limited number of processors and has fixed
|
|
settings for target features. See
|
|
:ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
|
|
processors and the corresponding target ID. In the table the note record ISA
|
|
name is a concatenation of the vendor name, architecture name, major, minor,
|
|
and stepping separated by a ":".
|
|
|
|
The target ID column shows the processor name and fixed target features used
|
|
by the LLVM compiler. The LLVM compiler does not generate an
|
|
``NT_AMD_HSA_HSAIL`` note record.
|
|
|
|
A code object generated by the Finalizer also uses code object V2 and always
|
|
generates an ``NT_AMD_HSA_HSAIL`` note record. The processor name and
|
|
``sramecc`` target feature are as shown in
|
|
:ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
|
|
target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
|
|
bit.
|
|
|
|
``NT_AMD_HSA_ISA_NAME``
|
|
Specifies the target ISA name as a non-NUL terminated string.
|
|
|
|
This note record is not used by the HSA runtime loader.
|
|
|
|
See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
|
|
V2's limited support of processors and fixed settings for target features.
|
|
|
|
See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
|
|
from the string to the corresponding target ID. If the ``xnack`` target
|
|
feature is supported and enabled, the string produced by the LLVM compiler
|
|
may have ``+xnack`` appended. The Finalizer did not do the appending and
|
|
instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
|
|
|
|
``NT_AMD_HSA_METADATA``
|
|
Specifies extensible metadata associated with the code objects executed on HSA
|
|
[HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
|
|
target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
|
|
metadata string.
|
|
|
|
.. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
|
|
:name: amdgpu-elf-note-record-supported_processors-v2-table
|
|
|
|
===================== ==========================
|
|
Note Record ISA Name Target ID
|
|
===================== ==========================
|
|
``AMD:AMDGPU:6:0:0`` ``gfx600``
|
|
``AMD:AMDGPU:6:0:1`` ``gfx601``
|
|
``AMD:AMDGPU:6:0:2`` ``gfx602``
|
|
``AMD:AMDGPU:7:0:0`` ``gfx700``
|
|
``AMD:AMDGPU:7:0:1`` ``gfx701``
|
|
``AMD:AMDGPU:7:0:2`` ``gfx702``
|
|
``AMD:AMDGPU:7:0:3`` ``gfx703``
|
|
``AMD:AMDGPU:7:0:4`` ``gfx704``
|
|
``AMD:AMDGPU:7:0:5`` ``gfx705``
|
|
``AMD:AMDGPU:8:0:0`` ``gfx802``
|
|
``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
|
|
``AMD:AMDGPU:8:0:2`` ``gfx802``
|
|
``AMD:AMDGPU:8:0:3`` ``gfx803``
|
|
``AMD:AMDGPU:8:0:4`` ``gfx803``
|
|
``AMD:AMDGPU:8:0:5`` ``gfx805``
|
|
``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
|
|
``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
|
|
``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
|
|
``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
|
|
``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
|
|
``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
|
|
``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
|
|
``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
|
|
``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
|
|
``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
|
|
===================== ==========================
|
|
|
|
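The ``NT_AMD_HSA_ISA_VERSION`` description layout and the note record ISA names in the table above are related as follows: the two NUL terminated names and the three version numbers are joined with ":". The C sketch below reads a description field and produces that string. The ``isa_note_header`` structure and the ``format_isa_name`` function are hypothetical, and the sketch assumes a little-endian host so that the fixed-size fields can be copied directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fixed-size part of amdgpu_hsa_note_isa_s that precedes the names. */
struct isa_note_header {
  uint16_t vendor_name_size;       /* includes the NUL character */
  uint16_t architecture_name_size; /* includes the NUL character */
  uint32_t major;
  uint32_t minor;
  uint32_t stepping;
};

/* Write "vendor:architecture:major:minor:stepping" into out.
   Returns 0 on success, or -1 if desc is malformed or out is too small. */
static int format_isa_name(const uint8_t *desc, size_t desc_size,
                           char *out, size_t out_size) {
  struct isa_note_header h;
  if (desc_size < sizeof h)
    return -1;
  memcpy(&h, desc, sizeof h); /* assumes a little-endian host */

  if (h.vendor_name_size == 0 || h.architecture_name_size == 0)
    return -1;
  if (desc_size - sizeof h <
      (size_t)h.vendor_name_size + h.architecture_name_size)
    return -1;

  /* The architecture name immediately follows the vendor name. */
  const char *vendor = (const char *)desc + sizeof h;
  const char *arch = vendor + h.vendor_name_size;
  if (vendor[h.vendor_name_size - 1] != '\0' ||
      arch[h.architecture_name_size - 1] != '\0')
    return -1;

  int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                   (unsigned)h.major, (unsigned)h.minor, (unsigned)h.stepping);
  return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A description holding the names "AMD" and "AMDGPU" with version 9, 0, 6 yields ``AMD:AMDGPU:9:0:6``, which the table maps to the target ID ``gfx906:sramecc-:xnack-``.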
.. _amdgpu-note-records-v3-onwards:
|
|
|
|
Code Object V3 and Above Note Records
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The AMDGPU backend code object uses the following ELF note record in the
|
|
``.note`` section when compiling for code object V3 and above.
|
|
|
|
The note record vendor field is "AMDGPU".
|
|
|
|
Additional note records may be present, but any which are not documented here
|
|
are deprecated and should not be used.
|
|
|
|
.. table:: AMDGPU Code Object V3 and Above ELF Note Records
|
|
:name: amdgpu-elf-note-records-table-v3-onwards
|
|
|
|
======== ============================== ======================================
|
|
Name Type Description
|
|
======== ============================== ======================================
|
|
"AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
|
|
binary format.
|
|
======== ============================== ======================================
|
|
|
|
..
|
|
|
|
.. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
|
|
:name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
|
|
|
|
============================== =====
|
|
Name Value
|
|
============================== =====
|
|
*reserved* 0-31
|
|
``NT_AMDGPU_METADATA`` 32
|
|
============================== =====
|
|
|
|
``NT_AMDGPU_METADATA``
|
|
Specifies extensible metadata associated with an AMDGPU code object. It is
|
|
encoded as a map in the Message Pack [MsgPack]_ binary data format. See
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
|
|
``amdhsa`` OS.
|
|
|
|
.. _amdgpu-symbols:
|
|
|
|
Symbols
|
|
-------
|
|
|
|
Symbols include the following:
|
|
|
|
.. table:: AMDGPU ELF Symbols
|
|
:name: amdgpu-elf-symbols-table
|
|
|
|
===================== ================== ================ ==================
|
|
Name Type Section Description
|
|
===================== ================== ================ ==================
|
|
*link-name* ``STT_OBJECT`` - ``.data`` Global variable
|
|
- ``.rodata``
|
|
- ``.bss``
|
|
*link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
|
|
*link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
|
|
*link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
|
|
===================== ================== ================ ==================
|
|
|
|
Global variable
|
|
Global variables both used and defined by the compilation unit.
|
|
|
|
If the symbol is defined in the compilation unit then it is allocated in the
|
|
appropriate section according to whether it has initialized data or is read-only.
|
|
|
|
If the symbol is external then its section is ``SHN_UNDEF`` and the loader
|
|
will resolve relocations using the definition provided by another code object
|
|
or explicitly defined by the runtime.
|
|
|
|
If the symbol resides in local/group memory (LDS) then its section is the
|
|
special processor specific section name ``SHN_AMDGPU_LDS``, and the
|
|
``st_value`` field describes alignment requirements as it does for common
|
|
symbols.
|
|
|
|
.. TODO::
|
|
|
|
Add description of linked shared object symbols. Seems undefined symbols
|
|
are marked as STT_NOTYPE.
|
|
|
|
Kernel descriptor
|
|
Every HSA kernel has an associated kernel descriptor. It is the address of the
|
|
kernel descriptor that is used in the AQL dispatch packet used to invoke the
|
|
kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
|
|
defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
|
|
|
|
Kernel entry point
|
|
Every HSA kernel also has a symbol for its machine code entry point.
|
|
|
|
.. _amdgpu-relocation-records:
|
|
|
|
Relocation Records
|
|
------------------
|
|
|
|
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
|
|
relocatable fields are:
|
|
|
|
``word32``
|
|
This specifies a 32-bit field occupying 4 bytes with arbitrary byte
|
|
alignment. These values use the same byte order as other word values in the
|
|
AMDGPU architecture.
|
|
|
|
``word64``
|
|
This specifies a 64-bit field occupying 8 bytes with arbitrary byte
|
|
alignment. These values use the same byte order as other word values in the
|
|
AMDGPU architecture.
|
|
|
|
The following notations are used for specifying relocation calculations:
|
|
|
|
**A**
|
|
Represents the addend used to compute the value of the relocatable field.
|
|
|
|
**G**
|
|
Represents the offset into the global offset table at which the relocation
|
|
entry's symbol will reside during execution.
|
|
|
|
**GOT**
|
|
Represents the address of the global offset table.
|
|
|
|
**P**
|
|
Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
|
|
of the storage unit being relocated (computed using ``r_offset``).
|
|
|
|
**S**
|
|
Represents the value of the symbol whose index resides in the relocation
|
|
entry. Relocations not using this must specify a symbol index of
|
|
``STN_UNDEF``.
|
|
|
|
**B**
|
|
Represents the base address of a loaded executable or shared object which is
|
|
the difference between the ELF address and the actual load address.
|
|
Relocations using this are only valid in executable or shared objects.
|
|
|
|
The following relocation types are supported:
|
|
|
|
.. table:: AMDGPU ELF Relocation Records
  :name: amdgpu-elf-relocation-records-table

  ========================== ======= ===== ========== ==============================
  Relocation Type            Kind    Value Field      Calculation
  ========================== ======= ===== ========== ==============================
  ``R_AMDGPU_NONE``                  0     *none*     *none*
  ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                             Dynamic
  ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                             Dynamic
  ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                             Dynamic
  ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
  ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
  ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                             Dynamic
  ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
  ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
  ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
  ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
  ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
  *reserved*                         12
  ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
  ``R_AMDGPU_REL16``         Static  14    ``word16`` ((S + A - P) - 4) / 4
  ========================== ======= ===== ========== ==============================
|
|
|
|
``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
|
|
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
|
|
|
|
There is no current OS loader support for 32-bit programs and so
|
|
``R_AMDGPU_ABS32`` is not used.
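
The calculations in the table above can be sketched in Python as follows. This is only an illustration, not the linker or loader implementation: the function name ``relocate`` and its keyword arguments are hypothetical, the arguments mirror the notation (**S**, **A**, **P**, **G**, **GOT**, **B**), and each result is truncated to the width of the relocated field. For ``R_AMDGPU_REL16`` the sketch assumes the displacement is a multiple of 4.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocate(rtype, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the value stored in the relocated field for one relocation type.

    Hypothetical helper: intermediate values are kept as 64-bit two's
    complement numbers, then truncated to the field width.
    """
    abs_ = (S + A) & MASK64            # absolute address of the target
    rel = (S + A - P) & MASK64         # target relative to the place
    got = (G + GOT + A - P) & MASK64   # GOT entry relative to the place
    values = {
        "R_AMDGPU_ABS32_LO": abs_ & MASK32,
        "R_AMDGPU_ABS32_HI": (abs_ >> 32) & MASK32,
        "R_AMDGPU_ABS64": abs_,
        "R_AMDGPU_REL32": rel & MASK32,
        "R_AMDGPU_REL64": rel,
        "R_AMDGPU_ABS32": abs_ & MASK32,
        "R_AMDGPU_GOTPCREL": got & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": got & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": (got >> 32) & MASK32,
        "R_AMDGPU_REL32_LO": rel & MASK32,
        "R_AMDGPU_REL32_HI": (rel >> 32) & MASK32,
        "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
        # Signed displacement in units of 4 bytes, measured from P + 4.
        "R_AMDGPU_REL16": (((S + A - P) - 4) // 4) & MASK16,
    }
    return values[rtype]
```

For example, a backward reference (``S`` below ``P``) yields an all-ones high half for ``R_AMDGPU_REL32_HI``, because the 64-bit difference is negative in two's complement.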
|
|
|
|
.. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
|
|
|
|
Loaded Code Object Path Uniform Resource Identifier (URI)
|
|
---------------------------------------------------------
|
|
|
|
The AMD GPU code object loader represents the path of the ELF shared object from
|
|
which the code object was loaded as a textual Uniform Resource Identifier (URI).
|
|
Note that the code object is the in-memory, loaded, relocated form of the ELF
|
|
shared object. Multiple code objects may be loaded at different memory
|
|
addresses in the same process from the same ELF shared object.
|
|
|
|
The loaded code object path URI syntax is defined by the following BNF syntax:
|
|
|
|
.. code::
|
|
|
|
code_object_uri ::= file_uri | memory_uri
|
|
file_uri        ::= "file://" file_path [ range_specifier ]
|
|
memory_uri      ::= "memory://" process_id range_specifier
|
|
range_specifier ::= ( "#" | "?" ) "offset=" number "&" "size=" number
|
|
file_path       ::= URI_ENCODED_OS_FILE_PATH
|
|
process_id      ::= DECIMAL_NUMBER
|
|
number          ::= HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
|
|
|
|
**number**
|
|
Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
|
|
and octal values by "0".
|
|
|
|
**file_path**
|
|
Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
|
|
every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
|
|
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
|
|
the path are separated by "/".
|
|
|
|
**offset**
|
|
Is a 0-based byte offset to the start of the code object. For a file URI, it
|
|
is from the start of the file specified by the ``file_path``, and if omitted
|
|
defaults to 0. For a memory URI, it is the memory address and is required.
|
|
|
|
**size**
|
|
Is the number of bytes in the code object. For a file URI, if omitted it
|
|
defaults to the size of the file. It is required for a memory URI.
|
|
|
|
**process_id**
|
|
Is the identity of the process owning the memory. For Linux it is the C
|
|
unsigned integral decimal literal for the process ID (PID).
|
|
|
|
For example:
|
|
|
|
.. code::
|
|
|
|
file:///dir1/dir2/file1
|
|
file:///dir3/dir4/file2#offset=0x2000&size=3000
|
|
memory://1234#offset=0x20000&size=3000
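
A consumer of these URIs could parse them along the following lines. This is a minimal sketch rather than the loader's actual code: the names ``CodeObjectURI``, ``parse_number``, ``parse_code_object_uri`` and ``encode_file_path`` are hypothetical, and a missing file ``size`` is reported as ``None`` to mean "the size of the file" because the sketch does not touch the file system.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import quote, unquote


class CodeObjectURI(NamedTuple):
    kind: str            # "file" or "memory"
    target: str          # decoded file path, or the process ID digits
    offset: int          # byte offset (file) or memory address (memory)
    size: Optional[int]  # None means "up to the size of the file"


_URI = re.compile(
    r"(?P<kind>file|memory)://(?P<target>[^#?]*)"
    r"(?:[#?]offset=(?P<offset>\w+)&size=(?P<size>\w+))?"
)


def parse_number(text: str) -> int:
    """Parse a C integral literal: 0x/0X is hex, a leading 0 is octal."""
    if text[:2].lower() == "0x":
        return int(text, 16)
    if len(text) > 1 and text[0] == "0":
        return int(text, 8)
    return int(text, 10)


def parse_code_object_uri(uri: str) -> CodeObjectURI:
    m = _URI.fullmatch(uri)
    if m is None:
        raise ValueError(f"not a code object URI: {uri!r}")
    has_range = m["offset"] is not None
    if m["kind"] == "memory":
        # A memory URI needs a decimal process ID and an explicit range.
        if not has_range or not m["target"].isdigit():
            raise ValueError(f"malformed memory URI: {uri!r}")
        target = m["target"]
    else:
        target = unquote(m["target"])  # undo the %XX URI encoding
    offset = parse_number(m["offset"]) if has_range else 0
    size = parse_number(m["size"]) if has_range else None
    return CodeObjectURI(m["kind"], target, offset, size)


def encode_file_path(os_path: str) -> str:
    """URI-encode a path, leaving only [a-zA-Z0-9/_.~-] unescaped."""
    return quote(os_path, safe="/")
```

``urllib.parse.quote`` encodes the UTF-8 bytes of every character outside the unreserved set using uppercase hexadecimal digits, which matches the ``file_path`` rule given above.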
|
|
|
|
.. _amdgpu-dwarf-debug-information:
|
|
|
|
DWARF Debug Information
|
|
=======================
|
|
|
|
.. warning::
|
|
|
|
This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
|
|
is not currently fully implemented and is subject to change.
|
|
|
|
AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
|
|
:ref:`amdgpu-elf-code-object`) which contain information that maps the code
|
|
object executable code and data to the source language constructs. It can be
|
|
used by tools such as debuggers and profilers. It uses features defined in
|
|
:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
|
|
DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
|
|
|
|
This section defines the AMDGPU target architecture specific DWARF mappings.
|
|
|
|
.. _amdgpu-dwarf-register-identifier:
|
|
|
|
Register Identifier
|
|
-------------------
|
|
|
|
This section defines the AMDGPU target architecture register numbers used in
|
|
DWARF operation expressions (see DWARF Version 5 section 2.5 and
|
|
:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
|
|
instructions (see DWARF Version 5 section 6.4 and
|
|
:ref:`amdgpu-dwarf-call-frame-information`).
|
|
|
|
A single code object can contain code for kernels that have different wavefront
|
|
sizes. The vector registers and some scalar registers are based on the wavefront
|
|
size. AMDGPU defines distinct DWARF registers for each wavefront size. This
|
|
simplifies the consumer of the DWARF so that each register has a fixed size,
|
|
rather than being dynamic according to the wavefront size mode. Similarly,
|
|
distinct DWARF registers are defined for those registers that vary in size
|
|
according to the process address size. This allows a consumer to treat a
|
|
specific AMDGPU processor as a single architecture regardless of how it is
|
|
configured at run time. The compiler explicitly specifies the DWARF registers
|
|
that match the mode in which the code it is generating will be executed.
|
|
|
|
DWARF registers are encoded as numbers, which are mapped to architecture
|
|
registers. The mapping for AMDGPU is defined in
|
|
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
|
|
mapping.
|
|
|
|
.. table:: AMDGPU DWARF Register Mapping
|
|
:name: amdgpu-dwarf-register-mapping-table
|
|
|
|
============== ================= ======== ==================================
|
|
DWARF Register AMDGPU Register Bit Size Description
|
|
============== ================= ======== ==================================
|
|
0 PC_32 32 Program Counter (PC) when
|
|
executing in a 32-bit process
|
|
address space. Used in the CFI to
|
|
describe the PC of the calling
|
|
frame.
|
|
1 EXEC_MASK_32 32 Execution Mask Register when
|
|
executing in wavefront 32 mode.
|
|
2-15 *Reserved* *Reserved for highly accessed
|
|
registers using DWARF shortcut.*
|
|
16 PC_64 64 Program Counter (PC) when
|
|
executing in a 64-bit process
|
|
address space. Used in the CFI to
|
|
describe the PC of the calling
|
|
frame.
|
|
17 EXEC_MASK_64 64 Execution Mask Register when
|
|
executing in wavefront 64 mode.
|
|
18-31 *Reserved* *Reserved for highly accessed
|
|
registers using DWARF shortcut.*
|
|
32-95 SGPR0-SGPR63 32 Scalar General Purpose
|
|
Registers.
|
|
96-127 *Reserved* *Reserved for frequently accessed
|
|
registers using DWARF 1-byte ULEB.*
|
|
128 STATUS 32 Status Register.
|
|
129-511 *Reserved* *Reserved for future Scalar
|
|
Architectural Registers.*
|
|
512 VCC_32 32 Vector Condition Code Register
|
|
when executing in wavefront 32
|
|
mode.
|
|
513-767 *Reserved* *Reserved for future Vector
|
|
Architectural Registers when
|
|
executing in wavefront 32 mode.*
|
|
768 VCC_64 64 Vector Condition Code Register
|
|
when executing in wavefront 64
|
|
mode.
|
|
769-1023 *Reserved* *Reserved for future Vector
|
|
Architectural Registers when
|
|
executing in wavefront 64 mode.*
|
|
1024-1087 *Reserved* *Reserved for padding.*
|
|
1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
|
|
1130-1535 *Reserved* *Reserved for future Scalar
|
|
General Purpose Registers.*
|
|
1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
|
|
when executing in wavefront 32
|
|
mode.
|
|
1792-2047 *Reserved* *Reserved for future Vector
|
|
General Purpose Registers when
|
|
executing in wavefront 32 mode.*
|
|
2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
|
|
when executing in wavefront 32
|
|
mode.
|
|
2304-2559 *Reserved* *Reserved for future Vector
|
|
Accumulation Registers when
|
|
executing in wavefront 32 mode.*
|
|
2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
|
|
when executing in wavefront 64
|
|
mode.
|
|
2816-3071 *Reserved* *Reserved for future Vector
|
|
General Purpose Registers when
|
|
executing in wavefront 64 mode.*
|
|
3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
|
|
when executing in wavefront 64
|
|
mode.
|
|
3328-3583 *Reserved* *Reserved for future Vector
|
|
Accumulation Registers when
|
|
executing in wavefront 64 mode.*
|
|
============== ================= ======== ==================================
|
|
|
|
The vector registers are represented as the full size for the wavefront. They
|
|
are organized as consecutive dwords (32 bits each), one per lane, with the dword at
|
|
the least significant bit position corresponding to lane 0 and so forth. DWARF
|
|
location expressions involving the ``DW_OP_LLVM_offset`` and
|
|
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
|
|
register corresponding to the lane that is executing the current thread of
|
|
execution in languages that are implemented using a SIMD or SIMT execution
|
|
model.
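
The lane layout described above can be illustrated with a short Python sketch. It is not a DWARF expression evaluator; it only shows which bytes of a full-size vector register belong to a given lane, which is the part a location expression would select using ``DW_OP_LLVM_push_lane`` and ``DW_OP_LLVM_offset``. The function name ``lane_dword`` is hypothetical, and the register contents are assumed to be supplied as little-endian bytes.

```python
def lane_dword(register_bytes: bytes, lane: int) -> int:
    """Return the 32-bit value held by one lane of a vector register.

    The register is wavefront-size consecutive dwords, with lane 0 in the
    least significant dword.
    """
    wavefront_size = len(register_bytes) // 4
    if not 0 <= lane < wavefront_size:
        raise IndexError(f"lane {lane} is outside the wavefront")
    start = lane * 4  # byte offset of this lane's dword
    return int.from_bytes(register_bytes[start:start + 4], "little")
```

A wavefront 32 register is therefore 128 bytes and a wavefront 64 register is 256 bytes, matching the ``32*32`` and ``64*32`` bit sizes in the table.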
|
|
|
|
If the wavefront size is 32 lanes then the wavefront 32 mode register
|
|
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
|
|
mode register definitions are used. Some AMDGPU targets support executing in
|
|
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
|
|
to the wavefront mode of the generated code will be used.
|
|
|
|
If code is generated to execute in a 32-bit process address space, then the
|
|
32-bit process address space register definitions are used. If code is generated
|
|
to execute in a 64-bit process address space, then the 64-bit process address
|
|
space register definitions are used. The ``amdgcn`` target only supports the
|
|
64-bit process address space.
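
As an illustration of how the mode-specific numbering is selected, the following Python sketch maps an architectural register to its DWARF register number using the values in :ref:`amdgpu-dwarf-register-mapping-table`. The function name ``dwarf_register`` and its parameters are hypothetical and are not part of LLVM.

```python
def dwarf_register(name, index=0, wavefront_size=64, address_bits=64):
    """Map an AMDGPU register to its DWARF register number.

    name is one of "PC", "EXEC", "STATUS", "VCC", "SGPR", "VGPR", "AGPR";
    index is only used for the general purpose register files.
    """
    if name == "PC":
        return {32: 0, 64: 16}[address_bits]       # PC_32 / PC_64
    if name == "EXEC":
        return {32: 1, 64: 17}[wavefront_size]     # EXEC_MASK_32 / _64
    if name == "STATUS":
        return 128
    if name == "VCC":
        return {32: 512, 64: 768}[wavefront_size]  # VCC_32 / VCC_64
    if name == "SGPR":
        if 0 <= index <= 63:
            return 32 + index                      # SGPR0-SGPR63
        if 64 <= index <= 105:
            return 1088 + (index - 64)             # SGPR64-SGPR105
    if name == "VGPR" and 0 <= index <= 255:
        return {32: 1536, 64: 2560}[wavefront_size] + index
    if name == "AGPR" and 0 <= index <= 255:
        return {32: 2048, 64: 3072}[wavefront_size] + index
    raise ValueError(f"no DWARF register for {name}{index}")
```

Note that the scalar general purpose registers are split across two ranges, so ``SGPR63`` and ``SGPR64`` are not adjacent in the DWARF numbering.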
|
|
|
|
.. _amdgpu-dwarf-address-class-identifier:
|
|
|
|
Address Class Identifier
|
|
------------------------
|
|
|
|
The DWARF address class represents the source language memory space. See DWARF
|
|
Version 5 section 2.12 which is updated by the *DWARF Extensions For
|
|
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
|
|
|
|
The DWARF address class mapping used for AMDGPU is defined in
|
|
:ref:`amdgpu-dwarf-address-class-mapping-table`.
|
|
|
|
.. table:: AMDGPU DWARF Address Class Mapping
|
|
:name: amdgpu-dwarf-address-class-mapping-table
|
|
|
|
========================= ====== =================
|
|
DWARF AMDGPU
|
|
-------------------------------- -----------------
|
|
Address Class Name Value Address Space
|
|
========================= ====== =================
|
|
``DW_ADDR_none`` 0x0000 Generic (Flat)
|
|
``DW_ADDR_LLVM_global`` 0x0001 Global
|
|
``DW_ADDR_LLVM_constant`` 0x0002 Global
|
|
``DW_ADDR_LLVM_group`` 0x0003 Local (group/LDS)
|
|
``DW_ADDR_LLVM_private`` 0x0004 Private (Scratch)
|
|
``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
|
|
========================= ====== =================
|
|
|
|
The DWARF address class values defined in the *DWARF Extensions For
|
|
Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
|
|
|
|
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
|
|
available for use for the AMD extension for access to the hardware GDS memory
|
|
which is scratchpad memory allocated per device.
|
|
|
|
For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
|
|
address class of ``DW_ADDR_none`` is used.
|
|
|
|
See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
|
|
mapping of DWARF address classes to DWARF address spaces, including address size
|
|
and NULL value.
|
|
|
|
.. _amdgpu-dwarf-address-space-identifier:
|
|
|
|
Address Space Identifier
|
|
------------------------
|
|
|
|
DWARF address spaces correspond to target architecture specific linear
|
|
addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
|
|
For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
|
|
|
|
The DWARF address space mapping used for AMDGPU is defined in
|
|
:ref:`amdgpu-dwarf-address-space-mapping-table`.
|
|
|
|
.. table:: AMDGPU DWARF Address Space Mapping
|
|
:name: amdgpu-dwarf-address-space-mapping-table
|
|
|
|
======================================= ===== ======= ======== ================= =======================
|
|
DWARF AMDGPU Notes
|
|
--------------------------------------- ----- ---------------- ----------------- -----------------------
|
|
Address Space Name Value Address Bit Size Address Space
|
|
--------------------------------------- ----- ------- -------- ----------------- -----------------------
|
|
.. 64-bit 32-bit
|
|
process process
|
|
address address
|
|
space space
|
|
======================================= ===== ======= ======== ================= =======================
|
|
``DW_ASPACE_none`` 0x00 64 32 Global *default address space*
|
|
``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
|
|
``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
|
|
``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
|
|
*Reserved* 0x04
|
|
``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
|
|
``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
|
|
======================================= ===== ======= ======== ================= =======================
|
|
|
|
See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
|
|
including address size and NULL value.
|
|
|
|
The ``DW_ASPACE_none`` address space is the default target architecture address
|
|
space used in DWARF operations that do not specify an address space. It
|
|
therefore has to map to the global address space so that the ``DW_OP_addr*`` and
|
|
related operations can refer to addresses in the program code.
|
|
|
|
The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
|
|
specify the flat address space. If the address corresponds to an address in the
|
|
local address space, then it corresponds to the wavefront that is executing the
|
|
focused thread of execution. If the address corresponds to an address in the
|
|
private address space, then it corresponds to the lane that is executing the
|
|
focused thread of execution for languages that are implemented using a SIMD or
|
|
SIMT execution model.
|
|
|
|
.. note::
|
|
|
|
CUDA-like languages such as HIP that do not have address spaces in the
|
|
language type system, but do allow variables to be allocated in different
|
|
address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
|
|
address space in the DWARF expression operations as the default address space
|
|
is the global address space.
|
|
|
|
The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
|
|
specify the local address space corresponding to the wavefront that is executing
|
|
the focused thread of execution.
|
|
|
|
The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
|
|
to specify the private address space corresponding to the lane that is executing
|
|
the focused thread of execution for languages that are implemented using a SIMD
|
|
or SIMT execution model.
|
|
|
|
The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
|
|
to specify the unswizzled private address space corresponding to the wavefront
|
|
that is executing the focused thread of execution. The wavefront view of private
|
|
memory is the per wavefront unswizzled backing memory layout defined in
|
|
:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
|
|
location for the backing memory of the wavefront (namely the address is not
|
|
offset by ``wavefront-scratch-base``). The following formula can be used to
|
|
convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
|
|
``DW_ASPACE_AMDGPU_private_wave`` address:
|
|
|
|
::
|
|
|
|
private-address-wavefront =
|
|
((private-address-lane / 4) * wavefront-size * 4) +
|
|
(wavefront-lane-id * 4) + (private-address-lane % 4)
|
|
|
|
If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
|
|
of the dwords for each lane starting with lane 0 is required, then this
|
|
simplifies to:
|
|
|
|
::
|
|
|
|
private-address-wavefront =
|
|
private-address-lane * wavefront-size
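
Both formulas can be checked with a small Python sketch. The function name ``private_lane_to_wave`` is hypothetical; its parameters correspond to ``private-address-lane``, ``wavefront-lane-id`` and ``wavefront-size`` above.

```python
def private_lane_to_wave(private_address_lane, wavefront_lane_id,
                         wavefront_size):
    """Convert a private lane address to a private wavefront address."""
    dword_index = private_address_lane // 4     # which dword of the lane
    byte_in_dword = private_address_lane % 4    # byte within that dword
    return (dword_index * wavefront_size * 4    # start of that dword row
            + wavefront_lane_id * 4             # this lane's dword in the row
            + byte_in_dword)


# For a dword-aligned address and lane 0, the general formula reduces to
# private-address-lane * wavefront-size.
for address in (0, 4, 8, 256):
    for size in (32, 64):
        assert private_lane_to_wave(address, 0, size) == address * size
```

For instance, with a wavefront size of 32, byte 9 of lane 3 lies in dword row 2, giving ``2 * 32 * 4 + 3 * 4 + 1``.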
|
|
|
|
A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
|
|
complete spilled vector register back into a complete vector register in the
|
|
CFI. The frame pointer can be a private lane address which is dword aligned,
|
|
which can be shifted to multiply by the wavefront size, and then used to form a
|
|
private wavefront address that gives a location for a contiguous set of dwords,
|
|
one per lane, where the vector register dwords are spilled. The compiler knows
|
|
the wavefront size since it generates the code. Note that the type of the
|
|
address may have to be converted as the size of a
|
|
``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
|
|
``DW_ASPACE_AMDGPU_private_wave`` address.
|
|
|
|
.. _amdgpu-dwarf-lane-identifier:
|
|
|
|
Lane Identifier
|
|
---------------
|
|
|
|
DWARF lane identifiers specify a target architecture lane position for hardware
|
|
that executes in a SIMD or SIMT manner, and on which a source language maps its
|
|
threads of execution onto those lanes. The DWARF lane identifier is pushed by
|
|
the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
|
|
section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
|
|
section :ref:`amdgpu-dwarf-operation-expressions`.
|
|
|
|
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
|
|
wavefront. It is numbered from 0 to the wavefront size minus 1.
|
|
|
|
Operation Expressions
|
|
---------------------
|
|
|
|
DWARF expressions are used to compute program values and the locations of
|
|
program objects. See DWARF Version 5 section 2.5 and
|
|
:ref:`amdgpu-dwarf-operation-expressions`.
|
|
|
|
DWARF location descriptions describe how to access storage which includes memory
|
|
and registers. When accessing storage on AMDGPU, bytes are ordered with least
|
|
significant bytes first, and bits are ordered within bytes with least
|
|
significant bits first.
|
|
|
|
For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
|
|
unwinding vector registers that are spilled under the execution mask to memory:
|
|
the zero-single location description is the vector register, and the one-single
|
|
location description is the spilled memory location description. The
|
|
``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
|
|
memory location description.
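
The per-lane selection performed by ``DW_OP_LLVM_select_bit_piece`` can be modeled roughly as follows. This sketch makes simplifying assumptions: location descriptions are represented as plain Python lists holding one value per lane, and a mask bit of 0 is taken to select the zero-location piece while a mask bit of 1 selects the one-location piece, as described in *DWARF Extensions For Heterogeneous Debugging*. The function name is hypothetical.

```python
def select_bit_piece(zero_location, one_location, mask, piece_count):
    """Build a composite with one piece per lane, chosen by the mask bits.

    Lane i takes its piece from one_location when bit i of mask is set,
    and from zero_location otherwise.
    """
    return [
        one_location[i] if (mask >> i) & 1 else zero_location[i]
        for i in range(piece_count)
    ]


# Unwinding a vector register spilled under the execution mask: lanes whose
# EXEC bit was set were saved to memory, the rest still live in the register.
register_lanes = [10, 11, 12, 13]
spilled_lanes = [20, 21, 22, 23]
exec_mask = 0b0101
unwound = select_bit_piece(register_lanes, spilled_lanes, exec_mask, 4)
```

Here ``unwound`` takes lanes 0 and 2 from the spill slot and lanes 1 and 3 from the register.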
|
|
|
|
In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
|
|
``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
|
|
controlled by the execution mask. An undefined location description together
|
|
with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
|
|
to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
|
|
|
|
Debugger Information Entry Attributes
|
|
-------------------------------------
|
|
|
|
This section describes how certain debugger information entry attributes are
|
|
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
|
|
which are updated by *DWARF Extensions For Heterogeneous Debugging* section
|
|
:ref:`amdgpu-dwarf-low-level-information` and
|
|
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
|
|
|
|
.. _amdgpu-dwarf-dw-at-llvm-lane-pc:
|
|
|
|
``DW_AT_LLVM_lane_pc``
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
|
|
location of the separate lanes of a SIMT thread.
|
|
|
|
If the lane is an active lane then this will be the same as the current program
|
|
location.
|
|
|
|
If the lane is inactive, but was active on entry to the subprogram, then this is
|
|
the program location in the subprogram at which execution of the lane is
|
|
conceptually positioned.
|
|
|
|
If the lane was not active on entry to the subprogram, then this will be the
|
|
undefined location. A client debugger can check if the lane is part of a valid
|
|
work-group by checking that the lane is in the range of the associated
|
|
work-group within the grid, accounting for partial work-groups. If it is not,
|
|
then the debugger can omit any information for the lane. Otherwise, the debugger
|
|
may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
|
|
calling subprogram until it finds a non-undefined location. Conceptually the
|
|
lane only has the call frames that it has a non-undefined
|
|
``DW_AT_LLVM_lane_pc``.
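
The unwinding search described above might look like the following in a debugger. This is a sketch under stated assumptions: each call frame is represented by the already-evaluated ``DW_AT_LLVM_lane_pc`` vector of its subprogram, with ``None`` standing for the undefined location, frames are ordered innermost first, and the work-group range check is assumed to have been done by the caller. The function name is hypothetical.

```python
def first_defined_lane_pc(frames, lane):
    """Find the innermost frame in which the lane has a defined location.

    frames: list of per-frame lane PC vectors, innermost frame first.
    Returns (frame_depth, pc), or None if the lane has no defined location
    in any frame.
    """
    for depth, lane_pcs in enumerate(frames):
        pc = lane_pcs[lane]
        if pc is not None:
            return depth, pc
    return None


# Lane 1 was inactive on entry to the innermost subprogram, so its
# conceptual location is found in the caller's frame.
frames = [
    [0x100, None],    # innermost frame
    [0x200, 0x240],   # calling frame
]
```

Frames skipped by this search are frames the lane conceptually does not have.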
|
|
|
|
The following example illustrates how the AMDGPU backend can generate a DWARF
|
|
location list expression for the nested ``IF/THEN/ELSE`` structures of the
|
|
following subprogram pseudo code for a target with 64 lanes per wavefront.
|
|
|
|
.. code::
|
|
:number-lines:
|
|
|
|
SUBPROGRAM X
|
|
BEGIN
|
|
a;
|
|
IF (c1) THEN
|
|
b;
|
|
IF (c2) THEN
|
|
c;
|
|
ELSE
|
|
d;
|
|
ENDIF
|
|
e;
|
|
ELSE
|
|
f;
|
|
ENDIF
|
|
g;
|
|
END
|
|
|
|
The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
|
|
execution mask (``EXEC``) to linearize the control flow. The condition is
|
|
evaluated to make a mask of the lanes for which the condition evaluates to true.
|
|
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
|
|
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
|
|
``ELSE`` region is executed by negating the ``EXEC`` mask and taking its logical ``AND`` with
|
|
the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
|
|
region the ``EXEC`` mask is restored to the value it had at the beginning of the
|
|
region. This is shown below. Other approaches are possible, but the basic
|
|
concept is the same.
|
|
|
|
.. code::
|
|
:number-lines:
|
|
|
|
$lex_start:
|
|
a;
|
|
%1 = EXEC
|
|
%2 = c1
|
|
$lex_1_start:
|
|
EXEC = %1 & %2
|
|
$lex_1_then:
|
|
b;
|
|
%3 = EXEC
|
|
%4 = c2
|
|
$lex_1_1_start:
|
|
EXEC = %3 & %4
|
|
$lex_1_1_then:
|
|
c;
|
|
EXEC = ~EXEC & %3
|
|
$lex_1_1_else:
|
|
d;
|
|
EXEC = %3
|
|
$lex_1_1_end:
|
|
e;
|
|
EXEC = ~EXEC & %1
|
|
$lex_1_else:
|
|
f;
|
|
EXEC = %1
|
|
$lex_1_end:
|
|
g;
|
|
$lex_end:
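
The mask manipulation in the listing above can be traced with a small Python simulation. It is only an illustration of the linearization scheme: masks are plain integers with one bit per lane, ``c1`` and ``c2`` are the condition masks, and the function name ``run_linearized`` is hypothetical. The result records, for each statement, the ``EXEC`` mask under which it executes.

```python
def run_linearized(entry_exec, c1, c2):
    """Return the EXEC mask in effect for each statement a..g."""
    executed = {}
    exec_mask = entry_exec
    executed["a"] = exec_mask
    saved_1 = exec_mask                # %1 = EXEC
    exec_mask = saved_1 & c1           # outer THEN
    executed["b"] = exec_mask
    saved_3 = exec_mask                # %3 = EXEC
    exec_mask = saved_3 & c2           # inner THEN
    executed["c"] = exec_mask
    exec_mask = ~exec_mask & saved_3   # inner ELSE
    executed["d"] = exec_mask
    exec_mask = saved_3                # end of inner region
    executed["e"] = exec_mask
    exec_mask = ~exec_mask & saved_1   # outer ELSE
    executed["f"] = exec_mask
    exec_mask = saved_1                # end of outer region
    executed["g"] = exec_mask
    return executed


# Four lanes for brevity: lanes 0-1 take the outer THEN, lanes 2-3 the ELSE.
trace = run_linearized(entry_exec=0b1111, c1=0b0011, c2=0b0101)
```

In this trace every lane executes ``a`` and ``g``, and each lane executes exactly one of ``c``, ``d`` and ``f``.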
|
|
|
|
To create the DWARF location list expression that defines the location
|
|
description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
|
|
pseudo instruction can be used to annotate the linearized control flow. This can
|
|
be done by defining an artificial variable for the lane PC. The DWARF location
|
|
list expression created for it is used as the value of the
|
|
``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
|
|
|
|
A DWARF procedure is defined for each well nested structured control flow region
|
|
which provides the conceptual lane program location for a lane if it is not
|
|
active (namely it is divergent). The DWARF operation expression for each region
|
|
conceptually inherits the value of the immediately enclosing region and modifies
|
|
it according to the semantics of the region.
|
|
|
|
For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
|
|
the region for the ``THEN`` region since it is executed first. For the ``ELSE``
|
|
region the divergent program location is at the end of the ``IF/THEN/ELSE``
|
|
region since the ``THEN`` region has completed.
|
|
|
|
The lane PC artificial variable is assigned at each region transition. It uses
|
|
the immediately enclosing region's DWARF procedure to compute the program
|
|
location for each lane assuming they are divergent, and then modifies the result
|
|
by inserting the current program location for each lane that the ``EXEC`` mask
|
|
indicates is active.
|
|
|
|
By having separate DWARF procedures for each region, they can be reused to
|
|
define the value for any nested region. This reduces the total size of the DWARF
|
|
operation expressions.
|
|
|
|
The following provides an example using pseudo LLVM MIR.
|
|
|
|
.. code::
|
|
:number-lines:
|
|
|
|
$lex_start:
|
|
DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
|
|
DW_AT_name = "__uint64";
|
|
DW_AT_byte_size = 8;
|
|
DW_AT_encoding = DW_ATE_unsigned;
|
|
];
|
|
DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__active_lane_pc";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_regx PC;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
DW_OP_regval_type EXEC, %__uint_64;
|
|
DW_OP_LLVM_select_bit_piece 64, 64;
|
|
];
|
|
];
|
|
DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__divergent_lane_pc";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_LLVM_undefined;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
];
|
|
];
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
a;
|
|
%1 = EXEC;
|
|
DBG_VALUE %1, $noreg, %__lex_1_save_exec;
|
|
%2 = c1;
|
|
$lex_1_start:
|
|
EXEC = %1 & %2;
|
|
$lex_1_then:
|
|
DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__divergent_lane_pc_1_then";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc;
|
|
DW_OP_addrx &lex_1_start;
|
|
DW_OP_stack_value;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
DW_OP_call_ref %__lex_1_save_exec;
|
|
DW_OP_deref_type 64, %__uint_64;
|
|
DW_OP_LLVM_select_bit_piece 64, 64;
|
|
];
|
|
];
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_then;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
b;
|
|
%3 = EXEC;
|
|
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
|
|
%4 = c2;
|
|
$lex_1_1_start:
|
|
EXEC = %3 & %4;
|
|
$lex_1_1_then:
|
|
DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__divergent_lane_pc_1_1_then";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_then;
|
|
DW_OP_addrx &lex_1_1_start;
|
|
DW_OP_stack_value;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
DW_OP_call_ref %__lex_1_1_save_exec;
|
|
DW_OP_deref_type 64, %__uint_64;
|
|
DW_OP_LLVM_select_bit_piece 64, 64;
|
|
];
|
|
];
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_1_then;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
c;
|
|
EXEC = ~EXEC & %3;
|
|
$lex_1_1_else:
|
|
DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__divergent_lane_pc_1_1_else";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_then;
|
|
DW_OP_addrx &lex_1_1_end;
|
|
DW_OP_stack_value;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
DW_OP_call_ref %__lex_1_1_save_exec;
|
|
DW_OP_deref_type 64, %__uint_64;
|
|
DW_OP_LLVM_select_bit_piece 64, 64;
|
|
];
|
|
];
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_1_else;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
d;
|
|
EXEC = %3;
|
|
$lex_1_1_end:
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_then;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
e;
|
|
EXEC = ~EXEC & %1;
|
|
$lex_1_else:
|
|
DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
|
|
DW_AT_name = "__divergent_lane_pc_1_else";
|
|
DW_AT_location = DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc;
|
|
DW_OP_addrx &lex_1_end;
|
|
DW_OP_stack_value;
|
|
DW_OP_LLVM_extend 64, 64;
|
|
DW_OP_call_ref %__lex_1_save_exec;
|
|
DW_OP_deref_type 64, %__uint_64;
|
|
DW_OP_LLVM_select_bit_piece 64, 64;
|
|
];
|
|
];
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc_1_else;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
f;
|
|
EXEC = %1;
|
|
$lex_1_end:
|
|
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
|
|
DW_OP_call_ref %__divergent_lane_pc;
|
|
DW_OP_call_ref %__active_lane_pc;
|
|
];
|
|
g;
|
|
$lex_end:
|
|
|
|
The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
|
|
that are active, with the current program location.
|
|
|
|
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are created for
|
|
the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
|
|
instruction, location list entries will be created that describe where the
|
|
artificial variables are allocated at any given program location. The compiler
|
|
may allocate them to registers or spill them to memory.
|
|
|
|
The DWARF procedures for each region use the values of the saved execution mask
|
|
artificial variables to only update the lanes that are active on entry to the
|
|
region. All other lanes retain the value of the enclosing region where they were
|
|
last active. If they were not active on entry to the subprogram, then they will have
|
|
the undefined location description.
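
The way the region procedures and ``%__active_lane_pc`` combine can be modeled with a short Python sketch. It is a simplification, not an evaluation of the DWARF itself: the lane PC vector is a Python list, ``None`` stands for the undefined location, program locations are label strings, and the helper names ``divergent_region`` and ``active_lane_pc`` are hypothetical.

```python
UNDEFINED = None


def divergent_region(enclosing, region_pc, saved_exec):
    """Model a region's divergent lane PC procedure.

    Lanes active on entry to the region (bit set in saved_exec) get the
    region's divergent program location; other lanes inherit the value
    computed by the enclosing region.
    """
    return [
        region_pc if (saved_exec >> lane) & 1 else pc
        for lane, pc in enumerate(enclosing)
    ]


def active_lane_pc(divergent, current_pc, exec_mask):
    """Model %__active_lane_pc: active lanes are at the current PC."""
    return [
        current_pc if (exec_mask >> lane) & 1 else pc
        for lane, pc in enumerate(divergent)
    ]


# Four lanes; lane 3 was inactive on entry to the subprogram, and only
# lanes 0-1 take the THEN branch of the outer IF.
divergent_lane_pc = [UNDEFINED] * 4
saved_exec_1 = 0b0111
then_1 = divergent_region(divergent_lane_pc, "lex_1_start", saved_exec_1)
lane_pc_at_b = active_lane_pc(then_1, "b", exec_mask=0b0011)
```

In this example lanes 0 and 1 are at ``b``, lane 2 is conceptually waiting at ``lex_1_start``, and lane 3 keeps the undefined location.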
|
|
|
|
Other structured control flow regions can be handled similarly. For example,
|
|
loops would set the divergent program location for the region at the end of the
|
|
loop. Any lanes active will be in the loop, and any lanes not active must have
|
|
exited the loop.
|
|
|
|
An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
|
|
``IF/THEN/ELSE`` regions.
|
|
|
|
The DWARF procedures can use the active lane artificial variable described in
|
|
:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
|
|
``EXEC`` mask in order to support whole or quad wavefront mode.
|
|
|
|
.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
|
|
|
|
``DW_AT_LLVM_active_lane``
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
|
|
entry is used to specify the lanes that are conceptually active for a SIMT
|
|
thread.
|
|
|
|
The execution mask may be modified to implement whole or quad wavefront mode
|
|
operations. For example, all lanes may need to temporarily be made active to
|
|
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
|
|
update it to enable the necessary lanes, perform the operations, and then
|
|
restore the ``EXEC`` mask from the saved value. While executing the whole
|
|
wavefront region, the conceptual execution mask is the saved value, not the
|
|
``EXEC`` value.
|
|
|
|
This is handled by defining an artificial variable for the active lane mask. The
|
|
active lane mask artificial variable would be the actual ``EXEC`` mask for
|
|
normal regions, and the saved execution mask for regions where the mask is
|
|
temporarily updated. The location list expression created for this artificial
|
|
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
|
|
attribute.
|
|
|
|
``DW_AT_LLVM_augmentation``
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
|
|
debugger information entry has the following value for the augmentation string:
|
|
|
|
::
|
|
|
|
[amdgpu:v0.0]
|
|
|
|
The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
|
|
extensions used in the DWARF of the compilation unit. The version number
|
|
conforms to [SEMVER]_.
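
A consumer could extract the version from such an augmentation string as sketched below. The regular expression and the function name ``parse_augmentation`` are assumptions made for illustration; the sketch handles only a single bracketed ``vendor:vX.Y`` entry of the form shown above.

```python
import re

# One bracketed entry: a lowercase vendor name, then "v<major>.<minor>".
_AUGMENTATION = re.compile(
    r"\[(?P<vendor>[a-z]+):v(?P<major>\d+)\.(?P<minor>\d+)\]"
)


def parse_augmentation(text):
    """Return (vendor, major, minor) for an augmentation string entry."""
    m = _AUGMENTATION.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognized augmentation string: {text!r}")
    return m["vendor"], int(m["major"]), int(m["minor"])
```

Returning the major and minor numbers separately lets a consumer apply semantic versioning rules, for example rejecting an unknown major version while tolerating a newer minor one.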

Call Frame Information
----------------------

DWARF Call Frame Information (CFI) describes how a consumer can virtually
*unwind* call frames in a running process or core dump. See DWARF Version 5
section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.

For AMDGPU, the Common Information Entry (CIE) fields have the following values:

1.  ``augmentation`` string contains the following null-terminated UTF-8 string:

    ::

      [amd:v0.0]

    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.

2.  ``address_size`` for the ``Global`` address space is defined in
    :ref:`amdgpu-dwarf-address-space-identifier`.

3.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.

4.  ``code_alignment_factor`` is 4 bytes.

    .. TODO::

       Add to :ref:`amdgpu-processor-table` table.

5.  ``data_alignment_factor`` is 4 bytes.

    .. TODO::

       Add to :ref:`amdgpu-processor-table` table.

6.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
    for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.

7.  ``initial_instructions`` Since a subprogram X with fewer registers can be
    called from subprogram Y that has more allocated, X will not change any of
    the extra registers as it cannot access them. Therefore, the default rule
    for all columns is ``same value``.

For AMDGPU the register number follows the numbering defined in
:ref:`amdgpu-dwarf-register-identifier`.

For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
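
The "subtract 1" rule can be sketched as a lookup over a sorted instruction
table. The ``(start_address, size)`` table layout and the ``find_call_site``
helper are hypothetical; a real consumer would obtain instruction boundaries
from a disassembler or from the line table.

```python
import bisect


def find_call_site(instructions, return_address):
    """Find the instruction that contains return_address - 1.

    instructions is a list of (start_address, size_in_bytes) tuples sorted by
    start address. Returns the matching tuple, or None if nothing matches.
    """
    lookup = return_address - 1  # a byte inside the call instruction
    starts = [start for start, _ in instructions]
    index = bisect.bisect_right(starts, lookup) - 1
    if index < 0:
        return None
    start, size = instructions[index]
    return (start, size) if lookup < start + size else None


# An 8-byte call at 0x104 returns to 0x10c, which is the next instruction.
code = [(0x100, 4), (0x104, 8), (0x10C, 4)]
call = find_call_site(code, 0x10C)
```

Looking up the return address itself would select the instruction after the
call, which is why the byte before it is used.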

Accelerated Access
------------------

See DWARF Version 5 section 6.1.

Lookup By Name Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.

For AMDGPU the lookup by name section header table:

``augmentation_string_size`` (uword)

  Set to the length of the ``augmentation_string`` value which is always a
  multiple of 4.

``augmentation_string`` (sequence of UTF-8 characters)

  Contains the following UTF-8 string null padded to a multiple of 4 bytes:

  ::

    [amdgpu:v0.0]

  The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

  .. note::

    This is different to the DWARF Version 5 definition that requires the first
    4 characters to be the vendor ID. But this is consistent with the other
    augmentation strings and does allow multiple vendor contributions. However,
    backwards compatibility may be more desirable.
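
The padding rule for these two header fields can be sketched as follows. The
helper name ``encode_augmentation`` is hypothetical, and encoding the size as a
little-endian 4-byte ``uword`` is an assumption about the target byte order.

```python
import struct


def encode_augmentation(text):
    """Return augmentation_string_size followed by the padded string."""
    data = text.encode("utf-8")
    # Null pad up to the next multiple of 4 bytes (no padding if already there).
    padded = data + b"\0" * (-len(data) % 4)
    # Assumption: uword is an unsigned 4-byte little-endian integer.
    return struct.pack("<I", len(padded)) + padded


header_bytes = encode_augmentation("[amdgpu:v0.0]")
```

The 13-byte string ``[amdgpu:v0.0]`` is padded with three null bytes, so the
size field holds 16.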

Lookup By Address Section Header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See DWARF Version 5 section 6.1.2.

For AMDGPU the lookup by address section header table:

``address_size`` (ubyte)

  Match the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)

  AMDGPU does not use a segment selector so this is 0. The entries in the
  ``.debug_aranges`` do not have a segment selector.

Line Number Information
-----------------------

See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.

AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
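
Extracting the processor selection from ``e_flags`` might look like the sketch
below. The mask value ``0x0ff`` and the two processor codes are quoted from
memory of the ELF header tables in this guide and should be treated as
assumptions to verify there. The helper name ``mach_from_e_flags`` is
hypothetical.

```python
# Assumption: EF_AMDGPU_MACH selects the low 8 bits of e_flags.
EF_AMDGPU_MACH = 0x0FF

# Assumption: a small excerpt of the EF_AMDGPU_MACH_AMDGCN_* values.
MACH_NAMES = {
    0x02C: "gfx900",
    0x02F: "gfx906",
}


def mach_from_e_flags(e_flags):
    """Return (mach_value, processor_name) for an ELF e_flags word."""
    mach = e_flags & EF_AMDGPU_MACH
    return mach, MACH_NAMES.get(mach, "unknown")


# 0x12c has an extra feature bit set above the processor field.
mach, name = mach_from_e_flags(0x12C)
```

Bits outside the mask are dropped, so feature flags stored in the same word do
not affect the lookup.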

.. TODO::

  Should the ``isa`` state machine register be used to indicate if the code is
  in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?

For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

``segment_selector_size`` (ubyte)
  AMDGPU does not use a segment selector so this is 0.

``minimum_instruction_length`` (ubyte)
  For GFX9-GFX10 this is 4.

``maximum_operations_per_instruction`` (ubyte)
  For GFX9-GFX10 this is 1.

Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
<amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.

The Clang option used to control source embedding in AMDGPU is defined in
:ref:`amdgpu-clang-debug-options-table`.

.. table:: AMDGPU Clang Debug Options
   :name: amdgpu-clang-debug-options-table

   ==================== ==================================================
   Debug Flag           Description
   ==================== ==================================================
   -g[no-]embed-source  Enable/disable embedding source text in DWARF
                        debug sections. Useful for environments where
                        source cannot be written to disk, such as
                        when performing online compilation.
   ==================== ==================================================

For example:

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

32-Bit and 64-Bit DWARF Formats
-------------------------------

See DWARF Version 5 section 7.4 and
:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.

For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
  the 32-bit DWARF format.

Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:

``address_size`` (ubyte)
  Matches the address size for the ``Global`` address space defined in
  :ref:`amdgpu-dwarf-address-space-identifier`.

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
:ref:`amdgpu-amdhsa-code-object-metadata-v2`,
:ref:`amdgpu-amdhsa-code-object-metadata-v3`,
:ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.

Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries. For
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.

.. _amdgpu-amdhsa-code-object-metadata-v2:

Code Object V2 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V2 is not the default code object version emitted by this version
  of LLVM.

Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records-v2`).

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO::

  Is the string null terminated? It probably should not if YAML allows it to
  contain null characters, otherwise it should be.

The metadata is represented as a single YAML document consisting of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object V2 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-v2-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32-bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32-bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32-bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================
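
The colon-separated "Printf" encoding above can be decoded as sketched below.
The function name ``parse_printf_record`` and the returned dictionary layout
are illustrative assumptions, not part of any runtime API. The format string is
taken as everything after the ``N`` size fields, so colons inside it are kept.

```python
def parse_printf_record(record):
    """Decode one "ID:N:S[0]:...:S[N-1]:FormatString" entry."""
    id_text, n_text, rest = record.split(":", 2)
    n = int(n_text)
    # Split off exactly N size fields; the remainder is the format string,
    # which may itself contain ':' characters.
    parts = rest.split(":", n)
    if len(parts) != n + 1:
        raise ValueError("record has fewer than N size fields")
    return {
        "id": int(id_text),
        "sizes": [int(size) for size in parts[:n]],
        "format": parts[n],
    }


info = parse_printf_record("1:2:4:8:x=%d: y=%ld")
```

Limiting the number of splits is what keeps a format string such as
``x=%d: y=%ld`` in one piece.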

..

  .. table:: AMDHSA Code Object V2 Kernel Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"

     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
                                                for the definition of the mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
                                                for the mapping definition.
     ================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.

     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================
|
|
|
|
..
|
|
|
|
.. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
|
|
:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
|
|
|
|
================= ============== ========= ================================
|
|
String Key Value Type Required? Description
|
|
================= ============== ========= ================================
|
|
"Name" string Kernel argument name.
|
|
"TypeName" string Kernel argument type name.
|
|
"Size" integer Required Kernel argument size in bytes.
|
|
"Align" integer Required Kernel argument alignment in
|
|
bytes. Must be a power of two.
|
|
"ValueKind" string Required Kernel argument kind that
|
|
specifies how to set up the
|
|
corresponding argument.
|
|
Values include:
|
|
|
|
"ByValue"
|
|
The argument is copied
|
|
directly into the kernarg.
|
|
|
|
"GlobalBuffer"
|
|
A global address space pointer
|
|
to the buffer data is passed
|
|
in the kernarg.
|
|
|
|
"DynamicSharedPointer"
|
|
A group address space pointer
|
|
to dynamically allocated LDS
|
|
is passed in the kernarg.
|
|
|
|
"Sampler"
|
|
A global address space
|
|
pointer to a S# is passed in
|
|
the kernarg.
|
|
|
|
"Image"
|
|
A global address space
|
|
pointer to a T# is passed in
|
|
the kernarg.
|
|
|
|
"Pipe"
|
|
A global address space pointer
|
|
to an OpenCL pipe is passed in
|
|
the kernarg.
|
|
|
|
"Queue"
|
|
A global address space pointer
|
|
to an OpenCL device enqueue
|
|
queue is passed in the
|
|
kernarg.
|
|
|
|
"HiddenGlobalOffsetX"
|
|
The OpenCL grid dispatch
|
|
global offset for the X
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"HiddenGlobalOffsetY"
|
|
The OpenCL grid dispatch
|
|
global offset for the Y
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"HiddenGlobalOffsetZ"
|
|
The OpenCL grid dispatch
|
|
global offset for the Z
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"HiddenNone"
|
|
An argument that is not used
|
|
by the kernel. Space needs to
|
|
be left for it, but it does
|
|
not need to be set up.
|
|
|
|
"HiddenPrintfBuffer"
|
|
A global address space pointer
|
|
to the runtime printf buffer
|
|
is passed in kernarg.
|
|
|
|
"HiddenHostcallBuffer"
|
|
A global address space pointer
|
|
to the runtime hostcall buffer
|
|
is passed in kernarg.
|
|
|
|
"HiddenDefaultQueue"
|
|
A global address space pointer
|
|
to the OpenCL device enqueue
|
|
queue that should be used by
|
|
the kernel by default is
|
|
passed in the kernarg.
|
|
|
|
"HiddenCompletionAction"
|
|
A global address space pointer
|
|
to help link enqueued kernels into
|
|
the ancestor tree for determining
|
|
when the parent kernel has finished.
|
|
|
|
"HiddenMultiGridSyncArg"
|
|
A global address space pointer for
|
|
multi-grid synchronization is
|
|
passed in the kernarg.
|
|
|
|
"ValueType" string Unused and deprecated. This should no longer
|
|
be emitted, but is accepted for compatibility.
|
|
|
|
|
|
"PointeeAlign" integer Alignment in bytes of pointee
|
|
type for pointer type kernel
|
|
argument. Must be a power
|
|
of 2. Only present if
|
|
"ValueKind" is
|
|
"DynamicSharedPointer".
|
|
"AddrSpaceQual" string Kernel argument address space
|
|
qualifier. Only present if
|
|
"ValueKind" is "GlobalBuffer" or
|
|
"DynamicSharedPointer". Values
|
|
are:
|
|
|
|
- "Private"
|
|
- "Global"
|
|
- "Constant"
|
|
- "Local"
|
|
- "Generic"
|
|
- "Region"
|
|
|
|
.. TODO::
|
|
|
|
Is GlobalBuffer only Global
|
|
or Constant? Is
|
|
DynamicSharedPointer always
|
|
Local? Can HCC allow Generic?
|
|
How can Private or Region
|
|
ever happen?
|
|
|
|
"AccQual" string Kernel argument access
|
|
qualifier. Only present if
|
|
"ValueKind" is "Image" or
|
|
"Pipe". Values
|
|
are:
|
|
|
|
- "ReadOnly"
|
|
- "WriteOnly"
|
|
- "ReadWrite"
|
|
|
|
.. TODO::
|
|
|
|
Does this apply to
|
|
GlobalBuffer?
|
|
|
|
"ActualAccQual" string The actual memory accesses
|
|
performed by the kernel on the
|
|
kernel argument. Only present if
|
|
"ValueKind" is "GlobalBuffer",
|
|
"Image", or "Pipe". This may be
|
|
more restrictive than indicated
|
|
by "AccQual" to reflect what the
|
|
kernel actually does. If not
|
|
present then the runtime must
|
|
assume what is implied by
|
|
"AccQual" and "IsConst". Values
|
|
are:
|
|
|
|
- "ReadOnly"
|
|
- "WriteOnly"
|
|
- "ReadWrite"
|
|
|
|
"IsConst" boolean Indicates if the kernel argument
|
|
is const qualified. Only present
|
|
if "ValueKind" is
|
|
"GlobalBuffer".
|
|
|
|
"IsRestrict" boolean Indicates if the kernel argument
|
|
is restrict qualified. Only
|
|
present if "ValueKind" is
|
|
"GlobalBuffer".
|
|
|
|
"IsVolatile" boolean Indicates if the kernel argument
|
|
is volatile qualified. Only
|
|
present if "ValueKind" is
|
|
"GlobalBuffer".
|
|
|
|
"IsPipe" boolean Indicates if the kernel argument
|
|
is pipe qualified. Only present
|
|
if "ValueKind" is "Pipe".
|
|
|
|
.. TODO::
|
|
|
|
Can GlobalBuffer be pipe
|
|
qualified?
|
|
|
|
================= ============== ========= ================================

..

  .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
     :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table

     ============================ ============== ========= =====================
     String Key                   Value Type     Required? Description
     ============================ ============== ========= =====================
     "KernargSegmentSize"         integer        Required  The size in bytes of
                                                           the kernarg segment
                                                           that holds the values
                                                           of the arguments to
                                                           the kernel.
     "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                           segment memory
                                                           required by a
                                                           work-group in
                                                           bytes. This does not
                                                           include any
                                                           dynamically allocated
                                                           group segment memory
                                                           that may be added
                                                           when the kernel is
                                                           dispatched.
     "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                           private address space
                                                           memory required for a
                                                           work-item in
                                                           bytes. If the kernel
                                                           uses a dynamic call
                                                           stack then additional
                                                           space must be added
                                                           to this value for the
                                                           call stack.
     "KernargSegmentAlign"        integer        Required  The maximum byte
                                                           alignment of
                                                           arguments in the
                                                           kernarg segment. Must
                                                           be a power of 2.
     "WavefrontSize"              integer        Required  Wavefront size. Must
                                                           be a power of 2.
     "NumSGPRs"                   integer        Required  Number of scalar
                                                           registers used by a
                                                           wavefront for
                                                           GFX6-GFX10. This
                                                           includes the special
                                                           SGPRs for VCC, Flat
                                                           Scratch (GFX7-GFX10)
                                                           and XNACK (for
                                                           GFX8-GFX10). It does
                                                           not include the 16
                                                           SGPRs added if a trap
                                                           handler is
                                                           enabled. It is not
                                                           rounded up to the
                                                           allocation
                                                           granularity.
     "NumVGPRs"                   integer        Required  Number of vector
                                                           registers used by
                                                           each work-item for
                                                           GFX6-GFX10.
     "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                           work-group size
                                                           supported by the
                                                           kernel in work-items.
                                                           Must be >=1 and
                                                           consistent with
                                                           ReqdWorkGroupSize if
                                                           not 0, 0, 0.
     "NumSpilledSGPRs"            integer                  Number of stores from
                                                           a scalar register to
                                                           a register allocator
                                                           created spill
                                                           location.
     "NumSpilledVGPRs"            integer                  Number of stores from
                                                           a vector register to
                                                           a register allocator
                                                           created spill
                                                           location.
     ============================ ============== ========= =====================
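
The constraints listed in the code properties table can be checked mechanically
once the YAML has been loaded into a dictionary. The following is a minimal
sketch under that assumption. ``check_code_props`` is a hypothetical helper, and
it only covers the rules stated in the table: required keys, power-of-two
values, and ``MaxFlatWorkGroupSize >= 1``.

```python
REQUIRED_KEYS = (
    "KernargSegmentSize",
    "GroupSegmentFixedSize",
    "PrivateSegmentFixedSize",
    "KernargSegmentAlign",
    "WavefrontSize",
    "NumSGPRs",
    "NumVGPRs",
    "MaxFlatWorkGroupSize",
)
POWER_OF_TWO_KEYS = ("KernargSegmentAlign", "WavefrontSize")


def is_power_of_two(value):
    return value > 0 and value & (value - 1) == 0


def check_code_props(props):
    """Return a list of human-readable problems found in a CodeProps mapping."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in props:
            problems.append(f"missing required key {key}")
    for key in POWER_OF_TWO_KEYS:
        if key in props and not is_power_of_two(props[key]):
            problems.append(f"{key} must be a power of 2")
    if props.get("MaxFlatWorkGroupSize", 1) < 1:
        problems.append("MaxFlatWorkGroupSize must be >= 1")
    return problems


example = {
    "KernargSegmentSize": 24,
    "GroupSegmentFixedSize": 0,
    "PrivateSegmentFixedSize": 0,
    "KernargSegmentAlign": 8,
    "WavefrontSize": 64,
    "NumSGPRs": 10,
    "NumVGPRs": 4,
    "MaxFlatWorkGroupSize": 256,
}
problems = check_code_props(example)
```

Collecting problems in a list rather than raising on the first one lets a tool
report every violation in a single pass.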

.. _amdgpu-amdhsa-code-object-metadata-v3:

Code Object V3 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V3 is not the default code object version emitted by this version
  of LLVM.

Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
record (see :ref:`amdgpu-note-records-v3-onwards`).

The metadata is represented as Message Pack formatted binary data (see
[MsgPack]_). The top level is a Message Pack map that includes the
keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
tables.

Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by "*vendor-name*." where
``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the
same *vendor-name*.

  .. table:: AMDHSA Code Object V3 Metadata Map
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v3

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 0.
     "amdhsa.printf"   sequence of              Each string is encoded information
                       strings                  about a printf function call. The
                                                encoded information is organized as
                                                fields separated by colon (':'):

                                                ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                                where:

                                                ``ID``
                                                  A 32-bit integer as a unique id for
                                                  each printf function call

                                                ``N``
                                                  A 32-bit integer equal to the number
                                                  of arguments of printf function call
                                                  minus 1

                                                ``S[i]`` (where i = 0, 1, ... , N-1)
                                                  32-bit integers for the size in bytes
                                                  of the i-th FormatString argument of
                                                  the printf function call

                                                FormatString
                                                  The format string passed to the
                                                  printf function call.
     "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
                       map                      kernel in the code object. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
                                                for the definition of the keys included
                                                in that map.
     ================= ============== ========= =======================================
|
|
|
|
..
|
|
|
|
.. table:: AMDHSA Code Object V3 Kernel Metadata Map
|
|
:name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
|
|
|
|
=================================== ============== ========= ================================
|
|
String Key Value Type Required? Description
|
|
=================================== ============== ========= ================================
|
|
".name" string Required Source name of the kernel.
|
|
".symbol" string Required Name of the kernel
|
|
descriptor ELF symbol.
|
|
".language" string Source language of the kernel.
|
|
Values include:
|
|
|
|
- "OpenCL C"
|
|
- "OpenCL C++"
|
|
- "HCC"
|
|
- "HIP"
|
|
- "OpenMP"
|
|
- "Assembler"
|
|
|
|
".language_version" sequence of - The first integer is the major
|
|
2 integers version.
|
|
- The second integer is the
|
|
minor version.
|
|
".args" sequence of Sequence of maps of the
|
|
map kernel arguments. See
|
|
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
|
|
for the definition of the keys
|
|
included in that map.
|
|
".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
|
|
3 integers must be >=1 and the dispatch
|
|
work-group size X, Y, Z must
|
|
correspond to the specified
|
|
values. Defaults to 0, 0, 0.
|
|
|
|
Corresponds to the OpenCL
|
|
``reqd_work_group_size``
|
|
attribute.
|
|
".workgroup_size_hint" sequence of The dispatch work-group size
|
|
3 integers X, Y, Z is likely to be the
|
|
specified values.
|
|
|
|
Corresponds to the OpenCL
|
|
``work_group_size_hint``
|
|
attribute.
|
|
".vec_type_hint" string The name of a scalar or vector
|
|
type.
|
|
|
|
Corresponds to the OpenCL
|
|
``vec_type_hint`` attribute.
|
|
|
|
".device_enqueue_symbol" string The external symbol name
|
|
associated with a kernel.
|
|
OpenCL runtime allocates a
|
|
global buffer for the symbol
|
|
and saves the kernel's address
|
|
to it, which is used for
|
|
device side enqueueing. Only
|
|
available for device side
|
|
enqueued kernels.
|
|
".kernarg_segment_size" integer Required The size in bytes of
|
|
the kernarg segment
|
|
that holds the values
|
|
of the arguments to
|
|
the kernel.
|
|
".group_segment_fixed_size" integer Required The amount of group
|
|
segment memory
|
|
required by a
|
|
work-group in
|
|
bytes. This does not
|
|
include any
|
|
dynamically allocated
|
|
group segment memory
|
|
that may be added
|
|
when the kernel is
|
|
dispatched.
|
|
".private_segment_fixed_size" integer Required The amount of fixed
|
|
private address space
|
|
memory required for a
|
|
work-item in
|
|
bytes. If the kernel
|
|
uses a dynamic call
|
|
stack then additional
|
|
space must be added
|
|
to this value for the
|
|
call stack.
|
|
".kernarg_segment_align" integer Required The maximum byte
|
|
alignment of
|
|
arguments in the
|
|
kernarg segment. Must
|
|
be a power of 2.
|
|
".wavefront_size" integer Required Wavefront size. Must
|
|
be a power of 2.
|
|
".sgpr_count" integer Required Number of scalar
|
|
registers required by a
|
|
wavefront for
|
|
GFX6-GFX9. A register
|
|
is required if it is
|
|
used explicitly, or
|
|
if a higher numbered
|
|
register is used
|
|
explicitly. This
|
|
includes the special
|
|
SGPRs for VCC, Flat
|
|
Scratch (GFX7-GFX9)
|
|
and XNACK (for
|
|
GFX8-GFX9). It does
|
|
not include the 16
|
|
SGPR added if a trap
|
|
handler is
|
|
enabled. It is not
|
|
rounded up to the
|
|
allocation
|
|
granularity.
|
|
".vgpr_count" integer Required Number of vector
|
|
registers required by
|
|
each work-item for
|
|
GFX6-GFX9. A register
|
|
is required if it is
|
|
used explicitly, or
|
|
if a higher numbered
|
|
register is used
|
|
explicitly.
|
|
".max_flat_workgroup_size" integer Required Maximum flat
|
|
work-group size
|
|
supported by the
|
|
kernel in work-items.
|
|
Must be >=1 and
|
|
consistent with
|
|
".reqd_workgroup_size" if
|
|
not 0, 0, 0.
|
|
".sgpr_spill_count" integer Number of stores from
|
|
a scalar register to
|
|
a register allocator
|
|
created spill
|
|
location.
|
|
".vgpr_spill_count" integer Number of stores from
|
|
a vector register to
|
|
a register allocator
|
|
created spill
|
|
location.
|
|
".kind" string The kind of the kernel
|
|
with the following
|
|
values:
|
|
|
|
"normal"
|
|
Regular kernels.
|
|
|
|
"init"
|
|
These kernels must be
|
|
invoked after loading
|
|
the containing code
|
|
object and must
|
|
complete before any
|
|
normal and fini
|
|
kernels in the same
|
|
code object are
|
|
invoked.
|
|
|
|
"fini"
|
|
These kernels must be
|
|
invoked before
|
|
unloading the
|
|
containing code object
|
|
and after all init and
|
|
normal kernels in the
|
|
same code object have
|
|
been invoked and
|
|
completed.
|
|
|
|
If omitted, "normal" is
|
|
assumed.
|
|
=================================== ============== ========= ================================
|
|
|
|
..
|
|
|
|
.. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
|
|
:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
|
|
|
|
====================== ============== ========= ================================
|
|
String Key Value Type Required? Description
|
|
====================== ============== ========= ================================
|
|
".name" string Kernel argument name.
|
|
".type_name" string Kernel argument type name.
|
|
".size" integer Required Kernel argument size in bytes.
|
|
".offset" integer Required Kernel argument offset in
|
|
bytes. The offset must be a
|
|
multiple of the alignment
|
|
required by the argument.
|
|
".value_kind" string Required Kernel argument kind that
|
|
specifies how to set up the
|
|
corresponding argument.
|
|
Values include:
|
|
|
|
"by_value"
|
|
The argument is copied
|
|
directly into the kernarg.
|
|
|
|
"global_buffer"
|
|
A global address space pointer
|
|
to the buffer data is passed
|
|
in the kernarg.
|
|
|
|
"dynamic_shared_pointer"
|
|
A group address space pointer
|
|
to dynamically allocated LDS
|
|
is passed in the kernarg.
|
|
|
|
"sampler"
|
|
A global address space
|
|
pointer to a S# is passed in
|
|
the kernarg.
|
|
|
|
"image"
|
|
A global address space
|
|
pointer to a T# is passed in
|
|
the kernarg.
|
|
|
|
"pipe"
|
|
A global address space pointer
|
|
to an OpenCL pipe is passed in
|
|
the kernarg.
|
|
|
|
"queue"
|
|
A global address space pointer
|
|
to an OpenCL device enqueue
|
|
queue is passed in the
|
|
kernarg.
|
|
|
|
"hidden_global_offset_x"
|
|
The OpenCL grid dispatch
|
|
global offset for the X
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"hidden_global_offset_y"
|
|
The OpenCL grid dispatch
|
|
global offset for the Y
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"hidden_global_offset_z"
|
|
The OpenCL grid dispatch
|
|
global offset for the Z
|
|
dimension is passed in the
|
|
kernarg.
|
|
|
|
"hidden_none"
|
|
An argument that is not used
|
|
by the kernel. Space needs to
|
|
be left for it, but it does
|
|
not need to be set up.
|
|
|
|
"hidden_printf_buffer"
|
|
A global address space pointer
|
|
to the runtime printf buffer
|
|
is passed in kernarg.
|
|
|
|
"hidden_hostcall_buffer"
|
|
A global address space pointer
|
|
to the runtime hostcall buffer
|
|
is passed in kernarg.
|
|
|
|
"hidden_default_queue"
|
|
A global address space pointer
|
|
to the OpenCL device enqueue
|
|
queue that should be used by
|
|
the kernel by default is
|
|
passed in the kernarg.
|
|
|
|
"hidden_completion_action"
|
|
A global address space pointer
|
|
to help link enqueued kernels into
|
|
the ancestor tree for determining
|
|
when the parent kernel has finished.
|
|
|
|
"hidden_multigrid_sync_arg"
|
|
A global address space pointer for
|
|
multi-grid synchronization is
|
|
passed in the kernarg.
|
|
|
|
".value_type" string Unused and deprecated. This should no longer
|
|
be emitted, but is accepted for compatibility.
|
|
|
|
".pointee_align" integer Alignment in bytes of pointee
|
|
type for pointer type kernel
|
|
argument. Must be a power
|
|
of 2. Only present if
|
|
".value_kind" is
|
|
"dynamic_shared_pointer".
|
|
".address_space" string Kernel argument address space
|
|
qualifier. Only present if
|
|
".value_kind" is "global_buffer" or
|
|
"dynamic_shared_pointer". Values
|
|
are:
|
|
|
|
- "private"
|
|
- "global"
|
|
- "constant"
|
|
- "local"
|
|
- "generic"
|
|
- "region"
|
|
|
|
.. TODO::
|
|
|
|
Is "global_buffer" only "global"
|
|
or "constant"? Is
|
|
"dynamic_shared_pointer" always
|
|
"local"? Can HCC allow "generic"?
|
|
How can "private" or "region"
|
|
ever happen?
|
|
|
|
".access" string Kernel argument access
|
|
qualifier. Only present if
|
|
".value_kind" is "image" or
|
|
"pipe". Values
|
|
are:
|
|
|
|
- "read_only"
|
|
- "write_only"
|
|
- "read_write"
|
|
|
|
.. TODO::
|
|
|
|
Does this apply to
|
|
"global_buffer"?
|
|
|
|
".actual_access" string The actual memory accesses
|
|
performed by the kernel on the
|
|
kernel argument. Only present if
|
|
".value_kind" is "global_buffer",
|
|
"image", or "pipe". This may be
|
|
more restrictive than indicated
|
|
by ".access" to reflect what the
|
|
kernel actually does. If not
|
|
present then the runtime must
|
|
assume what is implied by
|
|
".access" and ".is_const". Values
|
|
are:
|
|
|
|
- "read_only"
|
|
- "write_only"
|
|
- "read_write"
|
|
|
|
".is_const" boolean Indicates if the kernel argument
|
|
is const qualified. Only present
|
|
if ".value_kind" is
|
|
"global_buffer".
|
|
|
|
".is_restrict" boolean Indicates if the kernel argument
|
|
is restrict qualified. Only
|
|
present if ".value_kind" is
|
|
"global_buffer".
|
|
|
|
".is_volatile" boolean Indicates if the kernel argument
|
|
is volatile qualified. Only
|
|
present if ".value_kind" is
|
|
"global_buffer".
|
|
|
|
".is_pipe" boolean Indicates if the kernel argument
|
|
is pipe qualified. Only present
|
|
if ".value_kind" is "pipe".
|
|
|
|
.. TODO::
|
|
|
|
Can "global_buffer" be pipe
|
|
qualified?
|
|
|
|
====================== ============== ========= ================================
|
|
|
|
.. _amdgpu-amdhsa-code-object-metadata-v4:

Code Object V4 Metadata
+++++++++++++++++++++++

Code object V4 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.

  .. table:: AMDHSA Code Object V4 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v4

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 1.
     "amdhsa.target"   string         Required  The target name of the code using the syntax:

                                                .. code::

                                                  <target-triple> [ "-" <target-id> ]

                                                A canonical target ID must be
                                                used. See :ref:`amdgpu-target-triples`
                                                and :ref:`amdgpu-target-id`.
     ================= ============== ========= =======================================
|
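As an illustration of the ``amdhsa.target`` syntax above, the following Python
sketch splits a target name into its triple and optional target ID. The helper
name ``split_amdhsa_target`` and the example string are assumptions made for
illustration only. The sketch relies on the triple always having four
``-`` separated components, with a possibly empty environment (see
:ref:`amdgpu-target-triples`); it does not validate the target ID itself.

```python
def split_amdhsa_target(name):
    """Split '<target-triple> [ "-" <target-id> ]' into (triple, target_id).

    Hypothetical helper: assumes the triple is always
    <arch>-<vendor>-<os>-<environment>, where <environment> may be empty.
    """
    # At most 4 splits, so any '-' inside the target ID (e.g. "xnack-") survives.
    parts = name.split("-", 4)
    if len(parts) < 4:
        raise ValueError("incomplete target triple: %r" % name)
    triple = "-".join(parts[:4])
    target_id = parts[4] if len(parts) == 5 else None
    return triple, target_id


# Example input is an assumption, not taken from a real code object.
triple, target_id = split_amdhsa_target(
    "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-")
```

Limiting the number of splits is the main design point: feature settings in a
target ID may end in ``-``, so a plain split on every ``-`` would lose them.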
|
|
|
.. _amdgpu-amdhsa-code-object-metadata-v5:

Code Object V5 Metadata
+++++++++++++++++++++++

.. warning::
  Code object V5 is not the default code object version emitted by this version
  of LLVM.


Code object V5 metadata is the same as
:ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5` and table
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.

  .. table:: AMDHSA Code Object V5 Metadata Map Changes
     :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

     ================= ============== ========= =======================================
     String Key        Value Type     Required? Description
     ================= ============== ========= =======================================
     "amdhsa.version"  sequence of    Required  - The first integer is the major
                       2 integers                 version. Currently 1.
                                                - The second integer is the minor
                                                  version. Currently 2.
     ================= ============== ========= =======================================
|
|
|
|
..
|
|
|
|
.. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
|
|
:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
|
|
|
|
====================== ============== ========= ================================
|
|
String Key Value Type Required? Description
|
|
====================== ============== ========= ================================
|
|
".value_kind" string Required Kernel argument kind that
|
|
specifies how to set up the
|
|
corresponding argument.
|
|
Values include:
|
|
the same as code object V3 metadata
|
|
(see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
|
|
with the following additions:
|
|
|
|
"hidden_block_count_x"
|
|
The grid dispatch work-group count for the X dimension
|
|
is passed in the kernarg. Some languages, such as OpenCL,
|
|
support a last work-group in each dimension being partial.
|
|
This count only includes the non-partial work-group count.
|
|
This is not the same as the value in the AQL dispatch packet,
|
|
which has the grid size in work-items.
|
|
|
|
"hidden_block_count_y"
|
|
The grid dispatch work-group count for the Y dimension
|
|
is passed in the kernarg. Some languages, such as OpenCL,
|
|
support a last work-group in each dimension being partial.
|
|
This count only includes the non-partial work-group count.
|
|
This is not the same as the value in the AQL dispatch packet,
|
|
which has the grid size in work-items. If the grid dimensionality
|
|
is 1, then must be 1.
|
|
|
|
"hidden_block_count_z"
|
|
The grid dispatch work-group count for the Z dimension
|
|
is passed in the kernarg. Some languages, such as OpenCL,
|
|
support a last work-group in each dimension being partial.
|
|
This count only includes the non-partial work-group count.
|
|
This is not the same as the value in the AQL dispatch packet,
|
|
which has the grid size in work-items. If the grid dimensionality
|
|
is 1 or 2, then must be 1.
|
|
|
|
"hidden_group_size_x"
|
|
The grid dispatch work-group size for the X dimension is
|
|
passed in the kernarg. This size only applies to the
|
|
non-partial work-groups. This is the same value as the AQL
|
|
dispatch packet work-group size.
|
|
|
|
"hidden_group_size_y"
|
|
The grid dispatch work-group size for the Y dimension is
|
|
passed in the kernarg. This size only applies to the
|
|
non-partial work-groups. This is the same value as the AQL
|
|
dispatch packet work-group size. If the grid dimensionality
|
|
is 1, then must be 1.
|
|
|
|
"hidden_group_size_z"
|
|
The grid dispatch work-group size for the Z dimension is
|
|
passed in the kernarg. This size only applies to the
|
|
non-partial work-groups. This is the same value as the AQL
|
|
dispatch packet work-group size. If the grid dimensionality
|
|
is 1 or 2, then must be 1.
|
|
|
|
"hidden_remainder_x"
|
|
The grid dispatch work-group size of the partial work-group
|
|
of the X dimension, if it exists. Must be zero if a partial
|
|
work group does not exist in the X dimension.
|
|
|
|
"hidden_remainder_y"
|
|
The grid dispatch work-group size of the partial work-group
|
|
of the Y dimension, if it exists. Must be zero if a partial
|
|
work group does not exist in the Y dimension.
|
|
|
|
"hidden_remainder_z"
|
|
The grid dispatch work-group size of the partial work-group
|
|
of the Z dimension, if it exists. Must be zero if a partial
|
|
work group does not exist in the Z dimension.
|
|
|
|
"hidden_grid_dims"
|
|
The grid dispatch dimensionality. This is the same value
|
|
as the AQL dispatch packet dimensionality. Must be a value
|
|
between 1 and 3.
|
|
|
|
"hidden_private_base"
|
|
The high 32 bits of the flat addressing private aperture base.
|
|
Only used by GFX8 to allow conversion between private segment
|
|
and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
|
|
|
|
"hidden_shared_base"
|
|
The high 32 bits of the flat addressing shared aperture base.
|
|
Only used by GFX8 to allow conversion between shared segment
|
|
and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
|
|
|
|
"hidden_queue_ptr"
|
|
A global memory address space pointer to the ROCm runtime
|
|
``struct amd_queue_t`` structure for the HSA queue of the
|
|
associated dispatch AQL packet. It is only required for pre-GFX9
|
|
devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
|
|
|
|
====================== ============== ========= ================================
|
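The relationship between the AQL dispatch packet values and the
``hidden_block_count_*``, ``hidden_group_size_*`` and ``hidden_remainder_*``
arguments described in the table above can be sketched in Python. The function
name ``hidden_grid_args`` and the example sizes are assumptions for
illustration; this is not how any particular runtime computes them.

```python
def hidden_grid_args(grid_size, group_size):
    """Derive per-dimension hidden kernel argument values.

    grid_size  -- grid size in work-items for (x, y, z), as in the AQL packet
    group_size -- work-group size for (x, y, z), as in the AQL packet
    Unused dimensions are expected to be passed as 1.
    """
    args = {}
    for dim, grid, group in zip("xyz", grid_size, group_size):
        # Only complete (non-partial) work-groups are counted.
        args["hidden_block_count_" + dim] = grid // group
        args["hidden_group_size_" + dim] = group
        # Size of the trailing partial work-group, or 0 if there is none.
        args["hidden_remainder_" + dim] = grid % group
    return args


# A 1000 x 4 grid with 256 x 4 work-groups (made-up example values).
args = hidden_grid_args((1000, 4, 1), (256, 4, 1))
```

With these example values the X dimension has three complete work-groups and a
partial work-group of 232 work-items, while the unused Z dimension yields a
count and size of 1.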
|
|
|
..
|
|
|
|
Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
are 64 bytes) can be placed. See the *HSA Platform System Architecture
Specification* [HSA]_ for the AQL queue mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
   It must be for a kernel that is contained in a code object that was loaded
   by an HSA compatible runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the HSA compatible runtime
   allocator for a memory region with the kernarg property for the kernel agent
   that will execute the kernel. It must be at least 16-byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
   runtime API uses 64-bit atomic operations to reserve space in the AQL queue
   for the packet. The packet must be set up, and the final write must use an
   atomic store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The HSA compatible runtime
   queries on the kernel symbol can be used to obtain the code object values
   which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.
|
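The ordering required when creating a dispatch packet (reserve a slot, fill in
the packet, write the packet kind last, then ring the doorbell) can be
illustrated with a purely host-side Python simulation. This is only a sketch of
the order of operations: Python has no release-ordered atomic store, the class
``ToyAqlQueue`` and its members are hypothetical and are not part of any HSA
runtime, and the packet body is treated as opaque bytes. The packet type values
are assumed to match the ``hsa_packet_type_t`` enumeration of the HSA runtime.

```python
import struct

PACKET_SIZE = 64                 # all AQL packets are 64 bytes
PACKET_TYPE_INVALID = 1          # assumed: HSA_PACKET_TYPE_INVALID
PACKET_TYPE_KERNEL_DISPATCH = 2  # assumed: HSA_PACKET_TYPE_KERNEL_DISPATCH


class ToyAqlQueue:
    """Hypothetical in-memory stand-in for an AQL queue (not a runtime API)."""

    def __init__(self, num_packets):
        self.slots = [bytearray(PACKET_SIZE) for _ in range(num_packets)]
        for slot in self.slots:
            # Unused slots carry the "invalid" packet kind in their header.
            struct.pack_into("<H", slot, 0, PACKET_TYPE_INVALID)
        self.write_index = 0
        self.doorbell = None

    def dispatch(self, body):
        if len(body) != PACKET_SIZE - 2:
            raise ValueError("packet body must be %d bytes" % (PACKET_SIZE - 2))
        # 1. Reserve a slot (a 64-bit atomic operation in a real runtime).
        index = self.write_index
        self.write_index += 1
        slot = self.slots[index % len(self.slots)]
        # 2. Fill in everything except the 16-bit header.
        slot[2:PACKET_SIZE] = body
        # 3. Publish: the header write comes last (a store release in reality).
        struct.pack_into("<H", slot, 0, PACKET_TYPE_KERNEL_DISPATCH)
        # 4. Notify the packet processor through the doorbell.
        self.doorbell = index
        return index


queue = ToyAqlQueue(4)
index = queue.dispatch(bytes(PACKET_SIZE - 2))
```

Writing the header last matters because the packet processor treats a slot as
ready as soon as its packet kind is no longer invalid.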
|
|
|
.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.
|
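The private address mapping above can be evaluated directly. The following
Python sketch applies the formula exactly as written, which treats the private
address as an index that is scaled by the wavefront size and the dword size; the
function name ``scratch_address`` and the base value are assumptions chosen
only for illustration.

```python
def scratch_address(wavefront_scratch_base, private_address, lane_id,
                    wavefront_size=64):
    """Evaluate the private-to-physical mapping as written in the text."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)


base = 0x10000  # arbitrary example base, not a real hardware value
# The same private address accessed by the first four lanes of a wavefront.
lane_addresses = [scratch_address(base, 3, lane) for lane in range(4)]
```

The four resulting addresses are consecutive dwords, which is the interleaving
property that lets a wavefront touch fewer cache lines when all lanes access
the same private address.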
|
|
|
The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
|
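Because the apertures are 2^32 bytes in size and 2^32 aligned in 64-bit address
mode, the conversion reduces to combining or masking the low 32 bits. The
Python sketch below illustrates this; the function names and the aperture base
value are made up for the example, and NULL handling is deliberately omitted
(the NULL values differ between memory spaces, so a complete conversion would
also have to map NULL to NULL).

```python
APERTURE_BITS = 32  # 64-bit address mode: 2^32-byte, 2^32-aligned apertures
APERTURE_MASK = (1 << APERTURE_BITS) - 1


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside the given aperture."""
    if aperture_base & APERTURE_MASK:
        raise ValueError("aperture base must be 2^32 aligned")
    if not 0 <= segment_address <= APERTURE_MASK:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def in_aperture(aperture_base, flat_address):
    """A flat address is in the aperture if the high 32 bits match the base."""
    return (flat_address >> APERTURE_BITS) == (aperture_base >> APERTURE_BITS)


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & APERTURE_MASK


SHARED_BASE = 0x0001000000000000  # hypothetical aperture base
flat = segment_to_flat(SHARED_BASE, 0x1234)
```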
|
|
|
Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA ``query_sampler`` operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github]_.

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
:ref:`amdgpu-os`) and subject to change between releases. For example, see
[AMD-ROCm-github]_. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Code Object V3 Kernel Descriptor
++++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.

The fields used by CP for code objects before V3 also match those specified in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
|
|
.. table:: Code Object V3 Kernel Descriptor
|
|
:name: amdgpu-amdhsa-kernel-descriptor-v3-table
|
|
|
|
======= ======= =============================== ============================
|
|
Bits Size Field Name Description
|
|
======= ======= =============================== ============================
|
|
31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
|
|
address space memory
|
|
required for a work-group
|
|
in bytes. This does not
|
|
include any dynamically
|
|
allocated local address
|
|
space memory that may be
|
|
added when the kernel is
|
|
dispatched.
|
|
63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
|
|
private address space
|
|
memory required for a
|
|
work-item in bytes.
|
|
Additional space may need to
|
|
be added to this value if
|
|
the call stack has
|
|
non-inlined function calls.
|
|
95:64 4 bytes KERNARG_SIZE The size of the kernarg
|
|
memory pointed to by the
|
|
AQL dispatch packet. The
|
|
kernarg memory is used to
|
|
pass arguments to the
|
|
kernel.
|
|
|
|
* If the kernarg pointer in
|
|
the dispatch packet is NULL
|
|
then there are no kernel
|
|
arguments.
|
|
* If the kernarg pointer in
|
|
the dispatch packet is
|
|
not NULL and this value
|
|
is 0 then the kernarg
|
|
memory size is
|
|
unspecified.
|
|
* If the kernarg pointer in
|
|
the dispatch packet is
|
|
not NULL and this value
|
|
is not 0 then the value
|
|
specifies the kernarg
|
|
memory size in bytes. It
|
|
is recommended to provide
|
|
a value as it may be used
|
|
by CP to optimize making
|
|
the kernarg memory
|
|
visible to the kernel
|
|
code.
|
|
|
|
127:96 4 bytes Reserved, must be 0.
|
|
191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
|
|
negative) from base
|
|
address of kernel
|
|
descriptor to kernel's
|
|
entry point instruction
|
|
which must be 256 byte
|
|
aligned.
|
|
351:192 20 Reserved, must be 0.
|
|
bytes
|
|
383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
|
|
Reserved, must be 0.
|
|
GFX90A
|
|
Compute Shader (CS)
|
|
program settings used by
|
|
CP to set up
|
|
``COMPUTE_PGM_RSRC3``
|
|
configuration
|
|
register. See
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
|
|
GFX10
|
|
Compute Shader (CS)
|
|
program settings used by
|
|
CP to set up
|
|
``COMPUTE_PGM_RSRC3``
|
|
configuration
|
|
register. See
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
|
|
415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
|
|
program settings used by
|
|
CP to set up
|
|
``COMPUTE_PGM_RSRC1``
|
|
configuration
|
|
register. See
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
|
|
program settings used by
|
|
CP to set up
|
|
``COMPUTE_PGM_RSRC2``
|
|
configuration
|
|
register. See
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
454:448 7 bits *See separate bits below.* Enable the setup of the
|
|
SGPR user data registers
|
|
(see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
The total number of SGPR
|
|
user data registers
|
|
requested must not exceed
|
|
16 and must match the value in
|
|
``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
|
|
Any requests beyond 16
|
|
will be ignored.
|
|
>448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
|
|
_BUFFER column of
|
|
:ref:`amdgpu-processor-table`
|
|
specifies *Architected flat
|
|
scratch* then not supported
|
|
and must be 0.
|
|
>449 1 bit ENABLE_SGPR_DISPATCH_PTR
|
|
>450 1 bit ENABLE_SGPR_QUEUE_PTR
|
|
>451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
|
|
>452 1 bit ENABLE_SGPR_DISPATCH_ID
|
|
>453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
|
|
column of
|
|
:ref:`amdgpu-processor-table`
|
|
specifies *Architected flat
|
|
scratch* then not supported
|
|
and must be 0.
|
|
>454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
|
|
_SIZE
|
|
457:455 3 bits Reserved, must be 0.
|
|
458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
|
|
Reserved, must be 0.
|
|
GFX10
|
|
- If 0 execute in
|
|
wavefront size 64 mode.
|
|
- If 1 execute in
|
|
native wavefront size
|
|
32 mode.
|
|
463:459 5 bits Reserved, must be 0.
|
|
464 1 bit RESERVED_464 Deprecated, must be 0.
|
|
467:465 3 bits Reserved, must be 0.
|
|
468 1 bit RESERVED_468 Deprecated, must be 0.
|
|
471:469 3 bits Reserved, must be 0.
|
|
511:472 5 bytes Reserved, must be 0.
|
|
512 **Total size 64 bytes.**
|
|
======= ====================================================================
|
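The 64-byte layout in the table above can be mirrored with Python's ``struct``
module. This is a minimal sketch for illustration, not the encoder used by
LLVM: the function name ``pack_kernel_descriptor`` and the dictionary of bit
positions are assumptions, the 20 reserved bytes are taken to follow the 8-byte
entry offset directly, and all reserved and deprecated bits are written as 0.

```python
import struct

# Little-endian, no padding: three 4-byte sizes, 4 reserved bytes, signed
# 8-byte entry offset, 20 reserved bytes, RSRC3, RSRC1, RSRC2, 16 bits that
# hold the enable flags (bits 463:448), then 6 reserved/deprecated bytes.
KERNEL_DESCRIPTOR = struct.Struct("<III4xq20xIIIH6x")

# Bit positions relative to bit 448 of the kernel descriptor.
ENABLE_BITS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
    "ENABLE_WAVEFRONT_SIZE32": 10,
}


def pack_kernel_descriptor(group_size, private_size, kernarg_size,
                           entry_offset, rsrc3, rsrc1, rsrc2, enables=()):
    """Pack the fields into a 64-byte kernel descriptor image."""
    flags = 0
    for name in enables:
        flags |= 1 << ENABLE_BITS[name]
    return KERNEL_DESCRIPTOR.pack(group_size, private_size, kernarg_size,
                                  entry_offset, rsrc3, rsrc1, rsrc2, flags)


kd = pack_kernel_descriptor(
    0, 0, 16, 256, 0, 0, 0,
    enables=["ENABLE_SGPR_DISPATCH_PTR", "ENABLE_SGPR_KERNARG_SEGMENT_PTR"])
```

In practice the assembler produces the descriptor from the ``.amdhsa_kernel``
directive, so a sketch like this is mainly useful for inspecting or testing
descriptor images.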
|
|
|
..
|
|
|
|
.. table:: compute_pgm_rsrc1 for GFX6-GFX10
|
|
:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
|
|
|
|
======= ======= =============================== ===========================================================================
|
|
Bits Size Field Name Description
|
|
======= ======= =============================== ===========================================================================
|
|
5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
|
|
blocks used by each work-item;
|
|
granularity is device
|
|
specific:
|
|
|
|
GFX6-GFX9
|
|
- vgprs_used 0..256
|
|
- max(0, ceil(vgprs_used / 4) - 1)
|
|
GFX90A
|
|
- vgprs_used 0..512
|
|
- vgprs_used = align(arch_vgprs, 4)
|
|
+ acc_vgprs
|
|
- max(0, ceil(vgprs_used / 8) - 1)
|
|
GFX10 (wavefront size 64)
|
|
- max_vgpr 1..256
|
|
- max(0, ceil(vgprs_used / 4) - 1)
|
|
GFX10 (wavefront size 32)
|
|
- max_vgpr 1..256
|
|
- max(0, ceil(vgprs_used / 8) - 1)
|
|
|
|
Where vgprs_used is defined
|
|
as the highest VGPR number
|
|
explicitly referenced plus
|
|
one.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.VGPRS``.
|
|
|
|
The
|
|
:ref:`amdgpu-assembler`
|
|
calculates this
|
|
automatically for the
|
|
selected processor from
|
|
values provided to the
|
|
`.amdhsa_kernel` directive
|
|
by the
|
|
`.amdhsa_next_free_vgpr`
|
|
nested directive (see
|
|
:ref:`amdhsa-kernel-directives-table`).
|
|
9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
|
|
blocks used by a wavefront;
|
|
granularity is device
|
|
specific:
|
|
|
|
GFX6-GFX8
|
|
- sgprs_used 0..112
|
|
- max(0, ceil(sgprs_used / 8) - 1)
|
|
GFX9
|
|
- sgprs_used 0..112
|
|
- 2 * max(0, ceil(sgprs_used / 16) - 1)
|
|
GFX10
|
|
Reserved, must be 0.
|
|
(128 SGPRs always
|
|
allocated.)
|
|
|
|
Where sgprs_used is
|
|
defined as the highest
|
|
SGPR number explicitly
|
|
referenced plus one, plus
|
|
a target specific number
|
|
of additional special
|
|
SGPRs for VCC,
|
|
FLAT_SCRATCH (GFX7+) and
|
|
XNACK_MASK (GFX8+), and
|
|
any additional
|
|
target specific
|
|
limitations. It does not
|
|
include the 16 SGPRs added
|
|
if a trap handler is
|
|
enabled.
|
|
|
|
The target specific
|
|
limitations and special
|
|
SGPR layout are defined in
|
|
the hardware
|
|
documentation, which can
|
|
be found in the
|
|
:ref:`amdgpu-processors`
|
|
table.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.SGPRS``.
|
|
|
|
The
|
|
:ref:`amdgpu-assembler`
|
|
calculates this
|
|
automatically for the
|
|
selected processor from
|
|
values provided to the
|
|
`.amdhsa_kernel` directive
|
|
by the
|
|
`.amdhsa_next_free_sgpr`
|
|
and `.amdhsa_reserve_*`
|
|
nested directives (see
|
|
:ref:`amdhsa-kernel-directives-table`).
|
|
11:10 2 bits PRIORITY Must be 0.
|
|
|
|
Start executing wavefront
|
|
at the specified priority.
|
|
|
|
CP is responsible for
|
|
filling in
|
|
``COMPUTE_PGM_RSRC1.PRIORITY``.
|
|
13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
|
|
with specified rounding
|
|
mode for single (32
|
|
bit) precision floating point
|
|
operations.
|
|
|
|
Floating point rounding
|
|
mode values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
|
|
15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
|
|
with specified rounding
|
|
mode for half/double (16
|
|
and 64-bit) precision floating point
|
|
operations.
|
|
|
|
Floating point rounding
|
|
mode values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
|
|
17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
|
|
with specified denorm mode
|
|
for single (32
|
|
bit) precision floating point
|
|
operations.
|
|
|
|
Floating point denorm mode
|
|
values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
|
|
19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
|
|
with specified denorm mode
|
|
for half/double (16
|
|
and 64-bit) precision floating point
|
|
operations.
|
|
|
|
Floating point denorm mode
|
|
values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
|
|
20 1 bit PRIV Must be 0.
|
|
|
|
Start executing wavefront
|
|
in privilege trap handler
|
|
mode.
|
|
|
|
CP is responsible for
|
|
filling in
|
|
``COMPUTE_PGM_RSRC1.PRIV``.
|
|
21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
|
|
with DX10 clamp mode
|
|
enabled. Used by the vector
|
|
ALU to force DX10 style
|
|
treatment of NaNs (when
|
|
set, clamp NaN to zero,
|
|
otherwise pass NaN
|
|
through).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
|
|
22 1 bit DEBUG_MODE Must be 0.
|
|
|
|
Start executing wavefront
|
|
in single step mode.
|
|
|
|
CP is responsible for
|
|
filling in
|
|
``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
|
|
23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
|
|
with IEEE mode
|
|
enabled. Floating point
|
|
opcodes that support
|
|
exception flag gathering
|
|
will quiet and propagate
|
|
signaling-NaN inputs per
|
|
IEEE 754-2008. Min_dx10 and
|
|
max_dx10 become IEEE
|
|
754-2008 compliant due to
|
|
signaling-NaN propagation
|
|
and quieting.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.IEEE_MODE``.
|
|
24 1 bit BULKY Must be 0.
|
|
|
|
Only one work-group allowed
|
|
to execute on a compute
|
|
unit.
|
|
|
|
CP is responsible for
|
|
filling in
|
|
``COMPUTE_PGM_RSRC1.BULKY``.
|
|
25 1 bit CDBG_USER Must be 0.
|
|
|
|
Flag that can be used to
|
|
control debugging code.
|
|
|
|
CP is responsible for
|
|
filling in
|
|
``COMPUTE_PGM_RSRC1.CDBG_USER``.
|
|
26 1 bit FP16_OVFL GFX6-GFX8
|
|
Reserved, must be 0.
|
|
GFX9-GFX10
|
|
Wavefront starts execution
|
|
with specified fp16 overflow
|
|
mode.
|
|
|
|
- If 0, fp16 overflow generates
|
|
+/-INF values.
|
|
- If 1, fp16 overflow that is the
|
|
result of an +/-INF input value
|
|
or divide by 0 produces a +/-INF,
|
|
otherwise clamps computed
|
|
overflow to +/-MAX_FP16 as
|
|
appropriate.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FP16_OVFL``.
|
|
28:27 2 bits Reserved, must be 0.
|
|
29 1 bit WGP_MODE GFX6-GFX9
|
|
Reserved, must be 0.
|
|
GFX10
|
|
- If 0 execute work-groups in
|
|
CU wavefront execution mode.
|
|
- If 1 execute work-groups
|
|
in WGP wavefront execution mode.
|
|
|
|
See :ref:`amdgpu-amdhsa-memory-model`.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.WGP_MODE``.
|
|
30 1 bit MEM_ORDERED GFX6-GFX9
|
|
Reserved, must be 0.
|
|
GFX10
|
|
Controls the behavior of the
|
|
s_waitcnt's vmcnt and vscnt
|
|
counters.
|
|
|
|
- If 0 vmcnt reports completion
|
|
of load and atomic with return
|
|
out of order with sample
|
|
instructions, and the vscnt
|
|
reports the completion of
|
|
store and atomic without
|
|
return in order.
|
|
- If 1 vmcnt reports completion
|
|
of load, atomic with return
|
|
and sample instructions in
|
|
order, and the vscnt reports
|
|
the completion of store and
|
|
atomic without return in order.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
|
|
31 1 bit FWD_PROGRESS GFX6-GFX9
|
|
Reserved, must be 0.
|
|
GFX10
|
|
- If 0 execute SIMD wavefronts
|
|
using oldest first policy.
|
|
- If 1 execute SIMD wavefronts to
|
|
ensure wavefronts will make some
|
|
forward progress.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
|
|
32 **Total size 4 bytes**
|
|
======= ===================================================================================================================
|
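The granulated register counts in the table above follow directly from the
listed formulas. The Python sketch below evaluates them and places the results
in bits 5:0 and 9:6 of ``compute_pgm_rsrc1``. The function names are
hypothetical and the input values are made-up examples; normally the assembler
derives these fields from ``.amdhsa_next_free_vgpr`` and
``.amdhsa_next_free_sgpr``.

```python
def ceil_div(value, granule):
    """Integer ceiling division without going through floating point."""
    return -(-value // granule)


def granulated_vgpr_count(vgprs_used, granule):
    """granule: 4 for GFX6-GFX9 and GFX10 wave64, 8 for GFX90A and GFX10 wave32."""
    return max(0, ceil_div(vgprs_used, granule) - 1)


def granulated_sgpr_count_gfx6_gfx8(sgprs_used):
    return max(0, ceil_div(sgprs_used, 8) - 1)


def granulated_sgpr_count_gfx9(sgprs_used):
    return 2 * max(0, ceil_div(sgprs_used, 16) - 1)


def rsrc1_register_fields(vgpr_field, sgpr_field):
    """Place the two fields in bits 5:0 and 9:6; other bits are left as 0."""
    if not 0 <= vgpr_field < (1 << 6) or not 0 <= sgpr_field < (1 << 4):
        raise ValueError("field value out of range")
    return vgpr_field | (sgpr_field << 6)


# Example for a GFX9 kernel that uses 24 VGPRs and 40 SGPRs.
vgpr_field = granulated_vgpr_count(24, 4)
sgpr_field = granulated_sgpr_count_gfx9(40)
rsrc1 = rsrc1_register_fields(vgpr_field, sgpr_field)
```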
|
|
|
..
|
|
|
|
.. table:: compute_pgm_rsrc2 for GFX6-GFX10
|
|
:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
|
|
|
|
======= ======= =============================== ===========================================================================
|
|
Bits Size Field Name Description
|
|
======= ======= =============================== ===========================================================================
|
|
0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
|
|
private segment.
|
|
* If the *Target Properties*
|
|
column of
|
|
:ref:`amdgpu-processor-table`
|
|
does not specify
|
|
*Architected flat
|
|
scratch* then enable the
|
|
setup of the SGPR
|
|
wavefront scratch offset
|
|
system register (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
* If the *Target Properties*
|
|
column of
|
|
:ref:`amdgpu-processor-table`
|
|
specifies *Architected
|
|
flat scratch* then enable
|
|
the setup of the
|
|
FLAT_SCRATCH register
|
|
pair (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
|
|
5:1 5 bits USER_SGPR_COUNT The total number of SGPR
|
|
user data
|
|
registers requested. This
|
|
number must be greater than
|
|
or equal to the number of user
|
|
data registers enabled.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.USER_SGPR``.
|
|
6 1 bit ENABLE_TRAP_HANDLER Must be 0.
|
|
|
|
This bit represents
|
|
``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
|
|
which is set by the CP if
|
|
the runtime has installed a
|
|
trap handler.
|
|
7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
|
|
system SGPR register for
|
|
the work-group id in the X
|
|
dimension (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.TGID_X_EN``.
|
|
8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
|
|
system SGPR register for
|
|
the work-group id in the Y
|
|
dimension (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
|
|
9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
|
|
system SGPR register for
|
|
the work-group id in the Z
|
|
dimension (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
|
|
10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
|
|
system SGPR register for
|
|
work-group information (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
|
|
12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
|
|
VGPR system registers used
|
|
for the work-item ID.
|
|
:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
|
|
defines the values.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
|
|
13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
|
|
|
|
Wavefront starts execution
|
|
with address watch
|
|
exceptions enabled which
|
|
are generated when L1 has
|
|
witnessed a thread access
|
|
an *address of
|
|
interest*.
|
|
|
|
CP is responsible for
|
|
filling in the address
|
|
watch bit in
|
|
``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
|
|
according to what the
|
|
runtime requests.
|
|
14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
|
|
|
|
Wavefront starts execution
|
|
with memory violation
|
|
                                                     exceptions
|
|
enabled which are generated
|
|
when a memory violation has
|
|
occurred for this wavefront from
|
|
L1 or LDS
|
|
(write-to-read-only-memory,
|
|
mis-aligned atomic, LDS
|
|
address out of range,
|
|
illegal address, etc.).
|
|
|
|
CP sets the memory
|
|
violation bit in
|
|
``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
|
|
according to what the
|
|
runtime requests.
|
|
23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
|
|
|
|
CP uses the rounded value
|
|
from the dispatch packet,
|
|
not this value, as the
|
|
dispatch may contain
|
|
dynamically allocated group
|
|
segment memory. CP writes
|
|
directly to
|
|
``COMPUTE_PGM_RSRC2.LDS_SIZE``.
|
|
|
|
Amount of group segment
|
|
(LDS) to allocate for each
|
|
work-group. Granularity is
|
|
device specific:
|
|
|
|
GFX6
|
|
roundup(lds-size / (64 * 4))
|
|
GFX7-GFX10
|
|
roundup(lds-size / (128 * 4))
|
|
|
|
24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
|
|
_INVALID_OPERATION with specified exceptions
|
|
enabled.
|
|
|
|
Used by CP to set up
|
|
``COMPUTE_PGM_RSRC2.EXCP_EN``
|
|
(set from bits 0..6).
|
|
|
|
IEEE 754 FP Invalid
|
|
Operation
|
|
25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
|
|
_SOURCE input operands is a
|
|
denormal number
|
|
26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
|
|
_DIVISION_BY_ZERO Zero
|
|
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
|
|
_OVERFLOW
|
|
28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
|
|
_UNDERFLOW
|
|
29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
|
|
_INEXACT
|
|
30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
|
|
_ZERO (rcp_iflag_f32 instruction
|
|
only)
|
|
31 1 bit Reserved, must be 0.
|
|
32 **Total size 4 bytes.**
|
|
======= ===================================================================================================================
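
The bit positions in the table above can be illustrated with a small packing
helper. This is only a sketch: the function and parameter names are
hypothetical and are not part of LLVM or the runtime. Fields that the table
says must be 0 (trap handler, address watch, memory exceptions and
``GRANULATED_LDS_SIZE``) are left at 0, and the exception enable bits are not
modeled. ``granulated_lds_size`` only evaluates the ``roundup`` formulas
listed for ``GRANULATED_LDS_SIZE``; the resulting value is not written into
the kernel descriptor.

.. code-block:: python

   def granulated_lds_size(lds_size_bytes, gfx_level):
       """Evaluate roundup(lds-size / granule) with the table's granularities."""
       if gfx_level == 6:
           granule = 64 * 4
       elif 7 <= gfx_level <= 10:
           granule = 128 * 4
       else:
           raise ValueError("only GFX6-GFX10 are covered by this table")
       return -(-lds_size_bytes // granule)  # ceiling division


   def pack_compute_pgm_rsrc2(enable_private_segment=False, user_sgpr_count=0,
                              workgroup_id_x=False, workgroup_id_y=False,
                              workgroup_id_z=False, workgroup_info=False,
                              vgpr_workitem_id=0):
       """Pack a subset of compute_pgm_rsrc2 fields into a 32-bit word."""
       if not 0 <= user_sgpr_count <= 31:
           raise ValueError("USER_SGPR_COUNT is a 5-bit field")
       if not 0 <= vgpr_workitem_id <= 2:
           raise ValueError("ENABLE_VGPR_WORKITEM_ID value 3 is undefined")
       word = int(enable_private_segment)   # bit 0
       word |= user_sgpr_count << 1         # bits 5:1
       word |= int(workgroup_id_x) << 7     # bit 7
       word |= int(workgroup_id_y) << 8     # bit 8
       word |= int(workgroup_id_z) << 9     # bit 9
       word |= int(workgroup_info) << 10    # bit 10
       word |= vgpr_workitem_id << 11       # bits 12:11
       return word


   rsrc2 = pack_compute_pgm_rsrc2(enable_private_segment=True,
                                  user_sgpr_count=6,
                                  workgroup_id_x=True,
                                  vgpr_workitem_id=2)
   print(hex(rsrc2))
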
|
|
|
|
..
|
|
|
|
.. table:: compute_pgm_rsrc3 for GFX90A
|
|
:name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
|
|
|
|
======= ======= =============================== ===========================================================================
|
|
Bits Size Field Name Description
|
|
======= ======= =============================== ===========================================================================
|
|
     5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
|
|
Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
|
|
63 - accum-offset = 256.
|
|
     15:6    10                                      Reserved, must be 0.
|
|
bits
|
|
16 1 bit TG_SPLIT - If 0 the waves of a work-group are
|
|
launched in the same CU.
|
|
- If 1 the waves of a work-group can be
|
|
launched in different CUs. The waves
|
|
cannot use S_BARRIER or LDS.
|
|
     31:17   15                                      Reserved, must be 0.
|
|
bits
|
|
32 **Total size 4 bytes.**
|
|
======= ===================================================================================================================
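
The ``ACCUM_OFFSET`` encoding above (field value 0 means an offset of 4,
value 63 means an offset of 256) can be written as a pair of conversions. The
helper names below are hypothetical and serve only to illustrate the
arithmetic.

.. code-block:: python

   def encode_accum_offset(accum_offset):
       """Convert an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
       if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
           raise ValueError("accum-offset must be a multiple of 4 in [4, 256]")
       return accum_offset // 4 - 1


   def decode_accum_offset(field_value):
       """Convert the 6-bit field value (0-63) back to the AccVGPR offset."""
       if not 0 <= field_value <= 63:
           raise ValueError("ACCUM_OFFSET is a 6-bit field")
       return (field_value + 1) * 4


   print(encode_accum_offset(8), decode_accum_offset(63))
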
|
|
|
|
..
|
|
|
|
.. table:: compute_pgm_rsrc3 for GFX10
|
|
:name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
|
|
|
|
======= ======= =============================== ===========================================================================
|
|
Bits Size Field Name Description
|
|
======= ======= =============================== ===========================================================================
|
|
3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
|
|
compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
|
|
31:4 28 Reserved, must be 0.
|
|
bits
|
|
32 **Total size 4 bytes.**
|
|
======= ===================================================================================================================
|
|
|
|
..
|
|
|
|
.. table:: Floating Point Rounding Mode Enumeration Values
|
|
:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
|
|
|
|
====================================== ===== ==============================
|
|
Enumeration Name Value Description
|
|
====================================== ===== ==============================
|
|
FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
|
|
FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
|
|
FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
|
|
FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
|
|
====================================== ===== ==============================
|
|
|
|
..
|
|
|
|
.. table:: Floating Point Denorm Mode Enumeration Values
|
|
:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
|
|
|
|
====================================== ===== ==============================
|
|
Enumeration Name Value Description
|
|
====================================== ===== ==============================
|
|
FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
|
|
Denorms
|
|
FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
|
|
FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
|
|
FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
|
|
====================================== ===== ==============================
|
|
|
|
..
|
|
|
|
.. table:: System VGPR Work-Item ID Enumeration Values
|
|
:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
|
|
|
|
======================================== ===== ============================
|
|
Enumeration Name Value Description
|
|
======================================== ===== ============================
|
|
SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
|
|
ID.
|
|
SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
|
|
dimensions ID.
|
|
SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
|
|
dimensions ID.
|
|
SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
|
|
======================================== ===== ============================
|
|
|
|
.. _amdgpu-amdhsa-initial-kernel-execution-state:
|
|
|
|
Initial Kernel Execution State
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
This section defines the register state that will be set up by the packet
|
|
processor prior to the start of execution of every wavefront. This is limited by
|
|
the constraints of the hardware controllers of CP/ADC/SPI.
|
|
|
|
The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
|
|
|
|
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
|
|
|
|
SGPR register initial state is defined in
|
|
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
|
|
|
.. table:: SGPR Register Set Up Order
|
|
:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
|
|
|
|
========== ========================== ====== ==============================
|
|
SGPR Order Name Number Description
|
|
(kernel descriptor enable of
|
|
field) SGPRs
|
|
========== ========================== ====== ==============================
|
|
First Private Segment Buffer 4 See
|
|
(enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
|
|
_segment_buffer)
|
|
then Dispatch Ptr 2 64-bit address of AQL dispatch
|
|
(enable_sgpr_dispatch_ptr) packet for kernel dispatch
|
|
actually executing.
|
|
then Queue Ptr 2 64-bit address of amd_queue_t
|
|
(enable_sgpr_queue_ptr) object for AQL queue on which
|
|
the dispatch packet was
|
|
queued.
|
|
then Kernarg Segment Ptr 2 64-bit address of Kernarg
|
|
(enable_sgpr_kernarg segment. This is directly
|
|
_segment_ptr) copied from the
|
|
kernarg_address in the kernel
|
|
dispatch packet.
|
|
|
|
Having CP load it once avoids
|
|
loading it at the beginning of
|
|
every wavefront.
|
|
then Dispatch Id 2 64-bit Dispatch ID of the
|
|
(enable_sgpr_dispatch_id) dispatch packet being
|
|
executed.
|
|
then Flat Scratch Init 2 See
|
|
(enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
|
|
_init)
|
|
then Private Segment Size 1 The 32-bit byte size of a
|
|
(enable_sgpr_private single work-item's memory
|
|
_segment_size) allocation. This is the
|
|
value from the kernel
|
|
dispatch packet Private
|
|
Segment Byte Size rounded up
|
|
by CP to a multiple of
|
|
DWORD.
|
|
|
|
Having CP load it once avoids
|
|
loading it at the beginning of
|
|
every wavefront.
|
|
|
|
This is not used for
|
|
GFX7-GFX8 since it is the same
|
|
value as the second SGPR of
|
|
Flat Scratch Init. However, it
|
|
may be needed for GFX9-GFX10 which
|
|
changes the meaning of the
|
|
Flat Scratch Init value.
|
|
then Work-Group Id X 1 32-bit work-group id in X
|
|
(enable_sgpr_workgroup_id dimension of grid for
|
|
_X) wavefront.
|
|
then Work-Group Id Y 1 32-bit work-group id in Y
|
|
(enable_sgpr_workgroup_id dimension of grid for
|
|
_Y) wavefront.
|
|
then Work-Group Id Z 1 32-bit work-group id in Z
|
|
(enable_sgpr_workgroup_id dimension of grid for
|
|
_Z) wavefront.
|
|
then Work-Group Info 1 {first_wavefront, 14'b0000,
|
|
(enable_sgpr_workgroup ordered_append_term[10:0],
|
|
_info) threadgroup_size_in_wavefronts[5:0]}
|
|
then Scratch Wavefront Offset 1 See
|
|
(enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
|
|
_segment_wavefront_offset) and
|
|
:ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
|
|
========== ========================== ====== ==============================
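
The dense numbering described above can be sketched as follows. The list
mirrors the order and SGPR counts of the table; the short field names and the
function name are hypothetical labels chosen for this example, and the sketch
does not model the limit of 16 initialized User SGPRs.

.. code-block:: python

   # (name, number of SGPRs) in the set up order given by the table.
   SGPR_SETUP_ORDER = [
       ("private_segment_buffer", 4),
       ("dispatch_ptr", 2),
       ("queue_ptr", 2),
       ("kernarg_segment_ptr", 2),
       ("dispatch_id", 2),
       ("flat_scratch_init", 2),
       ("private_segment_size", 1),
       ("workgroup_id_x", 1),
       ("workgroup_id_y", 1),
       ("workgroup_id_z", 1),
       ("workgroup_info", 1),
       ("private_segment_wavefront_offset", 1),
   ]


   def assign_initial_sgprs(enabled):
       """Map each enabled field to its (first, last) SGPR number.

       Enabled registers are numbered densely from SGPR0; disabled
       registers receive no number.
       """
       assignment = {}
       next_sgpr = 0
       for name, count in SGPR_SETUP_ORDER:
           if name in enabled:
               assignment[name] = (next_sgpr, next_sgpr + count - 1)
               next_sgpr += count
       return assignment


   layout = assign_initial_sgprs(
       {"private_segment_buffer", "kernarg_segment_ptr", "workgroup_id_x"})
   for name, (first, last) in layout.items():
       print(f"{name}: SGPR{first}-SGPR{last}")

With only these three fields enabled, the Kernarg Segment Ptr directly follows
the Private Segment Buffer because the Dispatch Ptr and Queue Ptr are disabled.
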
|
|
|
|
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
|
|
|
|
There are different methods used for the VGPR initial state:
|
|
|
|
* Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies otherwise, a separate VGPR register is used per work-item ID. The
|
|
VGPR register initial state for this method is defined in
|
|
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
|
|
* If *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
|
|
for all work-item IDs. The register layout for this method is defined in
|
|
:ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
|
|
|
|
.. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
|
|
:name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
|
|
|
|
========== ========================== ====== ==============================
|
|
VGPR Order Name Number Description
|
|
(kernel descriptor enable of
|
|
field) VGPRs
|
|
========== ========================== ====== ==============================
|
|
First Work-Item Id X 1 32-bit work-item id in X
|
|
(Always initialized) dimension of work-group for
|
|
wavefront lane.
|
|
then Work-Item Id Y 1 32-bit work-item id in Y
|
|
(enable_vgpr_workitem_id dimension of work-group for
|
|
> 0) wavefront lane.
|
|
then Work-Item Id Z 1 32-bit work-item id in Z
|
|
(enable_vgpr_workitem_id dimension of work-group for
|
|
> 1) wavefront lane.
|
|
========== ========================== ====== ==============================
|
|
|
|
..
|
|
|
|
.. table:: Register Layout for Packed Work-Item ID Method
|
|
:name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
|
|
|
|
======= ======= ================ =========================================
|
|
Bits Size Field Name Description
|
|
======= ======= ================ =========================================
|
|
0:9 10 bits Work-Item Id X Work-item id in X
|
|
dimension of work-group for
|
|
wavefront lane.
|
|
|
|
Always initialized.
|
|
|
|
10:19 10 bits Work-Item Id Y Work-item id in Y
|
|
dimension of work-group for
|
|
wavefront lane.
|
|
|
|
Initialized if enable_vgpr_workitem_id >
|
|
0, otherwise set to 0.
|
|
20:29 10 bits Work-Item Id Z Work-item id in Z
|
|
dimension of work-group for
|
|
wavefront lane.
|
|
|
|
Initialized if enable_vgpr_workitem_id >
|
|
1, otherwise set to 0.
|
|
30:31 2 bits Reserved, set to 0.
|
|
======= ======= ================ =========================================
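
For the packed method, the three work-item IDs are 10-bit fields of VGPR0. A
minimal sketch of extracting them from a 32-bit value (the function name is
hypothetical):

.. code-block:: python

   def unpack_work_item_ids(vgpr0):
       """Split a packed VGPR0 value into (x, y, z) work-item IDs."""
       mask = 0x3FF                # each ID is 10 bits wide
       x = vgpr0 & mask            # bits 0:9
       y = (vgpr0 >> 10) & mask    # bits 10:19
       z = (vgpr0 >> 20) & mask    # bits 20:29
       return x, y, z


   packed = 5 | (7 << 10) | (3 << 20)
   print(unpack_work_item_ids(packed))
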
|
|
|
|
The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
|
|
|
|
1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
|
|
registers.
|
|
2. Work-group Id registers X, Y, Z are set by ADC which supports any
|
|
combination including none.
|
|
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is
   why its value cannot be included with the flat scratch init value, which is
   per queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
|
|
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
|
|
or (X, Y, Z).
|
|
5. Flat Scratch register pair initialization is described in
|
|
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
|
|
|
|
The global segment can be accessed either using buffer instructions (GFX6 which
|
|
has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
|
|
instructions (GFX9-GFX10).
|
|
|
|
If buffer operations are used, then the compiler can generate a V# with the
|
|
following properties:
|
|
|
|
* base address of 0
|
|
* no swizzle
|
|
* ATC: 1 if IOMMU present (such as APU)
|
|
* ptr64: 1
|
|
* MTYPE set to support memory coherence that matches the runtime (such as CC for
|
|
APU and NC for dGPU).
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog:
|
|
|
|
Kernel Prolog
|
|
~~~~~~~~~~~~~
|
|
|
|
The compiler performs initialization in the kernel prologue depending on the
|
|
target and information about things like stack usage in the kernel and called
|
|
functions. Some of this initialization requires the compiler to request certain
|
|
User and System SGPRs be present in the
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor`.
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-cfi:
|
|
|
|
CFI
|
|
+++
|
|
|
|
1. The CFI return address is undefined.
|
|
|
|
2. The CFI CFA is defined using an expression which evaluates to a location
|
|
description that comprises one memory location description for the
|
|
``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-m0:
|
|
|
|
M0
|
|
++
|
|
|
|
GFX6-GFX8
|
|
  The M0 register must be initialized with a value of at least the total LDS
  size if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in the dispatch packet. Alternatively, M0 can be set to the maximum
  possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF for
  GFX7-GFX8).
|
|
GFX9-GFX10
|
|
The M0 register is not used for range checking LDS accesses and so does not
|
|
need to be initialized in the prolog.
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
|
|
|
|
Stack Pointer
|
|
+++++++++++++
|
|
|
|
If the kernel has function calls it must set up the ABI stack pointer described
|
|
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
|
|
SGPR32 to the unswizzled scratch offset of the address past the last local
|
|
allocation.
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
|
|
|
|
Frame Pointer
|
|
+++++++++++++
|
|
|
|
If the kernel needs a frame pointer for the reasons defined in
|
|
``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
|
|
kernel prolog. If a frame pointer is not required then all uses of the frame
|
|
pointer are replaced with immediate ``0`` offsets.
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
|
|
|
|
Flat Scratch
|
|
++++++++++++
|
|
|
|
There are different methods used for initializing flat scratch:
|
|
|
|
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies *Does not support generic address space*:
|
|
|
|
Flat scratch is not supported and there is no flat scratch register pair.
|
|
|
|
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies *Offset flat scratch*:
|
|
|
|
If the kernel or any function it calls may use flat operations to access
|
|
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
|
|
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
|
|
Scratch Wavefront Offset SGPR registers (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
|
|
|
|
1. The low word of Flat Scratch Init is the 32-bit byte offset from
|
|
``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
|
|
being managed by SPI for the queue executing the kernel dispatch. This is
|
|
the same value used in the Scratch Segment Buffer V# base address.
|
|
|
|
CP obtains this from the runtime. (The Scratch Segment Buffer base address
|
|
is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
|
|
|
|
The prolog must add the value of Scratch Wavefront Offset to get the
|
|
wavefront's byte scratch backing memory offset from
|
|
``SH_HIDDEN_PRIVATE_BASE_VIMID``.
|
|
|
|
The Scratch Wavefront Offset must also be used as an offset with Private
|
|
segment address when using the Scratch Segment Buffer.
|
|
|
|
     Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
     shifted by 8 before moving into FLAT_SCRATCH_HI.
|
|
|
|
FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
|
|
SGPRn is the highest numbered SGPR allocated to the wavefront).
|
|
FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
|
|
added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
|
|
FLAT SCRATCH BASE in flat memory instructions that access the scratch
|
|
aperture.
|
|
  2. The second word of Flat Scratch Init is the 32-bit byte size of a single
     work-item's scratch memory usage.
|
|
|
|
CP obtains this from the runtime, and it is always a multiple of DWORD. CP
|
|
checks that the value in the kernel dispatch packet Private Segment Byte
|
|
Size is not larger and requests the runtime to increase the queue's scratch
|
|
size if necessary.
|
|
|
|
CP directly loads from the kernel dispatch packet Private Segment Byte Size
|
|
field and rounds up to a multiple of DWORD. Having CP load it once avoids
|
|
loading it at the beginning of every wavefront.
|
|
|
|
The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
|
|
GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
|
|
in flat memory instructions.
|
|
|
|
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies *Absolute flat scratch*:
|
|
|
|
If the kernel or any function it calls may use flat operations to access
|
|
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
|
|
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
|
|
uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
|
|
|
|
The Flat Scratch Init is the 64-bit address of the base of scratch backing
|
|
memory being managed by SPI for the queue executing the kernel dispatch.
|
|
|
|
CP obtains this from the runtime.
|
|
|
|
The kernel prolog must add the value of the wave's Scratch Wavefront Offset
|
|
and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
|
|
which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
|
|
memory instructions.
|
|
|
|
The Scratch Wavefront Offset must also be used as an offset with Private
|
|
segment address when using the Scratch Segment Buffer (see
|
|
:ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
|
|
|
|
* If the *Target Properties* column of :ref:`amdgpu-processor-table`
|
|
specifies *Architected flat scratch*:
|
|
|
|
If ENABLE_PRIVATE_SEGMENT is enabled in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
|
|
register pair will be initialized to the 64-bit address of the base of scratch
|
|
backing memory being managed by SPI for the queue executing the kernel
|
|
dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
|
|
flat scratch base in flat memory instructions.
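
The prolog arithmetic for the *Offset flat scratch* and *Absolute flat
scratch* methods can be summarized with the sketch below. It models the SGPR
values as plain integers; the function and parameter names are hypothetical,
and the real prolog performs these steps with scalar ALU instructions rather
than in software.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF


   def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
       """Offset method: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

       init_lo is the queue's byte offset from the private base and
       init_hi is the per work-item scratch size in bytes.
       """
       wave_offset_bytes = (init_lo + scratch_wavefront_offset) & MASK32
       flat_scratch_hi = wave_offset_bytes >> 8  # units of 256 bytes
       flat_scratch_lo = init_hi                 # flat scratch size
       return flat_scratch_lo, flat_scratch_hi


   def absolute_flat_scratch(init_address, scratch_wavefront_offset):
       """Absolute method: return the 64-bit base as (low, high) words."""
       base = (init_address + scratch_wavefront_offset) & MASK64
       return base & MASK32, base >> 32


   print(offset_flat_scratch(0x10000, 64, 0x300))
   print(absolute_flat_scratch(0x10000F000, 0x2000))
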
|
|
|
|
.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
|
|
|
|
Private Segment Buffer
|
|
++++++++++++++++++++++
|
|
|
|
If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
|
|
*Architected flat scratch* then a Private Segment Buffer is not supported.
|
|
Instead the flat SCRATCH instructions are used.
|
|
|
|
Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
|
|
that are used as a V# to access scratch. CP uses the value provided by the
|
|
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
|
|
access the private memory space using a segment address. See
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
|
|
|
|
The scratch V# is a four-aligned SGPR and always selected for the kernel as
|
|
follows:
|
|
|
|
- If it is known during instruction selection that there is stack usage,
|
|
SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
|
|
optimizations are disabled (``-O0``), if stack objects already exist (for
|
|
locals, etc.), or if there are any function calls.
|
|
|
|
- Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
|
|
are reserved for the tentative scratch V#. These will be used if it is
|
|
determined that spilling is needed.
|
|
|
|
- If no use is made of the tentative scratch V#, then it is unreserved,
|
|
and the register count is determined ignoring it.
|
|
- If use is made of the tentative scratch V#, then its register numbers
|
|
are shifted to the first four-aligned SGPR index after the highest one
|
|
allocated by the register allocator, and all uses are updated. The
|
|
register count includes them in the shifted location.
|
|
  - In either case, if the processor has the SGPR allocation bug, the
    tentative allocation is not shifted or unreserved in order to ensure
    the register count is higher to work around the bug.
|
|
|
|
.. note::
|
|
|
|
This approach of using a tentative scratch V# and shifting the register
|
|
numbers if used avoids having to perform register allocation a second
|
|
time if the tentative V# is eliminated. This is more efficient and
|
|
avoids the problem that the second register allocation may perform
|
|
spilling which will fail as there is no longer a scratch V#.
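
The shifted location of a tentative scratch V# that ends up being used is the
first four-aligned SGPR index after the highest SGPR chosen by the register
allocator. A minimal sketch of that index computation (the function name is
hypothetical and is not the name used in the backend):

.. code-block:: python

   def shifted_scratch_vsharp_sgprs(highest_allocated_sgpr):
       """Return the four SGPR numbers of the shifted scratch V#."""
       # Move past the highest allocated SGPR, then round up to a multiple of 4.
       base = (highest_allocated_sgpr + 4) & ~3
       return [base, base + 1, base + 2, base + 3]


   print(shifted_scratch_vsharp_sgprs(9))
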
|
|
|
|
When the kernel prolog code is being emitted it is known whether the scratch V#
|
|
described above is actually used. If it is, the prolog code must set it up by
|
|
copying the Private Segment Buffer to the scratch V# registers and then adding
|
|
the Private Segment Wavefront Offset to the queue base address in the V#. The
|
|
result is a V# with a base address pointing to the beginning of the wavefront
|
|
scratch backing memory.
|
|
|
|
The Private Segment Buffer is always requested, but the Private Segment
|
|
Wavefront Offset is only requested if it is used (see
|
|
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
|
|
|
|
.. _amdgpu-amdhsa-memory-model:
|
|
|
|
Memory Model
|
|
~~~~~~~~~~~~
|
|
|
|
This section describes the mapping of the LLVM memory model onto AMDGPU machine
|
|
code (see :ref:`memmodel`).
|
|
|
|
The AMDGPU backend supports the memory synchronization scopes specified in
|
|
:ref:`amdgpu-memory-scopes`.
|
|
|
|
The code sequences used to implement the memory model specify the order of
|
|
instructions that a single thread must execute. The ``s_waitcnt`` and cache
|
|
management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
|
|
to other memory instructions executed by the same thread. This allows them to be
|
|
moved earlier or later which can allow them to be combined with other instances
|
|
of the same instruction, or hoisted/sunk out of loops to improve performance.
|
|
Only the instructions related to the memory model are given; additional
|
|
``s_waitcnt`` instructions are required to ensure registers are defined before
|
|
being used. These may be able to be combined with the memory model ``s_waitcnt``
|
|
instructions as described above.
|
|
|
|
The AMDGPU backend supports the following memory models:
|
|
|
|
HSA Memory Model [HSA]_
|
|
The HSA memory model uses a single happens-before relation for all address
|
|
spaces (see :ref:`amdgpu-address-spaces`).
|
|
OpenCL Memory Model [OpenCL]_
|
|
  The OpenCL memory model has separate happens-before relations for the
  global and local address spaces. Only a fence specifying both global and
  local address space, and seq_cst instructions join the relationships. Since
  the LLVM ``fence`` instruction does not allow an address space to be
  specified the OpenCL fence has to conservatively assume both local and
  global address space were specified. However, optimizations can often be
  done to eliminate the additional ``s_waitcnt`` instructions when there are
  no intervening memory instructions which access the corresponding address
  space. The code sequences in the table indicate what can be omitted for the
  OpenCL memory model. The target triple environment is used to determine if
  the source language is OpenCL (see :ref:`amdgpu-opencl`).
|
|
|
|
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
|
|
operations.
|
|
|
|
``buffer/global/flat_load/store/atomic`` instructions to global memory are
|
|
termed vector memory operations.
|
|
|
|
Private address space uses ``buffer_load/store`` using the scratch V#
|
|
(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
|
|
is accessing the memory, atomic memory orderings are not meaningful, and all
|
|
accesses are treated as non-atomic.
|
|
|
|
Constant address space uses ``buffer/global_load`` instructions (or equivalent
|
|
scalar memory instructions). Since the constant address space contents do not
|
|
change during the execution of a kernel dispatch it is not legal to perform
|
|
stores, and atomic memory orderings are not meaningful, and all accesses are
|
|
treated as non-atomic.
|
|
|
|
A memory synchronization scope wider than work-group is not meaningful for the
|
|
group (LDS) address space and is treated as work-group.
|
|
|
|
The memory model does not support the region address space which is treated as
|
|
non-atomic.
|
|
|
|
Acquire memory ordering is not meaningful on store atomic instructions and is
|
|
treated as non-atomic.
|
|
|
|
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
|
|
|
|
Acquire-release memory ordering is not meaningful on load or store atomic
|
|
instructions and is treated as acquire and release respectively.
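
The three rules above for orderings that are not meaningful on a load or store
atomic can be read as a small normalization step. The sketch below only
restates those rules; the function name and the string labels are hypothetical
and do not correspond to backend code.

.. code-block:: python

   def effective_ordering(operation, ordering):
       """Return how an ordering on a 'load' or 'store' atomic is treated."""
       if operation == "store" and ordering == "acquire":
           return "non-atomic"
       if operation == "load" and ordering == "release":
           return "non-atomic"
       if ordering == "acq_rel":
           if operation == "load":
               return "acquire"
           if operation == "store":
               return "release"
       return ordering  # every other combination is used as specified


   print(effective_ordering("load", "acq_rel"))
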
|
|
|
|
The memory order also adds the single thread optimization constraints defined in
|
|
table
|
|
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
|
|
|
|
.. table:: AMDHSA Memory Model Single Thread Optimization Constraints
|
|
:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
|
|
|
|
============ ==============================================================
|
|
LLVM Memory Optimization Constraints
|
|
Ordering
|
|
============ ==============================================================
|
|
unordered *none*
|
|
monotonic *none*
|
|
acquire - If a load atomic/atomicrmw then no following load/load
|
|
atomic/store/store atomic/atomicrmw/fence instruction can be
|
|
moved before the acquire.
|
|
- If a fence then same as load atomic, plus no preceding
|
|
associated fence-paired-atomic can be moved after the fence.
|
|
release - If a store atomic/atomicrmw then no preceding load/load
|
|
atomic/store/store atomic/atomicrmw/fence instruction can be
|
|
moved after the release.
|
|
- If a fence then same as store atomic, plus no following
|
|
associated fence-paired-atomic can be moved before the
|
|
fence.
|
|
acq_rel Same constraints as both acquire and release.
|
|
seq_cst - If a load atomic then same constraints as acquire, plus no
|
|
preceding sequentially consistent load atomic/store
|
|
atomic/atomicrmw/fence instruction can be moved after the
|
|
seq_cst.
|
|
- If a store atomic then the same constraints as release, plus
|
|
no following sequentially consistent load atomic/store
|
|
atomic/atomicrmw/fence instruction can be moved before the
|
|
seq_cst.
|
|
- If an atomicrmw/fence then same constraints as acq_rel.
|
|
============ ==============================================================
|
|
|
|
The code sequences used to implement the memory model are defined in the
|
|
following sections:
|
|
|
|
* :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
|
|
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
|
|
* :ref:`amdgpu-amdhsa-memory-model-gfx10`
|
|
|
|
.. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
|
|
|
|
Memory Model GFX6-GFX9
|
|
++++++++++++++++++++++
|
|
|
|
For GFX6-GFX9:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way and so do not impact the
  memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
|
|
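The synchronization rules in the list above can be summarized as a lookup from
sync scope to the instructions that scope may require. The following is a
minimal Python sketch under stated assumptions: ``sync_requirements`` is a
hypothetical helper that is not part of LLVM, and it ignores the instruction
kind, memory ordering, and address space, all of which refine the actual code
sequences.

.. code-block:: python

  # Hypothetical helper: the scope names are the LLVM sync scopes used in this
  # section; the function is illustrative only.
  SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


  def sync_requirements(scope):
      """Return the GFX6-GFX9 synchronization implied by the list above."""
      if scope not in SCOPES:
          raise ValueError(f"unknown sync scope: {scope}")
      rank = SCOPES.index(scope)
      required = []
      if rank >= SCOPES.index("workgroup"):
          # Wavefronts of a work-group share one LDS that has several request
          # queues, so LDS operations must be waited on explicitly.
          required.append("s_waitcnt lgkmcnt(0)")
      if rank >= SCOPES.index("agent"):
          # Other work-groups may run on other CUs, which have their own L2
          # request queues and their own vector L1 cache.
          required.append("s_waitcnt vmcnt(0)")
          required.append("buffer_wbinvl1_vol")
      return required

For example, ``sync_requirements("wavefront")`` returns an empty list because
operations of a single wavefront need no special action, while
``sync_requirements("agent")`` returns both waits and the L1 invalidate.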
|
|
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.
|
|
|
|
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
|
|
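The write-back rule for scalar spills described above can be sketched as a pass
over a function's instruction list. This is a minimal Python sketch and not the
backend's implementation: ``insert_scalar_writeback`` is a hypothetical helper,
instructions are modeled as plain strings, and treating ``s_setpc_b64`` as the
function return is an assumption made for the example.

.. code-block:: python

  # Assumption: a function return is modeled as an ``s_setpc_b64`` instruction.
  EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


  def insert_scalar_writeback(instructions, uses_scalar_writes):
      """Return a copy of the list with s_dcache_wb before each exit point."""
      if not uses_scalar_writes:
          # No scalar writes were used to spill, so nothing needs writing back.
          return list(instructions)
      result = []
      for inst in instructions:
          parts = inst.split()
          mnemonic = parts[0] if parts else ""
          if mnemonic in EXIT_MNEMONICS:
              # Write back the scalar cache before the wavefront ends or the
              # function returns. No s_dcache_inv is added.
              result.append("s_dcache_wb")
          result.append(inst)
      return result

Passing ``uses_scalar_writes=False`` returns the instructions unchanged, which
mirrors the text: the write-back is only needed when scalar writes were used.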
|
|
For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
|
|
|
|
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
|
|
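The memory types named above for kernarg and scratch backing memory can be
collected into a small lookup. The following Python sketch is illustrative
only: the kind names ``"kernarg"`` and ``"scratch"`` and both helper functions
are hypothetical, and only statements made in the text above are encoded.

.. code-block:: python

  def backing_memory_mtype(kind, is_apu=False):
      """Return the MTYPE the text above gives for a kind of backing memory."""
      if kind == "kernarg":
          # dGPU: uncached host memory. APU: cache coherent.
          return "CC" if is_apu else "UC"
      if kind == "scratch":
          # The text does not distinguish dGPU from APU for scratch.
          return "NC_NV"
      raise ValueError(f"no MTYPE described for {kind!r}")


  def kept_by_vol_invalidate(mtype):
      """Return True if the text says ``*_vol`` invalidates skip such lines.

      Returns None when the text above makes no statement for the MTYPE.
      """
      if mtype in ("UC", "NC_NV"):
          return True
      return None

For example, ``backing_memory_mtype("kernarg", is_apu=True)`` returns ``"CC"``,
and ``kept_by_vol_invalidate("NC_NV")`` returns ``True`` because scratch lines
are non-volatile.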
|
|
The code sequences used to implement the memory model for GFX6-GFX9 are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
|
|
|
|
.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
|
|
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
|
|
|
|
============ ============ ============== ========== ================================
|
|
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
|
|
Ordering Sync Scope Address GFX6-GFX9
|
|
Space
|
|
============ ============ ============== ========== ================================
|
|
**Non-Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_load
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_load
|
|
glc=1 slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_load
|
|
glc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
load *none* *none* - local 1. ds_load
|
|
store *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_store
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_store
|
|
glc=1 slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_store
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
store *none* *none* - local 1. ds_store
|
|
**Unordered Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic unordered *any* *any* *Same as non-atomic*.
|
|
store atomic unordered *any* *any* *Same as non-atomic*.
|
|
atomicrmw unordered *any* *any* *Same as monotonic atomic*.
|
|
**Monotonic Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
|
|
- wavefront - local
|
|
- workgroup - generic
|
|
load atomic monotonic - agent - global 1. buffer/global/flat_load
|
|
- system - generic glc=1
|
|
store atomic monotonic - singlethread - global 1. buffer/global/flat_store
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
- system
|
|
store atomic monotonic - singlethread - local 1. ds_store
|
|
- wavefront
|
|
- workgroup
|
|
atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
- system
|
|
atomicrmw monotonic - singlethread - local 1. ds_atomic
|
|
- wavefront
|
|
- workgroup
|
|
**Acquire Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
|
|
- wavefront - local
|
|
- generic
|
|
load atomic acquire - workgroup - global 1. buffer/global_load
|
|
load atomic acquire - workgroup - local 1. ds/flat_load
|
|
- generic 2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local load
|
|
atomic value being
|
|
acquired.
|
|
|
|
load atomic acquire - agent - global 1. buffer/global_load
|
|
- system glc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the load
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale global data.
|
|
|
|
load atomic acquire - agent - generic 1. flat_load glc=1
|
|
- system 2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the flat_load
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw acquire - workgroup - global 1. buffer/global_atomic
|
|
atomicrmw acquire - workgroup - local 1. ds/flat_atomic
|
|
- generic 2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
atomicrmw acquire - agent - global 1. buffer/global_atomic
|
|
- system 2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - agent - generic 1. flat_atomic
|
|
- system 2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
fence acquire - singlethread *none* *none*
|
|
- wavefront
|
|
fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit.
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the
|
|
value read by the
|
|
fence-paired-atomic.
|
|
|
|
fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
fence-paired-atomic
|
|
has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- Must happen before any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
**Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
|
|
- wavefront - local
|
|
- generic
|
|
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
|
|
- generic
|
|
- If OpenCL, omit.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
store atomic release - workgroup - local 1. ds_store
|
|
store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and local have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
|
|
- generic
|
|
- If OpenCL, omit.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
atomicrmw release - workgroup - local 1. ds_atomic
|
|
atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and local
|
|
have completed
|
|
before performing
|
|
the atomicrmw that
|
|
is being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
fence release - singlethread *none* *none*
|
|
- wavefront
|
|
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit.
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
**Acquire-Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
|
|
atomicrmw acq_rel - workgroup - local 1. ds_atomic
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local
atomicrmw value being
|
|
acquired.
|
|
|
|
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local
atomicrmw value being
|
|
acquired.
|
|
|
|
atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
3. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
fence acq_rel - singlethread *none* *none*
|
|
- wavefront
|
|
fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit.
|
|
- However,
|
|
since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to local have
|
|
completed before
|
|
performing any
|
|
following global
|
|
memory operations.
|
|
- Ensures that the
|
|
preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before following
|
|
global memory
|
|
operations. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
local/generic store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
|
|
fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
preceding
|
|
global/local/generic
|
|
load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before invalidating
|
|
the cache. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
global/local/generic
|
|
store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
|
|
**Sequentially Consistent Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local load atomic acquire,
|
|
- generic except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
|
|
- generic
|
|
|
|
- Must
|
|
happen after
|
|
preceding
|
|
local/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent local
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
the seq_cst store or before
the seq_cst load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - local *Same as corresponding
|
|
load atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
|
|
load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0)
|
|
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0)
|
|
and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
preceding
|
|
local/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
the seq_cst store or before
the seq_cst load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
store atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local store atomic release,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
atomicrmw seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local atomicrmw acq_rel,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
fence seq_cst - singlethread *none* *Same as corresponding
|
|
- wavefront fence acq_rel,
|
|
- workgroup except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
============ ============ ============== ========== ================================
|
|
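The ``load atomic`` rows of the table above can be read as a lookup from sync
scope and address space to a code sequence. The following is a minimal Python
sketch of the ``acquire`` and ``seq_cst`` rows, not the backend's
implementation: the function names and the ``opencl`` flag are hypothetical,
and the returned strings reuse the table's notation (for example
``buffer/global_load`` stands for either instruction form) rather than
assembler syntax.

.. code-block:: python

  def load_atomic_acquire(scope, addrspace, opencl=False):
      """Return the GFX6-GFX9 steps for ``load atomic acquire``."""
      if scope in ("singlethread", "wavefront"):
          if addrspace in ("global", "local", "generic"):
              return ["buffer/global/ds/flat_load"]
      elif scope == "workgroup":
          if addrspace == "global":
              return ["buffer/global_load"]
          if addrspace in ("local", "generic"):
              steps = ["ds/flat_load"]
              if not opencl:
                  # The table allows omitting this wait for OpenCL.
                  steps.append("s_waitcnt lgkmcnt(0)")
              return steps
      elif scope in ("agent", "system"):
          if addrspace == "global":
              return ["buffer/global_load glc=1",
                      "s_waitcnt vmcnt(0)",
                      "buffer_wbinvl1_vol"]
          if addrspace == "generic":
              # For OpenCL only the lgkmcnt(0) part may be omitted.
              wait = ("s_waitcnt vmcnt(0)" if opencl
                      else "s_waitcnt vmcnt(0) & lgkmcnt(0)")
              return ["flat_load glc=1", wait, "buffer_wbinvl1_vol"]
      raise ValueError(f"no table row for {scope}/{addrspace}")


  def load_atomic_seq_cst(scope, addrspace):
      """Return the GFX6-GFX9 steps for ``load atomic seq_cst``."""
      # seq_cst must generate all instructions, even for OpenCL.
      steps = load_atomic_acquire(scope, addrspace, opencl=False)
      if scope == "workgroup" and addrspace in ("global", "generic"):
          return ["s_waitcnt lgkmcnt(0)"] + steps
      if scope in ("agent", "system"):
          return ["s_waitcnt lgkmcnt(0) & vmcnt(0)"] + steps
      return steps

For example, ``load_atomic_seq_cst("workgroup", "global")`` returns the leading
``s_waitcnt lgkmcnt(0)`` followed by the acquire sequence. Combinations that
have no row in the table, such as agent scope with the local address space,
raise ``ValueError`` instead of guessing a sequence.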
|
|
.. _amdgpu-amdhsa-memory-model-gfx90a:

Memory Model GFX90A
+++++++++++++++++++
|
|
|
|
For GFX90A:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs. The exception is tgsplit execution mode, in
  which the wavefronts may be executed by different SIMDs in different CUs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it. The exception is tgsplit execution mode, in which no LDS is
  allocated because wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that ``flat_load/store/atomic`` instructions can report out of vector memory
  order if they access LDS memory, and out of LDS operation order if they access
  global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.

  * No special action is required for coherence between wavefronts in the same
    work-group since they execute on the same CU. The exception is tgsplit
    execution mode, where wavefronts of the same work-group can be in
    different CUs, and so a ``buffer_wbinvl1_vol`` is required as described in
    the following item.

  * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.

* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way and so do not impact the
  memory model. See :ref:`amdgpu-amdhsa-memory-spaces`.
|
|
* The vector and scalar memory operations use an L2 cache shared by all CUs on
|
|
the same agent.
|
|
|
|
* The L2 cache has independent channels to service disjoint ranges of virtual
|
|
addresses.
|
|
* Each CU has a separate request queue per channel. Therefore, the vector and
|
|
scalar memory operations performed by wavefronts executing in different
|
|
work-groups (which may be executing on different CUs), or the same
|
|
work-group if executing in tgsplit mode, of an agent can be reordered
|
|
relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
|
|
synchronization between vector memory operations of different CUs. It
|
|
ensures a previous vector memory operation has completed before executing a
|
|
subsequent vector memory or LDS operation and so can be used to meet the
|
|
requirements of acquire and release.
|
|
* The L2 cache of one agent can be kept coherent with other agents by:
|
|
using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
|
|
C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
|
|
the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
|
|
|
|
* Any local memory cache lines will be automatically invalidated by writes
|
|
from CUs associated with other L2 caches, or writes from the CPU, due to
|
|
the cache probe caused by coherent requests. Coherent requests are caused
|
|
by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
|
|
XGMI, and by PCIe requests that are configured to be coherent requests.
|
|
* XGMI accesses from the CPU to local memory may be cached on the CPU.
|
|
Subsequent access from the GPU will automatically invalidate or writeback
|
|
the CPU cache due to the L2 probe filter and and the PTE C-bit being set.
|
|
* Since all work-groups on the same agent share the same L2, no L2
|
|
invalidation or writeback is required for coherence.
|
|
* To ensure coherence of local and remote memory writes of work-groups in
|
|
different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
|
|
cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
|
|
()used for remote coarse grain memory). Note that MTYPE CC (used for local
|
|
fine grain memory) causes write through to DRAM, and MTYPE UC (used for
|
|
remote fine grain memory) bypasses the L2, so both will never result in
|
|
dirty L2 cache lines.
|
|
* To ensure coherence of local and remote memory reads of work-groups in
|
|
different agents a ``buffer_invl2`` is required. It will invalidate L2
|
|
cache lines with MTYPE NC (used for remote coarse grain memory). Note that
|
|
MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
|
|
coarse memory) cause local reads to be invalidated by remote writes with
|
|
with the PTE C-bit so these cache lines are not invalidated. Note that
|
|
MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
|
|
never result in L2 cache lines that need to be invalidated.
|
|
|
|
* PCIe access from the GPU to the CPU memory is kept coherent by using the
|
|
MTYPE UC (uncached) which bypasses the L2.
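The MTYPE rules in the list above can be summarized as a small lookup. The following Python sketch is only an illustration of which L2 cache lines ``buffer_wbl2`` and ``buffer_invl2`` act on, as described above. The class, field, and function names are hypothetical and do not correspond to any LLVM or driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MType:
    """One memory type as described in the list above (hypothetical model)."""
    name: str
    used_for: str
    written_back_by_wbl2: bool   # can hold dirty L2 lines
    invalidated_by_invl2: bool   # can hold stale L2 lines


MTYPES = {
    # RW: local coarse grain; kept fresh by probes, but may be dirty.
    "RW": MType("RW", "local coarse grain memory", True, False),
    # NC: remote coarse grain; may be dirty and may be stale.
    "NC": MType("NC", "remote coarse grain memory", True, True),
    # CC: local fine grain; write through to DRAM, kept fresh by probes.
    "CC": MType("CC", "local fine grain memory", False, False),
    # UC: remote fine grain; bypasses the L2 entirely.
    "UC": MType("UC", "remote fine grain memory", False, False),
}


def mtypes_affected_by(instruction):
    """Return the set of MTYPE names whose L2 lines `instruction` acts on."""
    if instruction == "buffer_wbl2":
        return {m.name for m in MTYPES.values() if m.written_back_by_wbl2}
    if instruction == "buffer_invl2":
        return {m.name for m in MTYPES.values() if m.invalidated_by_invl2}
    raise ValueError(f"not an L2 maintenance instruction: {instruction}")
```

In this model ``mtypes_affected_by("buffer_wbl2")`` contains ``RW`` and ``NC``, while ``mtypes_affected_by("buffer_invl2")`` contains only ``NC``.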
|
|
|
|
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
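The ``s_dcache_wb`` placement rule above can be sketched as a pass over a list of instruction strings. This is a simplified Python illustration and not the backend's implementation. The scalar store mnemonic prefixes and the use of ``s_setpc_b64`` to stand for a function return are assumptions made for the example.

```python
# Assumptions for illustration only: mnemonic prefixes treated as scalar
# writes, and s_setpc_b64 standing in for a function return.
SCALAR_STORE_PREFIXES = ("s_store_", "s_buffer_store_", "s_scratch_store_")
EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


def insert_dcache_wb(instrs):
    """Return a copy of `instrs` with s_dcache_wb placed before every exit
    point, but only if the sequence contains a scalar write."""
    uses_scalar_writes = any(
        i.lstrip().startswith(SCALAR_STORE_PREFIXES) for i in instrs
    )
    if not uses_scalar_writes:
        return list(instrs)

    out = []
    for instr in instrs:
        fields = instr.split()
        if fields and fields[0] in EXIT_MNEMONICS:
            # Write back the scalar cache before the wavefront ends or the
            # function returns.
            out.append("s_dcache_wb")
        out.append(instr)
    return out
```

No ``s_dcache_inv`` is inserted, matching the write-before-read argument above.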
|
|
|
|
For kernarg backing memory:

* CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
  memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
  cache. This also causes it to be treated as non-volatile and so is not
  invalidated by ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.

The code sequences used to implement the memory model for GFX90A are defined
in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
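As a rough guide to reading that table, the following Python sketch reproduces two of its rows for the global address space: ``load atomic acquire`` and ``store atomic release``. It is an illustration under simplifying assumptions, not the backend's implementation. The function names are hypothetical, only the ``global_*`` instruction form is shown, and ``s_waitcnt`` operands are written in the table's ``a & b`` notation rather than assembler syntax.

```python
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


def load_atomic_acquire_global(scope, tgsplit=False):
    """Code sequence for a global `load atomic acquire` on GFX90A."""
    if scope not in SCOPES:
        raise ValueError(scope)
    if scope in ("singlethread", "wavefront"):
        return ["global_load"]
    if scope == "workgroup" and not tgsplit:
        # Same CU, same L1: glc, the wait and the invalidate are all omitted.
        return ["global_load"]
    seq = ["global_load glc=1", "s_waitcnt vmcnt(0)"]
    if scope == "system":
        seq.append("buffer_invl2")  # drop stale L2 MTYPE NC lines
    seq.append("buffer_wbinvl1_vol")  # drop stale L1 lines
    return seq


def store_atomic_release_global(scope, tgsplit=False, opencl=False):
    """Code sequence for a global `store atomic release` on GFX90A."""
    if scope not in SCOPES:
        raise ValueError(scope)
    if scope in ("singlethread", "wavefront"):
        return ["global_store"]
    seq = []
    if scope == "system":
        seq.append("buffer_wbl2")  # must precede the s_waitcnt
    if scope == "workgroup":
        # lgkmcnt(0) if not tgsplit (omitted for OpenCL), vmcnt(0) if tgsplit.
        if tgsplit:
            counters = ["vmcnt(0)"]
        else:
            counters = [] if opencl else ["lgkmcnt(0)"]
    else:
        # agent and system: lgkmcnt(0) is omitted in tgsplit mode, and for
        # OpenCL because the global address space is not generic.
        counters = [] if (tgsplit or opencl) else ["lgkmcnt(0)"]
        counters.append("vmcnt(0)")
    if counters:
        seq.append("s_waitcnt " + " & ".join(counters))
    seq.append("global_store")
    return seq
```

The acquire side places its waits and invalidates after the load, while the release side places its writeback and waits before the store, which is the pattern repeated throughout the table.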
|
|
|
|
.. table:: AMDHSA Memory Model Code Sequences GFX90A
|
|
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
|
|
|
|
============ ============ ============== ========== ================================
|
|
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
|
|
Ordering Sync Scope Address GFX90A
|
|
Space
|
|
============ ============ ============== ========== ================================
|
|
**Non-Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_load
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_load
|
|
glc=1 slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_load
|
|
glc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
load *none* *none* - local 1. ds_load
|
|
store *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_store
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_store
|
|
glc=1 slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_store
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
store *none* *none* - local 1. ds_store
|
|
**Unordered Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic unordered *any* *any* *Same as non-atomic*.
|
|
store atomic unordered *any* *any* *Same as non-atomic*.
|
|
atomicrmw unordered *any* *any* *Same as monotonic atomic*.
|
|
**Monotonic Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
|
|
- wavefront - generic
|
|
load atomic monotonic - workgroup - global 1. buffer/global/flat_load
|
|
- generic glc=1
|
|
|
|
- If not TgSplit execution
|
|
mode, omit glc=1.
|
|
|
|
load atomic monotonic - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
- workgroup be used.*
|
|
|
|
1. ds_load
|
|
load atomic monotonic - agent - global 1. buffer/global/flat_load
|
|
- generic glc=1
|
|
load atomic monotonic - system - global 1. buffer/global/flat_load
|
|
- generic glc=1
|
|
store atomic monotonic - singlethread - global 1. buffer/global/flat_store
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
store atomic monotonic - system - global 1. buffer/global/flat_store
|
|
- generic
|
|
store atomic monotonic - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
- workgroup be used.*
|
|
|
|
1. ds_store
|
|
atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
|
|
- generic
|
|
atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
- workgroup be used.*
|
|
|
|
1. ds_atomic
|
|
**Acquire Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
|
|
- wavefront - local
|
|
- generic
|
|
load atomic acquire - workgroup - global 1. buffer/global_load glc=1
|
|
|
|
- If not TgSplit execution
|
|
mode, omit glc=1.
|
|
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Must happen before the
|
|
following buffer_wbinvl1_vol.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
load atomic acquire - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
1. ds_load
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local load
|
|
atomic value being
|
|
acquired.
|
|
|
|
load atomic acquire - workgroup - generic 1. flat_load glc=1
|
|
|
|
- If not TgSplit execution
|
|
mode, omit glc=1.
|
|
|
|
2. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol and any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local load
|
|
atomic value being
|
|
acquired.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
load atomic acquire - agent - global 1. buffer/global_load
|
|
glc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the load
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale global data.
|
|
|
|
load atomic acquire - system - global 1. buffer/global/flat_load
|
|
glc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the load
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
load atomic acquire - agent - generic 1. flat_load glc=1
|
|
2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the flat_load
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
load atomic acquire - system - generic 1. flat_load glc=1
|
|
2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the flat_load
|
|
has completed
|
|
before invalidating
|
|
the caches.
|
|
|
|
3. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
atomicrmw acquire - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
atomicrmw acquire - workgroup - global 1. buffer/global_atomic
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Must happen before the
|
|
following buffer_wbinvl1_vol.
|
|
- Ensures the atomicrmw
|
|
has completed
|
|
before invalidating
|
|
the cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
atomicrmw acquire - workgroup - generic 1. flat_atomic
|
|
2. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol and
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acquire - agent - global 1. buffer/global_atomic
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - system - global 1. buffer/global_atomic
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
3. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
atomicrmw acquire - agent - generic 1. flat_atomic
|
|
2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
3. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - system - generic 1. flat_atomic
|
|
2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
3. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
fence acquire - singlethread *none* *none*
|
|
- wavefront
|
|
fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, the backend must
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol and
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the
|
|
value read by the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, the backend must
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
fence-paired-atomic
|
|
has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- Must happen before any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, the backend must
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
fence-paired-atomic
|
|
has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
**Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
store atomic release - singlethread - global 1. buffer/global/flat_store
|
|
- wavefront - generic
|
|
store atomic release - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
be used.*
|
|
|
|
1. ds_store
|
|
store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
|
|
- generic
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
store atomic release - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
1. ds_store
|
|
store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- generic vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
store atomic release - system - global 1. buffer_wbl2
|
|
- generic
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after any
|
|
preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after any
|
|
preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
to memory and the L2
|
|
writeback have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
3. buffer/global/flat_store
|
|
atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
atomicrmw release - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
|
|
- generic
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
atomicrmw release - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- generic vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and local
|
|
have completed
|
|
before performing
|
|
the atomicrmw that
|
|
is being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
atomicrmw release - system - global 1. buffer_wbl2
|
|
- generic
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to memory and the L2
|
|
writeback have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is being
|
|
released.
|
|
|
|
3. buffer/global/flat_atomic
|
|
fence release - singlethread *none* *none*
|
|
- wavefront
|
|
fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, the backend must
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
fence release - system *none* 1. buffer_wbl2
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit.
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
**Acquire-Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
|
|
- wavefront local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
3. s_waitcnt vmcnt(0)
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
1. ds_atomic
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local
|
|
atomicrmw value being
|
|
acquired.
|
|
|
|
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If not TgSplit execution
|
|
mode, omit vmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol and
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local
|
|
atomicrmw value being
|
|
acquired.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
3. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acq_rel - system - global 1. buffer_wbl2
|
|
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and L2 writeback
|
|
have completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
3. buffer/global_atomic
|
|
4. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
5. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
cache.
|
|
|
|
4. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acq_rel - system - generic 1. buffer_wbl2
|
|
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and L2 writeback
|
|
have completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
3. flat_atomic
|
|
4. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
5. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
fence acq_rel - singlethread *none* *none*
|
|
- wavefront
|
|
fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
|
|
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0).
|
|
- However,
|
|
since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/
|
|
load atomic/store atomic/
|
|
atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing any
|
|
following global
|
|
memory operations.
|
|
- Ensures that the
|
|
preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before following
|
|
global memory
|
|
operations. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
local/generic store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
acquire-fence-paired
|
|
atomic has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
acquire-fence-paired-atomic.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- If not TgSplit execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
preceding
|
|
global/local/generic
|
|
load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before invalidating
|
|
the cache. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
global/local/generic
|
|
store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
|
|
2. buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
|
|
fence acq_rel - system *none* 1. buffer_wbl2
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit.
|
|
- Must happen before
|
|
following s_waitcnt.
|
|
- Performs L2 writeback to
|
|
ensure previous
|
|
global/generic
|
|
store/atomicrmw are
|
|
visible at system scope.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and
|
|
s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following buffer_invl2 and
|
|
buffer_wbinvl1_vol.
|
|
- Ensures that the
|
|
preceding
|
|
global/local/generic
|
|
load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before invalidating
|
|
the cache. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
global/local/generic
|
|
store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
|
|
3. buffer_invl2;
|
|
buffer_wbinvl1_vol
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale L1 global data,
|
|
nor see stale L2 MTYPE
|
|
NC global data.
|
|
MTYPE RW and CC memory will
|
|
never be stale in L2 due to
|
|
the memory probes.
|
|
|
|
**Sequential Consistent Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local load atomic acquire,
|
|
- generic except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
|
|
- generic
|
|
- Use lgkmcnt(0) if not
|
|
TgSplit execution mode
|
|
and vmcnt(0) if TgSplit
|
|
execution mode.
|
|
- s_waitcnt lgkmcnt(0) must
|
|
happen after
|
|
preceding
|
|
local/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global/local
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
seq_store or before
|
|
the seq_load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - local *If TgSplit execution mode,
|
|
local address space cannot
|
|
be used.*
|
|
|
|
*Same as corresponding
|
|
load atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
|
|
load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0)
|
|
|
|
- If TgSplit execution mode,
|
|
omit lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0)
|
|
and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
seq_store or before
|
|
the seq_load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
store atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local store atomic release,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
atomicrmw seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local atomicrmw acq_rel,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
fence seq_cst - singlethread *none* *Same as corresponding
|
|
- wavefront fence acq_rel,
|
|
- workgroup except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
============ ============ ============== ========== ================================
|
|
|
|
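
As an illustration of how the ``fence release`` rows of the table that ends
here can be read, the following minimal sketch returns the listed instruction
sequence for a given sync scope. The function name and the ``tgsplit``
parameter are hypothetical and are not part of LLVM. The sketch ignores the
OpenCL address-space omissions, because the table notes that LLVM currently has
no address space on the fence and must conservatively always generate the
instructions. The ``&`` in the returned text follows the table's notation
rather than assembler syntax.

.. code-block:: python

  def fence_release_sequence(scope, tgsplit=False):
      """Return the table's instruction list for ``fence release``.

      ``scope`` is an LLVM sync scope name; ``tgsplit`` (hypothetical flag)
      selects TgSplit execution mode.
      """
      if scope in ("singlethread", "wavefront"):
          # The table lists *none* for these scopes.
          return []
      if scope == "workgroup":
          # Use lgkmcnt(0) if not TgSplit, and vmcnt(0) if TgSplit.
          counter = "vmcnt(0)" if tgsplit else "lgkmcnt(0)"
          return ["s_waitcnt " + counter]
      if scope in ("agent", "system"):
          # In TgSplit execution mode lgkmcnt(0) is omitted.
          counters = ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)", "vmcnt(0)"]
          wait = "s_waitcnt " + " & ".join(counters)
          # System scope first performs an L2 writeback, which must happen
          # before the following s_waitcnt.
          prefix = ["buffer_wbl2"] if scope == "system" else []
          return prefix + [wait]
      raise ValueError("unknown sync scope: " + scope)

For example, ``fence_release_sequence("system")`` yields ``buffer_wbl2``
followed by a single combined ``s_waitcnt``, matching the order required by the
table.
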
.. _amdgpu-amdhsa-memory-model-gfx10:

Memory Model GFX10
++++++++++++++++++

For GFX10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.

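
The cache hierarchy described in the list above can be summarized with a small
model. This is only an illustrative sketch: the ``Placement`` type and both
function names are hypothetical, and a compiler does not know where a wavefront
will execute, so it selects invalidates from the sync scope and execution mode
rather than from a placement. The sketch assumes that a WGP index is local to
its SA and a CU index is local to its WGP.

.. code-block:: python

  from dataclasses import dataclass


  @dataclass(frozen=True)
  class Placement:
      """Hypothetical location of a wavefront inside one agent."""
      sa: int   # shader array index
      wgp: int  # work-group processor index within the SA
      cu: int   # compute unit index within the WGP


  def nearest_shared_cache(a, b):
      """Return the closest cache level that both wavefronts access."""
      if (a.sa, a.wgp, a.cu) == (b.sa, b.wgp, b.cu):
          return "L0"  # there is a single vector L0 cache per CU
      if a.sa == b.sa:
          return "L1"  # the L1 is shared by all WGPs on the same SA
      return "L2"      # the L2 is shared by all SAs on the same agent


  def invalidates_for_coherence(a, b):
      """List the invalidates needed for ``a`` to observe data from ``b``."""
      level = nearest_shared_cache(a, b)
      if level == "L0":
          return []
      if level == "L1":
          return ["buffer_gl0_inv"]
      return ["buffer_gl0_inv", "buffer_gl1_inv"]

Two wavefronts on different CUs of the same SA share only the L1, so the model
reports ``buffer_gl0_inv`` alone. Wavefronts on different SAs additionally need
``buffer_gl1_inv``.
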
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

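
The writeback rule for scalar spills can be sketched as a pass over a list of
instruction strings. This is a simplified illustration and not the backend's
implementation. The function name and the ``uses_scalar_writes`` flag are
hypothetical, and treating ``s_setpc_b64`` as the function return instruction
is an assumption made for the example.

.. code-block:: python

  def insert_scalar_writeback(instructions, uses_scalar_writes,
                              return_mnemonic="s_setpc_b64"):
      """Insert ``s_dcache_wb`` before program end and function return.

      ``return_mnemonic`` is an assumed stand-in for the instruction that
      returns from a function.
      """
      if not uses_scalar_writes:
          # Without scalar writes the scalar cache holds no dirty data.
          return list(instructions)
      result = []
      for inst in instructions:
          parts = inst.split(maxsplit=1)
          mnemonic = parts[0] if parts else ""
          if mnemonic in ("s_endpgm", return_mnemonic):
              # Write back spilled SGPR data before the scratch area can be
              # reused by a later wavefront or function frame.
              result.append("s_dcache_wb")
          result.append(inst)
      return result

No ``s_dcache_inv`` is emitted, which reflects the statement above that all
scalar writes are write-before-read in the same thread.
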
For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
|
|
sample instructions. In this mode vmcnt reports completion of load, atomic with
|
|
return and sample instructions in order, and the vscnt reports the completion of
|
|
store and atomic without return in order. See ``MEM_ORDERED`` field in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
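
The split between the two counters can be summarized with a small lookup. The
sketch below only restates the paragraph above; the function name
``completion_counter`` and the operation labels are hypothetical and are not
part of any LLVM API.

.. code-block:: python

   def completion_counter(op: str, returns_value: bool = False) -> str:
       """Name the counter that reports completion of a vector memory op."""
       if op in ("load", "sample"):
           return "vmcnt"
       if op == "store":
           return "vscnt"
       if op == "atomic":
           # Atomics with a return value behave like loads, others like stores.
           return "vmcnt" if returns_value else "vscnt"
       raise ValueError("unknown operation kind: " + op)

This is why several rows of the code sequence table say to use ``vmcnt(0)`` for
an atomic with return and ``vscnt(0)`` for an atomic with no return.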
|
|
|
|
Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
  per-CU L0 caches is required for work-group synchronization. Also, accesses
  to L1 at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group use the same L0, which in turn ensures L1 accesses are
  ordered, and so do not require explicit management of the caches for
  work-group synchronization.

See ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.
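
As an illustration of how the execution mode changes the generated code, the
sketch below computes the ``s_waitcnt`` operands that precede a workgroup-scope
``store atomic release`` to the global or generic address space, following the
corresponding row of the GFX10 code sequence table. The function name
``release_waitcnt_workgroup`` and its boolean parameters are hypothetical.

.. code-block:: python

   from typing import List


   def release_waitcnt_workgroup(cu_mode: bool, opencl: bool) -> List[str]:
       """Counters to wait on before a workgroup-scope release store."""
       counters: List[str] = []
       if not opencl:
           # Table note: if OpenCL, omit lgkmcnt(0).
           counters.append("lgkmcnt(0)")
       if not cu_mode:
           # Table note: in CU wavefront execution mode, omit vmcnt(0) and
           # vscnt(0).
           counters += ["vmcnt(0)", "vscnt(0)"]
       return counters

An empty result means no ``s_waitcnt`` is required for that combination, which
is the case for OpenCL in CU wavefront execution mode.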
|
|
|
|
The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
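
To show how a row of the table is read, the following sketch encodes the
``load atomic acquire`` rows for the global address space as a function from
sync scope and execution mode to a list of machine code steps. It is a reading
aid under stated assumptions, not the backend's implementation: the name
``load_acquire_global`` is hypothetical and the step strings are copied from
the table rather than being assembler-ready text.

.. code-block:: python

   from typing import List


   def load_acquire_global(scope: str, cu_mode: bool = False) -> List[str]:
       """Steps for a GFX10 'load atomic acquire' on the global address space."""
       if scope in ("singlethread", "wavefront"):
           return ["buffer/global_load"]
       if scope == "workgroup":
           if cu_mode:
               # In CU mode glc=1, the wait and the L0 invalidate are omitted.
               return ["buffer/global_load"]
           return ["buffer/global_load glc=1",
                   "s_waitcnt vmcnt(0)",
                   "buffer_gl0_inv"]
       if scope in ("agent", "system"):
           return ["buffer/global_load glc=1 dlc=1",
                   "s_waitcnt vmcnt(0)",
                   "buffer_gl0_inv",
                   "buffer_gl1_inv"]
       raise ValueError("unknown sync scope: " + scope)

The wait always precedes the invalidate so that the load has completed before
the caches are invalidated, as the notes in the table require.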
|
|
|
|
.. table:: AMDHSA Memory Model Code Sequences GFX10
|
|
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
|
|
|
|
============ ============ ============== ========== ================================
|
|
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
|
|
Ordering Sync Scope Address GFX10
|
|
Space
|
|
============ ============ ============== ========== ================================
|
|
**Non-Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_load
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_load
|
|
slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_load
|
|
glc=1 dlc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
load *none* *none* - local 1. ds_load
|
|
store *none* *none* - global - !volatile & !nontemporal
|
|
- generic
|
|
- private 1. buffer/global/flat_store
|
|
- constant
|
|
- !volatile & nontemporal
|
|
|
|
1. buffer/global/flat_store
|
|
glc=1 slc=1
|
|
|
|
- volatile
|
|
|
|
1. buffer/global/flat_store
|
|
2. s_waitcnt vscnt(0)
|
|
|
|
- Must happen before
|
|
any following volatile
|
|
global/generic
|
|
load/store.
|
|
- Ensures that
|
|
volatile
|
|
operations to
|
|
different
|
|
addresses will not
|
|
be reordered by
|
|
hardware.
|
|
|
|
store *none* *none* - local 1. ds_store
|
|
**Unordered Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic unordered *any* *any* *Same as non-atomic*.
|
|
store atomic unordered *any* *any* *Same as non-atomic*.
|
|
atomicrmw unordered *any* *any* *Same as monotonic atomic*.
|
|
**Monotonic Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
|
|
- wavefront - generic
|
|
load atomic monotonic - workgroup - global 1. buffer/global/flat_load
|
|
- generic glc=1
|
|
|
|
- If CU wavefront execution
|
|
mode, omit glc=1.
|
|
|
|
load atomic monotonic - singlethread - local 1. ds_load
|
|
- wavefront
|
|
- workgroup
|
|
load atomic monotonic - agent - global 1. buffer/global/flat_load
|
|
- system - generic glc=1 dlc=1
|
|
store atomic monotonic - singlethread - global 1. buffer/global/flat_store
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
- system
|
|
store atomic monotonic - singlethread - local 1. ds_store
|
|
- wavefront
|
|
- workgroup
|
|
atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
|
|
- wavefront - generic
|
|
- workgroup
|
|
- agent
|
|
- system
|
|
atomicrmw monotonic - singlethread - local 1. ds_atomic
|
|
- wavefront
|
|
- workgroup
|
|
**Acquire Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
|
|
- wavefront - local
|
|
- generic
|
|
load atomic acquire - workgroup - global 1. buffer/global_load glc=1
|
|
|
|
- If CU wavefront execution
|
|
mode, omit glc=1.
|
|
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Must happen before
|
|
the following buffer_gl0_inv
|
|
and before any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
load atomic acquire - workgroup - local 1. ds_load
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
the following buffer_gl0_inv
|
|
and before any following
|
|
global/generic load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local load
|
|
atomic value being
|
|
acquired.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- If OpenCL, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
load atomic acquire - workgroup - generic 1. flat_load glc=1
|
|
|
|
- If CU wavefront execution
|
|
mode, omit glc=1.
|
|
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv and any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local load
|
|
atomic value being
|
|
acquired.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
load atomic acquire - agent - global 1. buffer/global_load
|
|
- system glc=1 dlc=1
|
|
2. s_waitcnt vmcnt(0)
|
|
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the load
|
|
has completed
|
|
before invalidating
|
|
the caches.
|
|
|
|
3. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale global data.
|
|
|
|
load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
|
|
- system 2. s_waitcnt vmcnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the flat_load
|
|
has completed
|
|
before invalidating
|
|
the caches.
|
|
|
|
3. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw acquire - workgroup - global 1. buffer/global_atomic
|
|
2. s_waitcnt vm/vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
the following buffer_gl0_inv
|
|
and before any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acquire - workgroup - local 1. ds_atomic
|
|
2. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If OpenCL, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acquire - workgroup - generic 1. flat_atomic
|
|
2. s_waitcnt lgkmcnt(0) &
|
|
vm/vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vm/vscnt(0).
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than a local
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
3. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acquire - agent - global 1. buffer/global_atomic
|
|
- system 2. s_waitcnt vm/vscnt(0)
|
|
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
3. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acquire - agent - generic 1. flat_atomic
|
|
- system 2. s_waitcnt vm/vscnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
3. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
fence acquire - singlethread *none* *none*
|
|
- wavefront
|
|
fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw-with-return-value
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
atomicrmw-no-return-value
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures that the
|
|
fence-paired atomic
|
|
has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw-with-return-value
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
atomicrmw-no-return-value
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Must happen before
|
|
the following
|
|
buffer_gl*_inv.
|
|
- Ensures that the
|
|
fence-paired atomic
|
|
has completed
|
|
before invalidating
|
|
the
|
|
caches. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
fence-paired-atomic.
|
|
|
|
2. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before any
|
|
following global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
**Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
|
|
- wavefront - local
|
|
- generic
|
|
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
|
|
- generic vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store
|
|
atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- If OpenCL, omit.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and s_waitcnt
|
|
vscnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
global memory
|
|
operations have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. ds_store
|
|
store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt vscnt(0)
|
|
and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
store.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
store that is being
|
|
released.
|
|
|
|
2. buffer/global/flat_store
|
|
atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
|
|
- generic vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store
|
|
atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- If OpenCL, omit.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and s_waitcnt
|
|
vscnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
global memory
|
|
operations have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is being
|
|
released.
|
|
|
|
2. ds_atomic
|
|
atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global and local
|
|
have completed
|
|
before performing
|
|
the atomicrmw that
|
|
is being released.
|
|
|
|
2. buffer/global/flat_atomic
|
|
fence release - singlethread *none* *none*
|
|
- wavefront
|
|
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store atomic/
|
|
atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate. If
|
|
fence had an
|
|
address space then
|
|
set to address
|
|
space of OpenCL
|
|
fence flag, or to
|
|
generic if both
|
|
local and global
|
|
flags are
|
|
specified.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
any following store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
fence-paired-atomic).
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
following
|
|
fence-paired-atomic.
|
|
|
|
**Acquire-Release Atomic**
|
|
------------------------------------------------------------------------------------
|
|
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
|
|
- wavefront - local
|
|
- generic
|
|
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0), and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store
|
|
atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
3. s_waitcnt vm/vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the
|
|
atomicrmw value
|
|
being acquired.
|
|
|
|
4. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- If OpenCL, omit.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and s_waitcnt
|
|
vscnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
global memory
|
|
operations have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is being
|
|
released.
|
|
|
|
2. ds_atomic
|
|
3. s_waitcnt lgkmcnt(0)
|
|
|
|
- If OpenCL, omit.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the local
|
|
atomicrmw value being
|
|
acquired.
|
|
|
|
4. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- If OpenCL, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store
|
|
atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL, omit lgkmcnt(0).
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures any
|
|
following global
|
|
data read is no
|
|
older than the
|
|
atomicrmw value being
|
|
acquired.
|
|
|
|
4. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
to global have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. buffer/global_atomic
|
|
3. s_waitcnt vm/vscnt(0)
|
|
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
4. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0), and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing the
|
|
atomicrmw that is
|
|
being released.
|
|
|
|
2. flat_atomic
|
|
3. s_waitcnt vm/vscnt(0) &
|
|
lgkmcnt(0)
|
|
|
|
- If OpenCL, omit
|
|
lgkmcnt(0).
|
|
- Use vmcnt(0) if atomic with
|
|
return and vscnt(0) if
|
|
atomic with no-return.
|
|
- Must happen before
|
|
following
|
|
buffer_gl*_inv.
|
|
- Ensures the
|
|
atomicrmw has
|
|
completed before
|
|
invalidating the
|
|
caches.
|
|
|
|
4. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data.
|
|
|
|
fence acq_rel - singlethread *none* *none*
|
|
- wavefront
|
|
fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
|
|
vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However,
|
|
since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store atomic/
|
|
atomicrmw.
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that all
|
|
memory operations
|
|
have
|
|
completed before
|
|
performing any
|
|
following global
|
|
memory operations.
|
|
- Ensures that the
|
|
preceding
|
|
local/generic load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before following
|
|
global memory
|
|
operations. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
local/generic store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl0_inv.
|
|
- Ensures that the
|
|
acquire-fence-paired
|
|
atomic has completed
|
|
before invalidating
|
|
the
|
|
cache. Therefore
|
|
any following
|
|
locations read must
|
|
be no older than
|
|
the value read by
|
|
the
|
|
acquire-fence-paired-atomic.
|
|
|
|
2. buffer_gl0_inv
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Ensures that
|
|
following
|
|
loads will not see
|
|
stale data.
|
|
|
|
fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
|
|
- system vmcnt(0) & vscnt(0)
|
|
|
|
- If OpenCL and
|
|
address space is
|
|
not generic, omit
|
|
lgkmcnt(0).
|
|
- If OpenCL and
|
|
address space is
|
|
local, omit
|
|
vmcnt(0) and vscnt(0).
|
|
- However, since LLVM
|
|
currently has no
|
|
address space on
|
|
the fence, we need to
|
|
conservatively
|
|
always generate
|
|
(see comment for
|
|
previous fence).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
load/load
|
|
atomic/
|
|
atomicrmw-with-return-value.
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
any preceding
|
|
global/generic
|
|
store/store atomic/
|
|
atomicrmw-no-return-value.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
any preceding
|
|
local/generic
|
|
load/store/load
|
|
atomic/store
|
|
atomic/atomicrmw.
|
|
- Must happen before
|
|
the following
|
|
buffer_gl*_inv.
|
|
- Ensures that the
|
|
preceding
|
|
global/local/generic
|
|
load
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
acquire-fence-paired-atomic)
|
|
has completed
|
|
before invalidating
|
|
the caches. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
- Ensures that all
|
|
previous memory
|
|
operations have
|
|
completed before a
|
|
following
|
|
global/local/generic
|
|
store
|
|
atomic/atomicrmw
|
|
with an equal or
|
|
wider sync scope
|
|
and memory ordering
|
|
stronger than
|
|
unordered (this is
|
|
termed the
|
|
release-fence-paired-atomic).
|
|
This satisfies the
|
|
requirements of
|
|
release.
|
|
|
|
2. buffer_gl0_inv;
|
|
buffer_gl1_inv
|
|
|
|
- Must happen before
|
|
any following
|
|
global/generic
|
|
load/load
|
|
atomic/store/store
|
|
atomic/atomicrmw.
|
|
- Ensures that
|
|
following loads
|
|
will not see stale
|
|
global data. This
|
|
satisfies the
|
|
requirements of
|
|
acquire.
|
|
|
|
**Sequential Consistent Atomic**
|
|
------------------------------------------------------------------------------------
|
|
load atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local load atomic acquire,
|
|
- generic except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
|
|
- generic vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit vmcnt(0) and
|
|
vscnt(0).
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0), and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt lgkmcnt(0) must
|
|
happen after
|
|
preceding
|
|
local/generic load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw-with-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic store
|
|
atomic/
|
|
atomicrmw-no-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vscnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global/local
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
seq_store or before
|
|
the seq_load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
load atomic seq_cst - workgroup - local
|
|
|
|
1. s_waitcnt vmcnt(0) & vscnt(0)
|
|
|
|
- If CU wavefront execution
|
|
mode, omit.
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0) and s_waitcnt
|
|
vscnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw-with-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic store
|
|
atomic/
|
|
atomicrmw-no-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vscnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
seq_store or before
|
|
the seq_load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
|
|
load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
|
|
- system - generic vmcnt(0) & vscnt(0)
|
|
|
|
- Could be split into
|
|
separate s_waitcnt
|
|
vmcnt(0), s_waitcnt
|
|
vscnt(0) and s_waitcnt
|
|
lgkmcnt(0) to allow
|
|
them to be
|
|
independently moved
|
|
according to the
|
|
following rules.
|
|
- s_waitcnt lgkmcnt(0)
|
|
must happen after
|
|
preceding
|
|
local load
|
|
atomic/store
|
|
atomic/atomicrmw
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
lgkmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vmcnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic load
|
|
atomic/
|
|
atomicrmw-with-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vmcnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- s_waitcnt vscnt(0)
|
|
must happen after
|
|
preceding
|
|
global/generic store
|
|
atomic/
|
|
atomicrmw-no-return-value
|
|
with memory
|
|
ordering of seq_cst
|
|
and with equal or
|
|
wider sync scope.
|
|
(Note that seq_cst
|
|
fences have their
|
|
own s_waitcnt
|
|
vscnt(0) and so do
|
|
not need to be
|
|
considered.)
|
|
- Ensures any
|
|
preceding
|
|
sequentially
|
|
consistent global
|
|
memory instructions
|
|
have completed
|
|
before executing
|
|
this sequentially
|
|
consistent
|
|
instruction. This
|
|
prevents reordering
|
|
a seq_cst store
|
|
followed by a
|
|
seq_cst load. (Note
|
|
that seq_cst is
|
|
stronger than
|
|
acquire/release as
|
|
the reordering of
|
|
load acquire
|
|
followed by a store
|
|
release is
|
|
prevented by the
|
|
s_waitcnt of
|
|
the release, but
|
|
there is nothing
|
|
preventing a store
|
|
release followed by
|
|
load acquire from
|
|
completing out of
|
|
order. The s_waitcnt
|
|
could be placed after
|
|
seq_store or before
|
|
the seq_load. We
|
|
choose the load to
|
|
make the s_waitcnt be
|
|
as late as possible
|
|
so that the store
|
|
may have already
|
|
completed.)
|
|
|
|
2. *Following
|
|
instructions same as
|
|
corresponding load
|
|
atomic acquire,
|
|
except must generate
|
|
all instructions even
|
|
for OpenCL.*
|
|
store atomic seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local store atomic release,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
atomicrmw seq_cst - singlethread - global *Same as corresponding
|
|
- wavefront - local atomicrmw acq_rel,
|
|
- workgroup - generic except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
fence seq_cst - singlethread *none* *Same as corresponding
|
|
- wavefront fence acq_rel,
|
|
- workgroup except must generate
|
|
- agent all instructions even
|
|
- system for OpenCL.*
|
|
============ ============ ============== ========== ================================
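
The splitting rules in the table above pair each kind of preceding memory
operation with the counter that tracks it. As a rough illustration only, that
pairing can be written as a small C++ function. The enum and function names
below are invented for this sketch and are not LLVM APIs; ``Load`` and
``Store`` stand for both the plain and the atomic forms.

```cpp
#include <cassert>

// Hypothetical names for illustration; not part of LLVM.
enum class AddrSpace { Global, Local, Generic };
enum class MemOp { Load, Store, AtomicRMWWithReturn, AtomicRMWNoReturn };

// Bit flags for the three GFX10 counters named in the table.
enum Counter : unsigned { VMCNT = 1u << 0, VSCNT = 1u << 1, LGKMCNT = 1u << 2 };

// Returns the counters that an s_waitcnt must drain to zero so that a
// preceding operation of the given kind is known to have completed.
inline unsigned countersToWaitOn(AddrSpace as, MemOp op) {
  unsigned mask = 0;
  const bool returnsData =
      op == MemOp::Load || op == MemOp::AtomicRMWWithReturn;
  // Global and generic accesses: vmcnt if data is returned, else vscnt.
  if (as == AddrSpace::Global || as == AddrSpace::Generic)
    mask |= returnsData ? VMCNT : VSCNT;
  // Local and generic accesses are tracked by lgkmcnt.
  if (as == AddrSpace::Local || as == AddrSpace::Generic)
    mask |= LGKMCNT;
  return mask;
}
```

For example, ``countersToWaitOn(AddrSpace::Generic, MemOp::Store)`` yields
``VSCNT | LGKMCNT``, matching the rows that list both global/generic and
local/generic stores.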

.. _amdgpu-amdhsa-trap-handler-abi:

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
supports the ``s_trap`` instruction. For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
     :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                           ``queue_ptr`` intrinsic (not implemented).
                                         ``VGPR0``:
                                           ``arg``
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                           ``queue_ptr`` the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
     :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

     =================== =============== =============== =======================================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                         breakpoints. Causes wave to be halted
                                                         with the PC at the trap instruction.
                                                         The debugger is responsible for
                                                         resuming the wave, including the
                                                         instruction that the breakpoint
                                                         overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                           ``queue_ptr`` the trap instruction. The associated
                                                         queue is signalled to put it into the
                                                         error state. When the queue is put in
                                                         the error state, the waves executing
                                                         dispatches on the queue will be
                                                         terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                           behaves as a no-operation. The trap
                                                           handler is entered and immediately
                                                           returns to continue execution of the
                                                           wavefront.
                                                         - If the debugger is enabled, causes
                                                           the debug trap to be reported by the
                                                           debugger and the wavefront is put in
                                                           the halt state with the PC at the
                                                           instruction. The debugger must
                                                           increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     reserved            ``s_trap 0x07``                 Reserved.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================================

..

  .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
     :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table

     =================== =============== ================ ================= =======================================
     Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
     =================== =============== ================ ================= =======================================
     reserved            ``s_trap 0x00``                                    Reserved by hardware.
     debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                            breakpoints. Causes wave to be halted
                                                                            with the PC at the trap instruction.
                                                                            The debugger is responsible for
                                                                            resuming the wave, including the
                                                                            instruction that the breakpoint
                                                                            overwrote.
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                           ``queue_ptr``                    the trap instruction. The associated
                                                                            queue is signalled to put it into the
                                                                            error state. When the queue is put in
                                                                            the error state, the waves executing
                                                                            dispatches on the queue will be
                                                                            terminated.
     ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled,
                                                                              behaves as a no-operation. The trap
                                                                              handler is entered and immediately
                                                                              returns to continue execution of the
                                                                              wavefront.
                                                                            - If the debugger is enabled, causes
                                                                              the debug trap to be reported by the
                                                                              debugger and the wavefront is put in
                                                                              the halt state with the PC at the
                                                                              instruction. The debugger must
                                                                              increment the PC and resume the wave.
     reserved            ``s_trap 0x04``                                    Reserved.
     reserved            ``s_trap 0x05``                                    Reserved.
     reserved            ``s_trap 0x06``                                    Reserved.
     reserved            ``s_trap 0x07``                                    Reserved.
     reserved            ``s_trap 0x08``                                    Reserved.
     reserved            ``s_trap 0xfe``                                    Reserved.
     reserved            ``s_trap 0xff``                                    Reserved.
     =================== =============== ================ ================= =======================================
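
The three trap handler tables differ mainly in which inputs ``llvm.trap``
(``s_trap 0x02``) expects. The following is a small sketch of that difference,
not the backend's actual lowering code. The enum and function names are
hypothetical, and only the targets listed in the tables (GFX6 through GFX10)
are considered.

```cpp
#include <cassert>

// Hypothetical names for illustration; not part of LLVM.
// Trap IDs taken from the Code Sequence column for code object V3 and above.
enum TrapId : unsigned {
  DebuggerBreakpoint = 0x01,
  LLVMTrap = 0x02,
  LLVMDebugTrap = 0x03
};

// Returns true if llvm.trap expects queue_ptr in SGPR0-1 according to the
// tables: always for code object V2 and V3, and only for GFX6-GFX8 for code
// object V4 and above.
inline bool trapNeedsQueuePtr(unsigned codeObjectVersion, unsigned gfxMajor) {
  if (codeObjectVersion <= 3)
    return true;
  return gfxMajor >= 6 && gfxMajor <= 8;
}
```

Under these assumptions, ``trapNeedsQueuePtr(4, 9)`` is false while
``trapNeedsQueuePtr(3, 9)`` is true.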

.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

.. note::

  This section is currently incomplete and has inaccuracies. It is WIP that will
  be updated as information is determined.

See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.

.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.

See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.

The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:

1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
    Reference* [HSA]_.

    - All structs are passed directly.
    - Lambda values are passed *TBA*.

    .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value struct?
      - What is ABI for lambda values?

2.  The kernel performs certain setup in its prolog, as described in
    :ref:`amdgpu-amdhsa-kernel-prolog`.

.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:

Non-Kernel Functions
++++++++++++++++++++

This section describes the call convention ABI for functions other than the
outer kernel function.

If a kernel has function calls then scratch is always allocated and used for
the call stack which grows from low address to high address using the swizzled
scratch address space.

On entry to a function:

1.  SGPR0-3 contain a V# with the following properties (see
    :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):

    * Base address pointing to the beginning of the wavefront scratch backing
      memory.
    * Swizzled with dword element size and stride of wavefront size elements.

2.  The FLAT_SCRATCH register pair is set up. See
    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
    :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
4.  The EXEC register is set to the lanes active on entry to the function.
5.  MODE register: *TBD*
6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
    below.
7.  SGPR30-31 return address (RA). The code address that the function must
    return to when it completes. The value is undefined if the function is *no
    return*.
8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
    offset relative to the beginning of the wavefront scratch backing memory.

    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
    manner.

    The unswizzled SP value can be converted into the swizzled SP value by:

      | swizzled SP = unswizzled SP / wavefront size

    This may be used to obtain the private address space address of stack
    objects and to convert this address to a flat address by adding the flat
    scratch aperture base address.

    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.

    .. note::

      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
      OpenCL language which has the largest base type defined as 16 bytes.

    On entry, the swizzled SP value is the address of the first function
    argument passed on the stack. Other stack passed arguments are positive
    offsets from the entry swizzled SP value.

    The function may use positive offsets beyond the last stack passed argument
    for stack allocated local variables and register spill slots. If necessary,
    the function may align these to greater alignment than 16 bytes. After these
    the function may dynamically allocate space for such things as runtime sized
    ``alloca`` local allocations.

    If the function calls another function, it will place any stack allocated
    arguments after the last local allocation and adjust SGPR32 to the address
    after the last local allocation.

9.  All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
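
The SP conversion described for SGPR32 above amounts to a division, followed by
an addition when a flat address is wanted. A minimal sketch, assuming
hypothetical helper names and caller-supplied values for the wavefront size and
the flat scratch aperture base:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not part of LLVM.

// swizzled SP = unswizzled SP / wavefront size
inline uint32_t swizzledSP(uint32_t unswizzledSP, uint32_t wavefrontSize) {
  return unswizzledSP / wavefrontSize;
}

// A flat address is formed by adding the flat scratch aperture base address
// to the private (swizzled) address of the stack object.
inline uint64_t flatScratchAddress(uint64_t apertureBase,
                                   uint32_t privateAddr) {
  return apertureBase + privateAddr;
}
```

With a wavefront size of 64, an unswizzled SP of 1024 corresponds to a swizzled
SP of 16. The aperture base used with ``flatScratchAddress`` is an input here
because its value is not specified in this section.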

On exit from a function:

1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
    described below. Any registers used are considered clobbered registers.
2.  The following registers are preserved and have the same value as on entry:

    * FLAT_SCRATCH
    * EXEC
    * GFX6-GFX8: M0
    * All SGPR registers except the clobbered registers of SGPR4-31.
    * VGPR40-47
    * VGPR56-63
    * VGPR72-79
    * VGPR88-95
    * VGPR104-111
    * VGPR120-127
    * VGPR136-143
    * VGPR152-159
    * VGPR168-175
    * VGPR184-191
    * VGPR200-207
    * VGPR216-223
    * VGPR232-239
    * VGPR248-255

      .. note::

        Except the argument registers, the VGPRs clobbered and the preserved
        registers are intermixed at regular intervals in order to keep a
        similar ratio independent of the number of allocated VGPRs.

    * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
    * Lanes of all VGPRs that are inactive at the call site.

    For the AMDGPU backend, an inter-procedural register allocation (IPRA)
    optimization may mark some of the clobbered SGPR and VGPR registers as
    preserved if it can be determined that the called function does not change
    their value.

3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
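
The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, each group of 16 registers has its first 8 preserved and its next 8
clobbered. The sketch below encodes that observation. The function names are
hypothetical and the predicate only restates the list, it is not how the
backend defines its callee-saved register set.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; not part of LLVM.

// Returns true if VGPR number v (0-255) is in one of the preserved ranges
// VGPR40-47, VGPR56-63, ..., VGPR248-255.
inline bool isPreservedVGPR(unsigned v) {
  return v >= 40 && v <= 255 && (v - 40) % 16 < 8;
}

// Counts how many of the 256 VGPRs are preserved under the rule above.
inline unsigned countPreservedVGPRs() {
  unsigned n = 0;
  for (unsigned v = 0; v < 256; ++v)
    n += isPreservedVGPR(v) ? 1u : 0u;
  return n;
}
```

The list contains 14 ranges of 8 registers, so ``countPreservedVGPRs()``
returns 112. IPRA may treat further registers as preserved, which this sketch
does not model.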

.. TODO::

  - How are function results returned? The address of structured types is passed
    by reference, but what about other types?

The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.

The source language input arguments are:

1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
2. Followed by the function formal arguments in left to right source order.

The source language result arguments are:

1. The function result argument.

The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.

The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.

*TODO: correct the ``sret`` definition.*

.. TODO::

  Is this definition correct? Or is ``sret`` only used if passing in registers, and
  pass as non-decomposed struct as stack argument? Or something else? Is the
  memory location in the caller stack frame, or a stack memory argument and so
  no address is passed as the caller can directly write to the argument stack
  location? But then the stack location is still live after return. If an
  argument stack location is it the first stack argument or the last one?

Lambda argument types are treated as struct types with an implementation defined
set of fields.

.. TODO::

  Need to specify the ABI for lambda types for AMDGPU.

For the AMDGPU backend all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
they are passed in SGPRs.

The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:

.. TODO::

  Is recursion or external functions supported?

1.  Work-Item ID (1 VGPR)

    The X, Y and Z work-item ID are packed into a single VGPR with the following
    layout. Only fields actually used by the function are set. The other bits
    are undefined.

    The values come from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.

    .. table:: Work-item implicit argument layout
      :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

      ======= ======= ==============
      Bits    Size    Field Name
      ======= ======= ==============
      9:0     10 bits X Work-Item ID
      19:10   10 bits Y Work-Item ID
      29:20   10 bits Z Work-Item ID
      31:30   2 bits  Unused
      ======= ======= ==============

2.  Dispatch Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

3.  Queue Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

4.  Kernarg Segment Ptr (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

5.  Dispatch id (2 SGPRs)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

6.  Work-Group ID X (1 SGPR)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

7.  Work-Group ID Y (1 SGPR)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

8.  Work-Group ID Z (1 SGPR)

    The value comes from the initial kernel execution state. See
    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

9.  Implicit Argument Ptr (2 SGPRs)

    The value is computed by adding an offset to Kernarg Segment Ptr to get the
    global address space pointer to the first kernarg implicit argument.
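
The packed Work-Item ID layout in item 1 above can be decoded with shifts and a
10-bit mask. The following sketch uses hypothetical type and function names.
Because bits 31:30 and any unused fields are undefined on entry, the decoder
masks every field rather than relying on the other bits being zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not part of LLVM.
struct WorkItemId {
  uint32_t x, y, z;
};

// Packs three 10-bit IDs into bits 9:0 (X), 19:10 (Y) and 29:20 (Z).
inline uint32_t packWorkItemId(uint32_t x, uint32_t y, uint32_t z) {
  return (x & 0x3ffu) | ((y & 0x3ffu) << 10) | ((z & 0x3ffu) << 20);
}

// Extracts the X, Y and Z work-item IDs from the packed 32-bit value.
inline WorkItemId unpackWorkItemId(uint32_t packed) {
  WorkItemId id;
  id.x = packed & 0x3ffu;
  id.y = (packed >> 10) & 0x3ffu;
  id.z = (packed >> 20) & 0x3ffu;
  return id;
}
```

For instance, ``unpackWorkItemId(packWorkItemId(5, 6, 7))`` gives back X = 5,
Y = 6 and Z = 7.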

The input and result arguments are assigned in order in the following manner:

.. note::

  There are likely some errors and omissions in the following description that
  need correction.

  .. TODO::

    Check the Clang source code to decipher how function arguments and return
    results are handled. Also see the AMDGPU specific values used.

* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

  .. TODO::

    How are overly aligned structures allocated on the stack?

* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.

Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.

.. TODO::

  So, a struct which can pass some fields as decomposed register arguments, will
  pass the rest as decomposed stack elements? But an argument that will not start
  in registers will not be decomposed and will be passed as a non-decomposed
  stack value?
|
|
|
|
The following is not part of the AMDGPU function calling convention but
|
|
describes how the AMDGPU implements function calls:
|
|
|
|
1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
|
|
unswizzled scratch address. It is only needed if runtime sized ``alloca``
|
|
are used, or for the reasons defined in ``SIFrameLowering``.
|
|
2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
|
|
to access the incoming stack arguments in the function. The BP is needed
|
|
only when the function requires the runtime stack alignment.
|
|
|
|
3. Allocating SGPR arguments on the stack are not supported.
|
|
|
|
4. No CFI is currently generated. See
|
|
:ref:`amdgpu-dwarf-call-frame-information`.
|
|
|
|
.. note::
|
|
|
|
CFI will be generated that defines the CFA as the unswizzled address
|
|
relative to the wave scratch base in the unswizzled private address space
|
|
of the lowest address stack allocated local variable.
|
|
|
|
``DW_AT_frame_base`` will be defined as the swizzled address in the
|
|
swizzled private address space by dividing the CFA by the wavefront size
|
|
(since CFA is always at least dword aligned which matches the scratch
|
|
swizzle element size).
|
|
|
|
If no dynamic stack alignment was performed, the stack allocated arguments
|
|
are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
|
|
local variables and register spill slots are accessed as positive offsets
|
|
relative to ``DW_AT_frame_base``.
|
|
|
|
5. Function argument passing is implemented by copying the input physical
|
|
registers to virtual registers on entry. The register allocator can spill if
|
|
necessary. These are copied back to physical registers at call sites. The
|
|
net effect is that each function call can have these values in entirely
|
|
distinct locations. The IPRA can help avoid shuffling argument registers.
|
|
6. Call sites are implemented by setting up the arguments at positive offsets
|
|
from SP. Then SP is incremented to account for the known frame size before
|
|
the call and decremented after the call.
|
|
|
|
.. note::
|
|
|
|
The CFI will reflect the changed calculation needed to compute the CFA
|
|
from SP.
|
|
|
|
7. 4-byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.
|
|
|
|
.. TODO::
|
|
|
|
Explain when the emergency spill slot is used.
|
|
|
|
.. TODO::
|
|
|
|
Possible broken issues:
|
|
|
|
- Stack arguments must be aligned to required alignment.
|
|
- Stack is aligned to max(16, max formal argument alignment)
|
|
- Direct argument < 64 bits should check register budget.
|
|
- Register budget calculation should respect ``inreg`` for SGPR.
|
|
- SGPR overflow is not handled.
|
|
- struct with 1 member unpeeling is not checking size of member.
|
|
- ``sret`` is after ``this`` pointer.
|
|
- Caller is not implementing stack realignment: need an extra pointer.
|
|
- Should say AMDGPU passes FP rather than SP.
|
|
- Should CFI define CFA as address of locals or arguments. Difference is
|
|
apparent when have implemented dynamic alignment.
|
|
- If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
|
|
highest address of stack frame and use negative offset for locals. Would
|
|
allow SP to be the same as FP and could support signal-handler-like as now
|
|
have a real SP for the top of the stack.
|
|
- How is ``sret`` passed on the stack? In argument stack area? Can it overlay
|
|
arguments?
|
|
|
|
AMDPAL
|
|
------
|
|
|
|
This section provides code conventions used when the target triple OS is
|
|
``amdpal`` (see :ref:`amdgpu-target-triples`).
|
|
|
|
.. _amdgpu-amdpal-code-object-metadata-section:
|
|
|
|
Code Object Metadata
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. note::

  The metadata is currently in development and is subject to major
  changes. Only the current version is supported. *When this document
  was generated the version was 2.6.*
|
|
|
|
Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
|
|
record (see :ref:`amdgpu-note-records-v3-onwards`).
|
|
|
|
The metadata is represented as Message Pack formatted binary data (see
|
|
[MsgPack]_). The top level is a Message Pack map that includes the keys
|
|
defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
|
|
and referenced tables.
|
|
|
|
Additional information can be added to the maps. To avoid conflicts, any
|
|
key names should be prefixed by "*vendor-name*." where ``vendor-name``
|
|
can be the name of the vendor and specific vendor tool that generates the
|
|
information. The prefix is abbreviated to simply "." when it appears
|
|
within a map that has been added by the same *vendor-name*.
|
|
|
|
.. table:: AMDPAL Code Object Metadata Map
   :name: amdgpu-amdpal-code-object-metadata-map-table

   =================== ============== ========= ======================================================================
   String Key          Value Type     Required? Description
   =================== ============== ========= ======================================================================
   "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                       2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
   "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                       map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                definition of the keys included in that map.
   =================== ============== ========= ======================================================================
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Pipeline Metadata Map
|
|
:name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
|
|
|
|
====================================== ============== ========= ===================================================
|
|
String Key Value Type Required? Description
|
|
====================================== ============== ========= ===================================================
|
|
".name" string Source name of the pipeline.
|
|
".type" string Pipeline type, e.g. VsPs. Values include:
|
|
|
|
- "VsPs"
|
|
- "Gs"
|
|
- "Cs"
|
|
- "Ngg"
|
|
- "Tess"
|
|
- "GsTess"
|
|
- "NggTess"
|
|
|
|
".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
|
|
2 integers 64 bits is the "stable" portion of the hash, used
|
|
for e.g. shader replacement lookup. Upper 64 bits
|
|
is the "unique" portion of the hash, used for
|
|
e.g. pipeline cache lookup. The value is
|
|
implementation defined, and can not be relied on
|
|
between different builds of the compiler.
|
|
".shaders" map Per-API shader metadata. See
|
|
:ref:`amdgpu-amdpal-code-object-shader-map-table`
|
|
for the definition of the keys included in that
|
|
map.
|
|
".hardware_stages" map Per-hardware stage metadata. See
|
|
:ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
|
|
for the definition of the keys included in that
|
|
map.
|
|
".shader_functions" map Per-shader function metadata. See
|
|
:ref:`amdgpu-amdpal-code-object-shader-function-map-table`
|
|
for the definition of the keys included in that
|
|
map.
|
|
".registers" map Required Hardware register configuration. See
|
|
:ref:`amdgpu-amdpal-code-object-register-map-table`
|
|
for the definition of the keys included in that
|
|
map.
|
|
".user_data_limit" integer Number of user data entries accessed by this
|
|
pipeline.
|
|
".spill_threshold" integer The user data spill threshold. 0xFFFF for
|
|
NoUserDataSpilling.
|
|
".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
|
|
viewport array index feature. Pipelines which use
|
|
this feature can render into all 16 viewports,
|
|
whereas pipelines which do not use it are
|
|
restricted to viewport #0.
|
|
".es_gs_lds_size" integer Size in bytes of LDS space used internally for
|
|
handling data-passing between the ES and GS
|
|
shader stages. This can be zero if the data is
|
|
passed using off-chip buffers. This value should
|
|
be used to program all user-SGPRs which have been
|
|
marked with "UserDataMapping::EsGsLdsSize"
|
|
(typically only the GS and VS HW stages will ever
|
|
have a user-SGPR so marked).
|
|
".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
|
|
(maximum number of threads in a subgroup).
|
|
".num_interpolants" integer Graphics only. Number of PS interpolants.
|
|
".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
|
|
".api" string Name of the client graphics API.
|
|
".api_create_info" binary Graphics API shader create info binary blob. Can
|
|
be defined by the driver using the compiler if
|
|
they want to be able to correlate API-specific
|
|
information used during creation at a later time.
|
|
====================================== ============== ========= ===================================================
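
As an illustrative sketch only, the required keys of the two maps above can be
modelled as a Python dictionary before serialization to Message Pack. The
helper name ``missing_required_keys``, the pipeline name, the hash values, and
the register offset are assumptions made for this example and are not part of
the PAL ABI. Serialization itself is omitted because it needs a third-party
Message Pack library.

.. code-block:: python

  # Keys marked "Required" in the two tables above.
  REQUIRED_TOP_LEVEL = ("amdpal.version", "amdpal.pipelines")
  REQUIRED_PIPELINE = (".internal_pipeline_hash", ".registers")


  def missing_required_keys(metadata):
      """Return the required keys that are absent (hypothetical helper)."""
      missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
      for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
          missing += [f"amdpal.pipelines[{index}]{key}"
                      for key in REQUIRED_PIPELINE if key not in pipeline]
      return missing


  # Placeholder values chosen only for illustration.
  metadata = {
      "amdpal.version": [2, 6],
      "amdpal.pipelines": [
          {
              ".name": "example",
              ".type": "Cs",
              ".internal_pipeline_hash": [0x1234, 0x5678],
              ".registers": {0x2E12: 0x00000000},  # hypothetical reg offset
          }
      ],
  }

Calling ``missing_required_keys(metadata)`` on this map returns an empty list,
while an empty map reports both top-level keys as missing.
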
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Shader Map
|
|
:name: amdgpu-amdpal-code-object-shader-map-table
|
|
|
|
|
|
+-------------+--------------+-------------------------------------------------------------------+
|
|
|String Key |Value Type |Description |
|
|
+=============+==============+===================================================================+
|
|
|- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
|
|
|- ".vertex" | |for the definition of the keys included in that map. |
|
|
|- ".hull" | | |
|
|
|- ".domain" | | |
|
|
|- ".geometry"| | |
|
|
|- ".pixel" | | |
|
|
+-------------+--------------+-------------------------------------------------------------------+
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object API Shader Metadata Map
|
|
:name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
|
|
|
|
==================== ============== ========= =====================================================================
|
|
String Key Value Type Required? Description
|
|
==================== ============== ========= =====================================================================
|
|
".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
|
|
2 integers is implementation defined, and can not be relied on between
|
|
different builds of the compiler.
|
|
".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
|
|
string include:
|
|
|
|
- ".ls"
|
|
- ".hs"
|
|
- ".es"
|
|
- ".gs"
|
|
- ".vs"
|
|
- ".ps"
|
|
- ".cs"
|
|
|
|
==================== ============== ========= =====================================================================
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Hardware Stage Map
|
|
:name: amdgpu-amdpal-code-object-hardware-stage-map-table
|
|
|
|
+-------------+--------------+-----------------------------------------------------------------------+
|
|
|String Key |Value Type |Description |
|
|
+=============+==============+=======================================================================+
|
|
|- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
|
|
|- ".hs" | |for the definition of the keys included in that map. |
|
|
|- ".es" | | |
|
|
|- ".gs" | | |
|
|
|- ".vs" | | |
|
|
|- ".ps" | | |
|
|
|- ".cs" | | |
|
|
+-------------+--------------+-----------------------------------------------------------------------+
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Hardware Stage Metadata Map
|
|
:name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
|
|
|
|
========================== ============== ========= ===============================================================
|
|
String Key Value Type Required? Description
|
|
========================== ============== ========= ===============================================================
|
|
".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
|
|
".scratch_memory_size" integer Scratch memory size in bytes.
|
|
".lds_size" integer Local Data Share size in bytes.
|
|
".perf_data_buffer_size" integer Performance data buffer size in bytes.
|
|
".vgpr_count" integer Number of VGPRs used.
|
|
".sgpr_count" integer Number of SGPRs used.
|
|
".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
|
|
directive to instruct the compiler to limit the VGPR usage to
|
|
be less than or equal to the specified value (only set if
|
|
different from HW default).
|
|
".sgpr_limit" integer SGPR count upper limit (only set if different from HW
|
|
default).
|
|
".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
|
|
3 integers
|
|
".wavefront_size" integer Wavefront size (only set if different from HW default).
|
|
".uses_uavs" boolean The shader reads or writes UAVs.
|
|
".uses_rovs" boolean The shader reads or writes ROVs.
|
|
".writes_uavs" boolean The shader writes to one or more UAVs.
|
|
".writes_depth" boolean The shader writes out a depth value.
|
|
".uses_append_consume" boolean The shader uses append and/or consume operations, either
|
|
memory or GDS.
|
|
".uses_prim_id" boolean The shader uses PrimID.
|
|
========================== ============== ========= ===============================================================
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Shader Function Map
   :name: amdgpu-amdpal-code-object-shader-function-map-table

   =============== ============== ====================================================================
   String Key      Value Type     Description
   =============== ============== ====================================================================
   *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                  entry address. The value is the function's metadata. See
                                  :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
   =============== ============== ====================================================================
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Shader Function Metadata Map
|
|
:name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
|
|
|
|
============================= ============== =================================================================
|
|
String Key Value Type Description
|
|
============================= ============== =================================================================
|
|
".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
|
|
2 integers is implementation defined, and can not be relied on between
|
|
different builds of the compiler.
|
|
".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
|
|
".lds_size" integer Size in bytes of LDS memory.
|
|
".vgpr_count" integer Number of VGPRs used by the shader.
|
|
".sgpr_count" integer Number of SGPRs used by the shader.
|
|
".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
|
|
".shader_subtype" string Shader subtype/kind. Values include:
|
|
|
|
- "Unknown"
|
|
|
|
============================= ============== =================================================================
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL Code Object Register Map
|
|
:name: amdgpu-amdpal-code-object-register-map-table
|
|
|
|
========================== ============== ====================================================================
|
|
32-bit Integer Key Value Type Description
|
|
========================== ============== ====================================================================
|
|
``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
|
|
a GRBM register (i.e., driver accessible GPU register number, not
|
|
shader GPR register number). The driver is required to program each
|
|
specified register to the corresponding specified value when
|
|
executing this pipeline. Typically, the ``reg offsets`` are the
|
|
``uint16_t`` offsets to each register as defined by the hardware
|
|
chip headers. The register is set to the provided value. However, a
|
|
``reg offset`` that specifies a user data register (e.g.,
|
|
COMPUTE_USER_DATA_0) needs special treatment. See
|
|
:ref:`amdgpu-amdpal-code-object-user-data-section` section for more
|
|
information.
|
|
========================== ============== ====================================================================
|
|
|
|
.. _amdgpu-amdpal-code-object-user-data-section:
|
|
|
|
User Data
|
|
+++++++++
|
|
|
|
Each hardware stage has a set of 32-bit physical SPI *user data registers*
|
|
(either 16 or 32 based on graphics IP and the stage) which can be
|
|
written from a command buffer and then loaded into SGPRs when waves are
|
|
launched via a subsequent dispatch or draw operation. This is the way
|
|
most arguments are passed from the application/runtime to a hardware
|
|
shader.
|
|
|
|
PAL abstracts this functionality by exposing a set of 128 *user data
|
|
entries* per pipeline a client can use to pass arguments from a command
|
|
buffer to one or more shaders in that pipeline. The ELF code object must
|
|
specify a mapping from virtualized *user data entries* to physical *user
|
|
data registers*, and PAL is responsible for implementing that mapping,
|
|
including spilling overflow *user data entries* to memory if needed.
|
|
|
|
Since the *user data registers* are GRBM-accessible SPI registers, this
|
|
mapping is actually embedded in the ``.registers`` metadata entry. For
|
|
most registers, the value in that map is a literal 32-bit value that
|
|
should be written to the register by the driver. However, when the
|
|
register is a *user data register* (any USER_DATA register e.g.,
|
|
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
|
|
the driver to write either a *user data entry* value or one of several
|
|
driver-internal values to the register. This encoding is described in
|
|
the following table:
|
|
|
|
.. note::

  Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
  and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
  always be programmed to the address of the GlobalTable, and *user data
  register* 1 must always be programmed to the address of the PerShaderTable.
|
|
|
|
..
|
|
|
|
.. table:: AMDPAL User Data Mapping
|
|
:name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
|
|
|
|
========== ================= ===============================================================================
|
|
Value Name Description
|
|
========== ================= ===============================================================================
|
|
0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
|
|
0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
|
|
always point to *user data register* 0).
|
|
0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
|
|
:ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
|
|
for more detail (should always point to *user data register* 1).
|
|
0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
|
|
:ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
|
|
more detail.
|
|
0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
|
|
reference the draw index in the vertex shader. Only supported by the first
|
|
stage in a graphics pipeline.
|
|
0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
|
|
a graphics pipeline.
|
|
0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
|
|
graphics pipeline.
|
|
0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
|
|
a buffer containing the grid dimensions for a Compute dispatch operation. The
|
|
high half of the address is stored in the next sequential user-SGPR. Only
|
|
supported by compute pipelines.
|
|
0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
|
|
space used for the ES/GS pseudo-ring-buffer for passing data between shader
|
|
stages.
|
|
0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
|
|
pipeline instancing.
|
|
0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
|
|
can only appear for one shader stage per pipeline.
|
|
0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
|
|
0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
|
|
only appear for one shader stage per pipeline.
|
|
0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
|
|
only appear for one shader stage per pipeline (PS). These replace color targets
|
|
and are completely separate from any UAVs used by the shader. This is optional,
|
|
and only used by the PS when UAV exports are used to replace color-target
|
|
exports to optimize specific shaders.
|
|
0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
|
|
some NGG pipelines to perform culling. This value contains the address of the
|
|
first of two consecutive registers which provide the full GPU address.
|
|
0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
|
|
========== ================= ===============================================================================
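
The encoding in the table above can be illustrated with a small decoder. This
is a sketch rather than PAL code: the function name
``decode_user_data_mapping`` and the ``("entry", N)`` / ``("internal", name)``
/ ``("unknown", value)`` result tuples are assumptions made for this example.
The numeric values and names are taken from the table.

.. code-block:: python

  # Driver-internal values listed in the AMDPAL User Data Mapping table.
  SPECIAL_MAPPINGS = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  # Values 0..127 select one of the 128 user data entries.
  MAX_USER_DATA_ENTRIES = 128


  def decode_user_data_mapping(value):
      """Classify the value stored for a user data register."""
      if 0 <= value < MAX_USER_DATA_ENTRIES:
          return ("entry", value)
      name = SPECIAL_MAPPINGS.get(value)
      if name is not None:
          return ("internal", name)
      # Values not listed in the table are reported, not guessed at.
      return ("unknown", value)

For example, ``decode_user_data_mapping(0x10000002)`` yields
``("internal", "SpillTable")`` and ``decode_user_data_mapping(5)`` yields
``("entry", 5)``.
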
|
|
|
|
.. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
|
|
|
|
Per-Shader Table
|
|
################
|
|
|
|
Low 32 bits of the GPU address for an optional buffer in the ``.data``
|
|
section of the ELF. The high 32 bits of the address match the high 32 bits
|
|
of the shader's program counter.
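
The address reconstruction described above can be sketched as follows. The
function name ``per_shader_table_address`` and the sample addresses are
assumptions made for illustration; a shader would do the equivalent with
scalar instructions rather than Python.

.. code-block:: python

  MASK_32 = 0xFFFFFFFF


  def per_shader_table_address(program_counter, table_address_low):
      """Combine the high 32 bits of the PC with the low 32 bits of the table."""
      high = program_counter & ~MASK_32
      return high | (table_address_low & MASK_32)


  # Hypothetical 64-bit program counter and 32-bit user data register value.
  address = per_shader_table_address(0x00007F1200001000, 0x40002000)

With these sample inputs the resulting address is ``0x00007F1240002000``.
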
|
|
|
|
The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.
|
|
|
|
Each shader's table in the ``.data`` section is referenced by the symbol
|
|
``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
|
|
hardware shader stage the data is for. E.g.,
|
|
``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
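
As a small sketch of the naming scheme, the symbol can be derived from the
hardware stage abbreviation. The function name ``per_shader_table_symbol`` is
hypothetical, and the set of accepted abbreviations is assumed to be the stage
keys of the AMDPAL Code Object Hardware Stage Map without the leading dot.

.. code-block:: python

  # Assumed stage abbreviations, taken from the hardware stage map keys.
  HARDWARE_STAGES = ("ls", "hs", "es", "gs", "vs", "ps", "cs")


  def per_shader_table_symbol(stage):
      """Return the .data symbol name for a hardware stage's internal table."""
      if stage not in HARDWARE_STAGES:
          raise ValueError(f"unknown hardware stage: {stage!r}")
      return f"_amdgpu_{stage}_shdr_intrl_data"

``per_shader_table_symbol("cs")`` reproduces the
``_amdgpu_cs_shdr_intrl_data`` example given above.
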
|
|
|
|
.. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
|
|
|
|
Spill Table
|
|
###########
|
|
|
|
A hardware shader may need access to more *user data entries* than there
are user data register slots available for one or more hardware shader
stages. In that case, the PAL runtime expects the necessary *user data
entries* to be spilled to GPU memory, with one user data register pointing
to the spilled user data memory. The value of the *user data entry* must
then represent the location where a shader expects to read the low 32 bits
of the table's GPU virtual address. The *spill table* itself is a set of
32-bit values managed by the PAL runtime in GPU-accessible memory that can
be made indirectly accessible to a hardware shader.
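
The following sketch shows one way a compiler might decide which *user data
entries* stay in registers and which are spilled. It is not the PAL or LLVM
algorithm. The function name ``split_user_data``, the policy of keeping the
lowest-numbered entries in registers, and the default of two reserved
registers (for the GlobalTable and PerShaderTable) are all assumptions made
for illustration.

.. code-block:: python

  def split_user_data(entries, register_count, reserved=2):
      """Partition user data entries into register-resident and spilled sets."""
      available = register_count - reserved
      if available < 1:
          raise ValueError("no user data registers left for entries")
      ordered = sorted(entries)
      if len(ordered) <= available:
          # Everything fits; no spill table pointer is needed.
          return {"in_registers": ordered, "spilled": [],
                  "needs_spill_table": False}
      # One register is given up to hold the spill table pointer.
      keep = available - 1
      return {"in_registers": ordered[:keep], "spilled": ordered[keep:],
              "needs_spill_table": True}


  # A stage with 16 user data registers that accesses entries 0..19.
  layout = split_user_data(range(20), register_count=16)

Here 14 registers remain after the two reserved ones, so 13 entries stay in
registers, one register holds the spill table pointer, and entries 13 to 19
are spilled.
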
|
|
|
|
Unspecified OS
|
|
--------------
|
|
|
|
This section provides code conventions used when the target triple OS is
|
|
empty (see :ref:`amdgpu-target-triples`).
|
|
|
|
Trap Handler ABI
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
|
|
|
|
.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
   :name: amdgpu-trap-handler-for-non-amdhsa-os-table

   =============== =============== ===========================================
   Usage           Code Sequence   Description
   =============== =============== ===========================================
   llvm.trap       s_endpgm        Causes wavefront to be terminated.
   llvm.debugtrap  *none*          Compiler warning given that there is no
                                   trap handler installed.
   =============== =============== ===========================================
|
|
|
|
Source Languages
|
|
================
|
|
|
|
.. _amdgpu-opencl:
|
|
|
|
OpenCL
|
|
------
|
|
|
|
When the language is OpenCL the following differences occur:
|
|
|
|
1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
|
|
2. The AMDGPU backend appends additional arguments to the kernel's explicit
|
|
arguments for the AMDHSA OS (see
|
|
:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
|
|
3. Additional metadata is generated
|
|
(see :ref:`amdgpu-amdhsa-code-object-metadata`).
|
|
|
|
.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
   :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

   ======== ==== ========= ===========================================
   Position Byte Byte      Description
            Size Alignment
   ======== ==== ========= ===========================================
   1        8    8         OpenCL Global Offset X
   2        8    8         OpenCL Global Offset Y
   3        8    8         OpenCL Global Offset Z
   4        8    8         OpenCL address of printf buffer
   5        8    8         OpenCL address of virtual queue used by
                           enqueue_kernel.
   6        8    8         OpenCL address of AqlWrap struct used by
                           enqueue_kernel.
   7        8    8         Pointer argument used for Multi-grid
                           synchronization.
   ======== ==== ========= ===========================================
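
To make the layout concrete, the sketch below computes byte offsets for the
seven implicit arguments. It assumes that each implicit argument is placed at
the next offset after the explicit arguments that satisfies its alignment.
The argument names and the function ``implicit_argument_offsets`` are
hypothetical labels for the table rows, not identifiers used by the backend.

.. code-block:: python

  # (name, byte size, byte alignment), in the order given by the table.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("default_queue", 8, 8),
      ("aqlwrap", 8, 8),
      ("multigrid_sync_arg", 8, 8),
  ]


  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment


  def implicit_argument_offsets(explicit_args_size):
      """Map each implicit argument name to its assumed byte offset."""
      offsets = {}
      offset = explicit_args_size
      for name, size, alignment in IMPLICIT_ARGS:
          offset = align_up(offset, alignment)
          offsets[name] = offset
          offset += size
      return offsets


  # A kernel whose explicit arguments occupy 12 bytes.
  offsets = implicit_argument_offsets(12)

Under these assumptions the first implicit argument lands at byte 16 and the
last at byte 64.
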
|
|
|
|
.. _amdgpu-hcc:
|
|
|
|
HCC
|
|
---
|
|
|
|
When the language is HCC the following differences occur:
|
|
|
|
1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
|
|
|
|
.. _amdgpu-assembler:
|
|
|
|
Assembler
|
|
---------
|
|
|
|
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.
|
|
|
|
This section describes general syntax for instructions and operands.
|
|
|
|
Instructions
|
|
~~~~~~~~~~~~
|
|
|
|
An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
|
|
|
|
| ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
  <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
|
|
|
|
:doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
|
|
:doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
|
|
|
|
The order of operands and modifiers is fixed.
|
|
Most modifiers are optional and may be omitted.
|
|
|
|
Links to detailed descriptions of the instruction syntax can be found in the
following table. Note that features under development are not included
in this description.
|
|
|
|
    =================================== =======================================
    Core ISA                            ISA Extensions
    =================================== =======================================
    :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
    :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
    :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                        :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                        :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                        :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                        :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

    :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                        :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
    =================================== =======================================
|
|
|
|
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
|
|
|
|
Operands
|
|
~~~~~~~~
|
|
|
|
A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
|
|
|
|
Modifiers
|
|
~~~~~~~~~
|
|
|
|
A detailed description of modifiers may be found
:doc:`here<AMDGPUModifierSyntax>`.
|
|
|
|
Instruction Examples
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
DS
|
|
++
|
|
|
|
.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.
|
|
|
|
FLAT
|
|
++++
|
|
|
|
.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.
|
|
|
|
MUBUF
|
|
+++++
|
|
|
|
.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.
|
|
|
|
SMRD/SMEM
|
|
+++++++++
|
|
|
|
.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.
|
|
|
|
SOP1
|
|
++++
|
|
|
|
.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.
|
|
|
|
SOP2
|
|
++++
|
|
|
|
.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.
|
|
|
|
SOPC
|
|
++++
|
|
|
|
.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.
|
|
|
|
SOPP
|
|
++++
|
|
|
|
.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0                                   ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0)   ; Equivalent to above
  s_waitcnt vmcnt(1)                            ; Wait until the vmcnt counter is at most 1
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.
|
|
|
|
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
|
|
|
|
VALU
|
|
++++
|
|
|
|
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the
operands. To force a specific encoding, add a suffix to the opcode of the
instruction:
|
|
|
|
* _e32 for 32-bit VOP1/VOP2/VOPC
|
|
* _e64 for 64-bit VOP3
|
|
* _dpp for VOP_DPP
|
|
* _sdwa for VOP_SDWA
|
|
|
|
VOP1/VOP2/VOP3/VOPC examples:
|
|
|
|
.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3
|
|
|
|
VOP_DPP examples:
|
|
|
|
.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
|
|
|
|
VOP_SDWA examples:
|
|
|
|
.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
|
|
|
|
For a full list of supported instructions, refer to "Vector ALU instructions".
|
|
|
|
.. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
|
|
|
|
Code Object V2 Predefined Symbols
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.
|
|
|
|
The AMDGPU assembler defines and updates some symbols automatically. These
|
|
symbols do not affect code generation.
|
|
|
|
.option.machine_version_major
|
|
+++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX major generation number of the target being assembled for. For
|
|
example, when assembling for a "GFX9" target this will be set to the integer
|
|
value "9". The possible GFX major generation numbers are presented in
|
|
:ref:`amdgpu-processors`.
|
|
|
|
.option.machine_version_minor
|
|
+++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX minor generation number of the target being assembled for. For
|
|
example, when assembling for a "GFX810" target this will be set to the integer
|
|
value "1". The possible GFX minor generation numbers are presented in
|
|
:ref:`amdgpu-processors`.
|
|
|
|
.option.machine_version_stepping
|
|
++++++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX stepping generation number of the target being assembled for.
|
|
For example, when assembling for a "GFX704" target this will be set to the
|
|
integer value "4". The possible GFX stepping generation numbers are presented
|
|
in :ref:`amdgpu-processors`.
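
The three version symbols can be read off a processor name. The Python sketch
below is only an illustration: it assumes the name has the form ``gfx``
followed by a decimal major number, one minor digit and one stepping digit,
and it assumes that a stepping letter (as in ``gfx90a``) is a hexadecimal
digit. The helper ``split_gfx_version`` is hypothetical and not part of LLVM.

.. code-block:: python

  from typing import Tuple

  def split_gfx_version(processor: str) -> Tuple[int, int, int]:
      """Split a name such as 'gfx810' into (major, minor, stepping)."""
      if not processor.startswith("gfx") or len(processor) < 6:
          raise ValueError(f"unexpected processor name: {processor!r}")
      digits = processor[3:]
      major = int(digits[:-2])        # everything before the last two characters
      minor = int(digits[-2], 16)     # second-to-last character
      stepping = int(digits[-1], 16)  # last character (assumed hexadecimal)
      return major, minor, stepping

With these assumptions, ``gfx810`` yields a minor number of 1 and ``gfx704``
yields a stepping number of 4, matching the examples given above.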
|
|
|
|
.kernel.vgpr_count
|
|
++++++++++++++++++
|
|
|
|
Set to zero each time a
|
|
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
|
|
encountered. At each instruction, if the current value of this symbol is less
|
|
than or equal to the maximum VGPR number explicitly referenced within that
|
|
instruction then the symbol value is updated to equal that VGPR number plus
|
|
one.
|
|
|
|
.kernel.sgpr_count
|
|
++++++++++++++++++
|
|
|
|
Set to zero each time a
|
|
:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
|
|
encountered. At each instruction, if the current value of this symbol is less
|
|
than or equal to the maximum SGPR number explicitly referenced within that
|
|
instruction then the symbol value is updated to equal that SGPR number plus
|
|
one.
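
The update rule shared by ``.kernel.vgpr_count`` and ``.kernel.sgpr_count``
can be sketched in Python as follows. This is a model of the rule as worded
above, not the assembler's implementation; ``update_gpr_count`` is a
hypothetical helper, and the register numbers are assumed to have already
been extracted from each instruction.

.. code-block:: python

  from typing import Iterable

  def update_gpr_count(current: int, referenced: Iterable[int]) -> int:
      """Apply the per-instruction update rule to a GPR count symbol.

      `referenced` holds the register numbers explicitly named by one
      instruction.
      """
      numbers = list(referenced)
      if not numbers:
          return current
      highest = max(numbers)
      # Bump only when the current value is <= the highest referenced number.
      return highest + 1 if current <= highest else current

  count = 0                                # reset by .amdgpu_hsa_kernel
  count = update_gpr_count(count, [1, 2])  # e.g. v_mov_b32 v1, v2
  count = update_gpr_count(count, [0])     # a lower register changes nothing

After the two instructions above the symbol would hold 3, because ``v2`` is
the highest register referenced.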
|
|
|
|
.. _amdgpu-amdhsa-assembler-directives-v2:
|
|
|
|
Code Object V2 Directives
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. warning::
  Code object V2 is not the default code object version emitted by
  this version of LLVM.
|
|
|
|
The AMDGPU ABI defines auxiliary data in the output code object. In assembly source,
|
|
one can specify them with assembler directives.
|
|
|
|
.hsa_code_object_version major, minor
|
|
+++++++++++++++++++++++++++++++++++++
|
|
|
|
*major* and *minor* are integers that specify the version of the HSA code
|
|
object that will be generated by the assembler.
|
|
|
|
.hsa_code_object_isa [major, minor, stepping, vendor, arch]
|
|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
|
|
|
|
|
*major*, *minor*, and *stepping* are all integers that describe the instruction
|
|
set architecture (ISA) version of the assembly program.
|
|
|
|
*vendor* and *arch* are quoted strings. *vendor* should always be equal to
|
|
"AMD" and *arch* should always be equal to "AMDGPU".
|
|
|
|
By default, the assembler will derive the ISA version, *vendor*, and *arch*
|
|
from the value of the -mcpu option that is passed to the assembler.
|
|
|
|
.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
|
|
|
|
.amdgpu_hsa_kernel (name)
|
|
+++++++++++++++++++++++++
|
|
|
|
This directive specifies that the symbol with the given name is a kernel entry
|
|
point (label) and that the object should contain a corresponding symbol of type
|
|
STT_AMDGPU_HSA_KERNEL.
|
|
|
|
.amd_kernel_code_t
|
|
++++++++++++++++++
|
|
|
|
This directive marks the beginning of a list of key / value pairs that are used
|
|
to specify the amd_kernel_code_t object that will be emitted by the assembler.
|
|
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
|
|
amd_kernel_code_t values that are unspecified, a default value will be used. The
|
|
default value for all keys is 0, with the following exceptions:
|
|
|
|
- *amd_code_version_major* defaults to 1.
|
|
- *amd_kernel_code_version_minor* defaults to 2.
|
|
- *amd_machine_kind* defaults to 1.
|
|
- *amd_machine_version_major*, *amd_machine_version_minor*, and
|
|
*amd_machine_version_stepping* are derived from the value of the -mcpu option
|
|
that is passed to the assembler.
|
|
- *kernel_code_entry_byte_offset* defaults to 256.
|
|
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
|
|
defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
|
|
Note that wavefront size is specified as a power of two, so a value of **n**
|
|
means a size of 2^ **n**.
|
|
- *call_convention* defaults to -1.
|
|
- *kernarg_segment_alignment*, *group_segment_alignment*, and
|
|
*private_segment_alignment* default to 4. Note that alignments are specified
|
|
as a power of 2, so a value of **n** means an alignment of 2^ **n**.
|
|
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
|
|
GFX90A onwards.
|
|
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
|
|
GFX10 onwards.
|
|
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
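
Several of the keys above (*wavefront_size* and the ``*_segment_alignment``
keys) hold the base-2 logarithm of the value they describe. A minimal Python
sketch of that encoding, using hypothetical helper names, is:

.. code-block:: python

  def decode_log2(n: int) -> int:
      """Return the size or alignment described by a key value of n."""
      return 1 << n

  def encode_log2(value: int) -> int:
      """Return the key value for a power-of-two size or alignment."""
      if value <= 0 or value & (value - 1):
          raise ValueError(f"{value} is not a power of two")
      return value.bit_length() - 1

So the default *wavefront_size* of 6 describes 64 lanes, a value of 5
describes 32 lanes, and the default alignment value of 4 describes an
alignment of 16.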
|
|
|
|
The *.amd_kernel_code_t* directive must be placed immediately after the
|
|
function label and before any instructions.
|
|
|
|
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
|
|
comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
|
|
|
|
.. _amdgpu-amdhsa-assembler-example-v2:
|
|
|
|
Code Object V2 Example Source Code
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
.. warning::
  Code Object V2 is not the default code object version emitted by
  this version of LLVM.
|
|
|
|
Here is an example of a minimal assembly source file, defining one HSA kernel:
|
|
|
|
.. code::
  :number-lines:

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      is_ptr64 = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      compute_pgm_rsrc1_wgp_mode = 0
      compute_pgm_rsrc1_mem_ordered = 0
      compute_pgm_rsrc1_fwd_progress = 1
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world
|
|
|
|
.. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
|
|
|
|
Code Object V3 and Above Predefined Symbols
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The AMDGPU assembler defines and updates some symbols automatically. These
|
|
symbols do not affect code generation.
|
|
|
|
.amdgcn.gfx_generation_number
|
|
+++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX major generation number of the target being assembled for. For
|
|
example, when assembling for a "GFX9" target this will be set to the integer
|
|
value "9". The possible GFX major generation numbers are presented in
|
|
:ref:`amdgpu-processors`.
|
|
|
|
.amdgcn.gfx_generation_minor
|
|
++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX minor generation number of the target being assembled for. For
|
|
example, when assembling for a "GFX810" target this will be set to the integer
|
|
value "1". The possible GFX minor generation numbers are presented in
|
|
:ref:`amdgpu-processors`.
|
|
|
|
.amdgcn.gfx_generation_stepping
|
|
+++++++++++++++++++++++++++++++
|
|
|
|
Set to the GFX stepping generation number of the target being assembled for.
|
|
For example, when assembling for a "GFX704" target this will be set to the
|
|
integer value "4". The possible GFX stepping generation numbers are presented
|
|
in :ref:`amdgpu-processors`.
|
|
|
|
.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
|
|
|
|
.amdgcn.next_free_vgpr
|
|
++++++++++++++++++++++
|
|
|
|
Set to zero before assembly begins. At each instruction, if the current value
|
|
of this symbol is less than or equal to the maximum VGPR number explicitly
|
|
referenced within that instruction then the symbol value is updated to equal
|
|
that VGPR number plus one.
|
|
|
|
May be used to set the `.amdhsa_next_free_vgpr` directive in
|
|
:ref:`amdhsa-kernel-directives-table`.
|
|
|
|
May be set at any time, e.g. manually set to zero at the start of each kernel.
|
|
|
|
.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
|
|
|
|
.amdgcn.next_free_sgpr
|
|
++++++++++++++++++++++
|
|
|
|
Set to zero before assembly begins. At each instruction, if the current value
|
|
of this symbol is less than or equal to the maximum SGPR number explicitly
|
|
referenced within that instruction then the symbol value is updated to equal
|
|
that SGPR number plus one.
|
|
|
|
May be used to set the `.amdhsa_next_free_sgpr` directive in
|
|
:ref:`amdhsa-kernel-directives-table`.
|
|
|
|
May be set at any time, e.g. manually set to zero at the start of each kernel.
|
|
|
|
.. _amdgpu-amdhsa-assembler-directives-v3-onwards:
|
|
|
|
Code Object V3 and Above Directives
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
|
|
architecture processors, and are not OS-specific. Directives which begin with
|
|
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
|
|
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
|
|
:ref:`amdgpu-processors`.
|
|
|
|
.. _amdgpu-assembler-directive-amdgcn-target:
|
|
|
|
.amdgcn_target <target-triple> "-" <target-id>
|
|
++++++++++++++++++++++++++++++++++++++++++++++
|
|
|
|
Optional directive which declares the ``<target-triple>-<target-id>`` supported
|
|
by the containing assembler source file. Used by the assembler to validate
|
|
command-line options such as ``-triple``, ``-mcpu``, and
|
|
``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
|
|
:ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
|
|
|
|
.. note::
|
|
|
|
The target ID syntax used for code object V2 to V3 for this directive differs
|
|
from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
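
The operand of ``.amdgcn_target`` is the four-component target triple, a
``-`` separator, and the target ID. The Python sketch below shows how such a
string decomposes. It is an illustration only: ``split_amdgcn_target`` and
``AmdgcnTarget`` are hypothetical names, and the sketch performs none of the
validation the assembler does.

.. code-block:: python

  from typing import NamedTuple

  class AmdgcnTarget(NamedTuple):
      architecture: str
      vendor: str
      os: str
      environment: str
      target_id: str

  def split_amdgcn_target(value: str) -> AmdgcnTarget:
      """Split '<arch>-<vendor>-<os>-<env>-<target-id>' into its parts."""
      # Split at most four times so that a '-' inside the target ID survives.
      parts = value.split("-", 4)
      if len(parts) != 5:
          raise ValueError(f"expected five '-' separated fields: {value!r}")
      return AmdgcnTarget(*parts)

For ``"amdgcn-amd-amdhsa--gfx900+xnack"`` the environment component is empty,
which is why two consecutive ``-`` characters appear before the target ID.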
|
|
|
|
.amdhsa_kernel <name>
|
|
+++++++++++++++++++++
|
|
|
|
Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
|
|
``<name>.kd``, in the current location of the current section. Only valid when
|
|
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
|
|
instruction to execute, and does not need to be previously defined.
|
|
|
|
Marks the beginning of a list of directives used to generate the bytes of a
|
|
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
|
|
Directives which may appear in this list are described in
|
|
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
|
|
be valid for the target being assembled for, and cannot be repeated. Directives
|
|
support the range of values specified by the field they reference in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
|
|
assumed to have its default value, unless it is marked as "Required", in which
|
|
case it is an error to omit the directive. This list of directives is
|
|
terminated by an ``.end_amdhsa_kernel`` directive.
|
|
|
|
.. table:: AMDHSA Kernel Assembler Directives
|
|
:name: amdhsa-kernel-directives-table
|
|
|
|
======================================================== =================== ============ ===================
|
|
Directive Default Supported On Description
|
|
======================================================== =================== ============ ===================
|
|
``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_kernarg_size`` 0 GFX6-GFX10 Controls KERNARG_SIZE in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_count`` 0 GFX6-GFX10 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
|
|
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in
|
|
Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
Specific
|
|
(wavefrontsize64)
|
|
``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
Possible values are defined in
|
|
:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
|
|
``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one.
|
|
Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one.
|
|
Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_accum_offset`` Required GFX90A Offset of the first AccVGPR in the unified register file.
|
|
Used to calculate ACCUM_OFFSET in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
|
|
``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR.
|
|
Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
|
|
scratch memory. Used to calculate
|
|
GRANULATED_WAVEFRONT_SGPR_COUNT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
|
|
Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
|
|
Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
(xnack)
|
|
``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
Possible values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
|
|
``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
Possible values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
|
|
``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
Possible values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
|
|
``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
Possible values are defined in
|
|
:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
|
|
``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_tg_split`` Target GFX90A Controls TG_SPLIT in
|
|
Feature :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
|
|
Specific
|
|
(tgsplit)
|
|
``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in
|
|
Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
|
|
Specific
|
|
(cumode)
|
|
``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
|
|
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
|
|
======================================================== =================== ============ ===================
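
The ordering rules for the directive list (any order, no repeats, "Required"
entries must be present) can be checked mechanically. The following Python
sketch is a hypothetical pre-check, not the assembler's own logic. It assumes
the lines between ``.amdhsa_kernel`` and ``.end_amdhsa_kernel`` are passed in,
and it treats only the two GFX6-GFX10 "Required" directives as mandatory by
default.

.. code-block:: python

  from typing import Iterable, List

  # Directives marked "Required" in the table above for GFX6-GFX10.
  # (.amdhsa_accum_offset is additionally marked "Required" on GFX90A.)
  REQUIRED = (".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr")

  def check_kernel_block(lines: Iterable[str], required=REQUIRED) -> List[str]:
      """Return the problems found in one kernel directive list."""
      seen = set()
      problems = []
      for line in lines:
          fields = line.split()
          if not fields or not fields[0].startswith(".amdhsa_"):
              continue  # skip blank lines and anything that is not a directive
          name = fields[0]
          if name in seen:
              problems.append(f"{name} is repeated")
          seen.add(name)
      for name in required:
          if name not in seen:
              problems.append(f"{name} is required but missing")
      return problems

An empty result means the list satisfies these two rules; value ranges and
target support still have to be checked against the table.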
|
|
|
|
.amdgpu_metadata
|
|
++++++++++++++++
|
|
|
|
Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
|
|
note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
|
|
|
|
The contents must be in the [YAML]_ markup format, with the same structure and
|
|
semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v4` or
|
|
:ref:`amdgpu-amdhsa-code-object-metadata-v5`.
|
|
|
|
This directive is terminated by an ``.end_amdgpu_metadata`` directive.
|
|
|
|
.. _amdgpu-amdhsa-assembler-example-v3-onwards:
|
|
|
|
Code Object V3 and Above Example Source Code
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Here is an example of a minimal assembly source file, defining one HSA kernel:
|
|
|
|
.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  .amdgpu_metadata
  ---
  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name: hello_world
      .symbol: hello_world.kd
      .kernarg_segment_size: 48
      .group_segment_fixed_size: 0
      .private_segment_fixed_size: 0
      .kernarg_segment_align: 4
      .wavefront_size: 64
      .sgpr_count: 2
      .vgpr_count: 3
      .max_flat_workgroup_size: 256
      .args:
        - .size: 8
          .offset: 0
          .value_kind: global_buffer
          .address_space: global
          .actual_access: write_only
  //...
  .end_amdgpu_metadata
|
|
|
|
This kernel is equivalent to the following HIP program:
|
|
|
|
.. code::
  :number-lines:

  __global__ void hello_world(float *p) {
    *p = 3.14159f;
  }
|
|
|
|
If an assembly source file contains multiple kernels and/or functions, the
|
|
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
|
|
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
|
|
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
|
|
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
|
|
to group the function with the kernel that calls it and reset the symbols
|
|
between the two connected components:
|
|
|
|
.. code::
  :number-lines:

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  // gpr tracking symbols are implicitly set to zero

  .text
  .globl kern0
  .p2align 8
  .type kern0,@function
  kern0:
    // ...
    s_endpgm
  .Lkern0_end:
    .size kern0, .Lkern0_end-kern0

  .rodata
  .p2align 6
  .amdhsa_kernel kern0
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

  // reset symbols to begin tracking usage in func1 and kern1
  .set .amdgcn.next_free_vgpr, 0
  .set .amdgcn.next_free_sgpr, 0

  .text
  .hidden func1
  .global func1
  .p2align 2
  .type func1,@function
  func1:
    // ...
    s_setpc_b64 s[30:31]
  .Lfunc1_end:
    .size func1, .Lfunc1_end-func1

  .globl kern1
  .p2align 8
  .type kern1,@function
  kern1:
    // ...
    s_getpc_b64 s[4:5]
    s_add_u32 s4, s4, func1@rel32@lo+4
    s_addc_u32 s5, s5, func1@rel32@hi+12
    s_swappc_b64 s[30:31], s[4:5]
    // ...
    s_endpgm
  .Lkern1_end:
    .size kern1, .Lkern1_end-kern1

  .rodata
  .p2align 6
  .amdhsa_kernel kern1
    // ...
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel
|
|
|
|
These symbols cannot identify connected components in order to automatically
|
|
track the usage for each kernel. However, in some cases careful organization of
|
|
the kernels and functions in the source file means there is minimal additional
|
|
effort required to accurately calculate GPR usage.
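
To make the effect of resetting the symbols concrete, the Python sketch below
models the two tracking symbols over a simplified source listing. It is a
rough model under stated assumptions, not the assembler's implementation:
``track_next_free`` is a hypothetical helper, only plain ``vN``, ``sN``,
``v[a:b]`` and ``s[a:b]`` operands are recognised, and the symbol values are
sampled at each ``.amdhsa_kernel`` line.

.. code-block:: python

  import re
  from typing import Dict, Iterable

  # Matches v3, s10, v[1:2], s[30:31]. Group 1 is the register file; groups
  # 2 and 3 hold the single index or the upper bound of a range.
  _REG = re.compile(r"\b([vs])(?:(\d+)|\[\d+:(\d+)\])")

  def track_next_free(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
      """Model .amdgcn.next_free_{v,s}gpr and sample them per kernel."""
      next_free = {"v": 0, "s": 0}  # implicitly zero before assembly begins
      result = {}
      for line in lines:
          text = line.split("//")[0].strip()
          if text.startswith(".set .amdgcn.next_free_"):
              # e.g. ".set .amdgcn.next_free_vgpr, 0"
              name, value = text[len(".set "):].split(",")
              next_free[name.strip()[-4]] = int(value)
          elif text.startswith(".amdhsa_kernel "):
              result[text.split()[1]] = dict(next_free)
          elif text and not text.startswith(".") and not text.endswith(":"):
              # An instruction: raise each symbol to (highest register + 1).
              for kind, single, upper in _REG.findall(text):
                  index = int(single or upper)
                  next_free[kind] = max(next_free[kind], index + 1)
      return result

Feeding this model a listing shaped like the example above shows that,
without the two ``.set`` lines, the values recorded for ``kern1`` would also
include registers used only by ``kern0``.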
|
|
|
|
Additional Documentation
|
|
========================
|
|
|
|
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
|
|
.. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
|
|
.. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
|
|
.. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
|
|
.. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
|
|
.. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
|
|
.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
|
|
.. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
|
|
.. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
|
|
.. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
|
|
.. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
|
|
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
|
|
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
|
|
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
|
|
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
|
|
.. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
|
|
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
|
|
.. [SEMVER] `Semantic Versioning <https://semver.org/>`__
|
|
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
|