[AMDGPU] Cleanup AMDGPUUsage.rst
- Layout and typo improvements.
- Add memory spaces section.
- reStructure syntax fixes.

Differential Revision: https://reviews.llvm.org/D90002
parent
d590c85430
commit
bf6518a806
|
@ -96,45 +96,45 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
.. table:: AMDGPU Processors
|
||||
:name: amdgpu-processor-table
|
||||
|
||||
=========== =============== ============ ===== ================= ======= ======================
|
||||
=========== =============== ============ ===== ============================= ======= ======================
|
||||
Processor Alternative Target dGPU/ Target ROCm Example
|
||||
Processor Triple APU Features Support Products
|
||||
Architecture Supported
|
||||
[Default]
|
||||
=========== =============== ============ ===== ================= ======= ======================
|
||||
=========== =============== ============ ===== ============================= ======= ======================
|
||||
**Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``r600`` ``r600`` dGPU
|
||||
``r630`` ``r600`` dGPU
|
||||
``rs880`` ``r600`` dGPU
|
||||
``rv670`` ``r600`` dGPU
|
||||
**Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``rv710`` ``r600`` dGPU
|
||||
``rv730`` ``r600`` dGPU
|
||||
``rv770`` ``r600`` dGPU
|
||||
**Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``cedar`` ``r600`` dGPU
|
||||
``cypress`` ``r600`` dGPU
|
||||
``juniper`` ``r600`` dGPU
|
||||
``redwood`` ``r600`` dGPU
|
||||
``sumo`` ``r600`` dGPU
|
||||
**Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``barts`` ``r600`` dGPU
|
||||
``caicos`` ``r600`` dGPU
|
||||
``cayman`` ``r600`` dGPU
|
||||
``turks`` ``r600`` dGPU
|
||||
**GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
|
||||
``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
|
||||
- ``verde``
|
||||
``gfx602`` - ``hainan`` ``amdgcn`` dGPU
|
||||
- ``oland``
|
||||
**GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
|
||||
- A6 Pro-7050B
|
||||
- A8-7100
|
||||
|
@ -166,9 +166,15 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
- Radeon HD 8770
|
||||
- R7 260
|
||||
- R7 260X
|
||||
``gfx705`` ``amdgcn`` APU
|
||||
``gfx705`` ``amdgcn`` APU *TBA*
|
||||
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
**GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
|
||||
[on] - Pro A6-8500B
|
||||
- A8-8600P
|
||||
|
@ -206,10 +212,15 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
- FirePro W7100
|
||||
- Mobile FirePro
|
||||
M7170
|
||||
``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
|
||||
``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack *TBA*
|
||||
[on]
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
**GCN GFX9** [AMD-GCN-GFX9]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
|
||||
[off] Frontier Edition
|
||||
- Radeon RX Vega 56
|
||||
|
@ -222,8 +233,10 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
|
||||
[off]
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50
|
||||
[off] - Radeon Instinct MI60
|
||||
- sram-ecc - Radeon VII
|
||||
|
@ -233,15 +246,19 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
- sram-ecc
|
||||
[on]
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
``gfx909`` ``amdgcn`` APU - xnack *TBA*
|
||||
[on]
|
||||
[off]
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
**GCN GFX10** [AMD-GCN-GFX10]_
|
||||
-----------------------------------------------------------------------------------------------
|
||||
-----------------------------------------------------------------------------------------------------------
|
||||
``gfx1010`` ``amdgcn`` dGPU - xnack - Radeon RX 5700
|
||||
[off] - Radeon RX 5700 XT
|
||||
- wavefrontsize64 - Radeon Pro 5600 XT
|
||||
|
@ -254,9 +271,11 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
[off]
|
||||
- cumode
|
||||
[off]
|
||||
.. TODO
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
``gfx1012`` ``amdgcn`` dGPU - xnack - Radeon RX 5500
|
||||
[off] - Radeon RX 5500 XT
|
||||
- wavefrontsize64
|
||||
|
@ -267,24 +286,30 @@ names from both the *Processor* and *Alternative Processor* can be used.
|
|||
[off]
|
||||
- cumode
|
||||
[off]
|
||||
.. TODO
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
``gfx1031`` ``amdgcn`` dGPU - wavefrontsize64 *TBA*
|
||||
[off]
|
||||
- cumode
|
||||
[off]
|
||||
.. TODO
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
|
||||
``gfx1032`` ``amdgcn`` dGPU - wavefrontsize64 *TBA*
|
||||
[off]
|
||||
- cumode
|
||||
[off]
|
||||
.. TODO
|
||||
.. TODO::
|
||||
|
||||
Add product
|
||||
names.
|
||||
=========== =============== ============ ===== ================= ======= ======================
|
||||
|
||||
=========== =============== ============ ===== ============================= ======= ======================
|
||||
|
||||
.. _amdgpu-target-features:
|
||||
|
||||
|
@ -782,10 +807,10 @@ The AMDGPU backend uses the following ELF header:
|
|||
.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
|
||||
:name: amdgpu-ef-amdgpu-mach-table
|
||||
|
||||
================================= ========== =============================
|
||||
==================================== ========== =============================
|
||||
Name Value Description (see
|
||||
:ref:`amdgpu-processor-table`)
|
||||
================================= ========== =============================
|
||||
==================================== ========== =============================
|
||||
``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
|
||||
``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
|
||||
``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
|
||||
|
@ -834,7 +859,7 @@ The AMDGPU backend uses the following ELF header:
|
|||
``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
|
||||
``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
|
||||
``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
|
||||
================================= ========== =============================
|
||||
==================================== ========== =============================
|
||||
|
||||
Sections
|
||||
--------
|
||||
|
@ -922,8 +947,8 @@ Code Object V2 Note Records (--amdhsa-code-object-version=2)
|
|||
default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
|
||||
|
||||
The AMDGPU backend code object uses the following ELF note record in the
|
||||
``.note`` section when compiling for Code Object
|
||||
V2 (--amdhsa-code-object-version=2).
|
||||
``.note`` section when compiling for Code Object V2
|
||||
(--amdhsa-code-object-version=2).
|
||||
|
||||
Additional note records may be present, but any which are not documented here
|
||||
are deprecated and should not be used.
|
||||
|
@ -2359,12 +2384,14 @@ non-AMD key names should be prefixed by "*vendor-name*.".
|
|||
- "Region"
|
||||
|
||||
.. TODO::
|
||||
|
||||
Is GlobalBuffer only Global
|
||||
or Constant? Is
|
||||
DynamicSharedPointer always
|
||||
Local? Can HCC allow Generic?
|
||||
How can Private or Region
|
||||
ever happen?
|
||||
|
||||
"AccQual" string Kernel argument access
|
||||
qualifier. Only present if
|
||||
"ValueKind" is "Image" or
|
||||
|
@ -2376,8 +2403,10 @@ non-AMD key names should be prefixed by "*vendor-name*.".
|
|||
- "ReadWrite"
|
||||
|
||||
.. TODO::
|
||||
|
||||
Does this apply to
|
||||
GlobalBuffer?
|
||||
|
||||
"ActualAccQual" string The actual memory accesses
|
||||
performed by the kernel on the
|
||||
kernel argument. Only present if
|
||||
|
@ -2415,8 +2444,10 @@ non-AMD key names should be prefixed by "*vendor-name*.".
|
|||
if "ValueKind" is "Pipe".
|
||||
|
||||
.. TODO::
|
||||
|
||||
Can GlobalBuffer be pipe
|
||||
qualified?
|
||||
|
||||
================= ============== ========= ================================
|
||||
|
||||
..
|
||||
|
@ -2838,12 +2869,14 @@ same *vendor-name*.
|
|||
- "region"
|
||||
|
||||
.. TODO::
|
||||
|
||||
Is "global_buffer" only "global"
|
||||
or "constant"? Is
|
||||
"dynamic_shared_pointer" always
|
||||
"local"? Can HCC allow "generic"?
|
||||
How can "private" or "region"
|
||||
ever happen?
|
||||
|
||||
".access" string Kernel argument access
|
||||
qualifier. Only present if
|
||||
".value_kind" is "image" or
|
||||
|
@ -2855,8 +2888,10 @@ same *vendor-name*.
|
|||
- "read_write"
|
||||
|
||||
.. TODO::
|
||||
|
||||
Does this apply to
|
||||
"global_buffer"?
|
||||
|
||||
".actual_access" string The actual memory accesses
|
||||
performed by the kernel on the
|
||||
kernel argument. Only present if
|
||||
|
@ -2894,8 +2929,10 @@ same *vendor-name*.
|
|||
if ".value_kind" is "pipe".
|
||||
|
||||
.. TODO::
|
||||
|
||||
Can "global_buffer" be pipe
|
||||
qualified?
|
||||
|
||||
====================== ============== ========= ================================
|
||||
|
||||
..
|
||||
|
@ -2903,12 +2940,12 @@ same *vendor-name*.
|
|||
Kernel Dispatch
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
The HSA architected queuing language (AQL) defines a user space memory
|
||||
interface that can be used to control the dispatch of kernels, in an agent
|
||||
independent way. An agent can have zero or more AQL queues created for it using
|
||||
the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be
|
||||
placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the
|
||||
AQL queue mechanics and packet layouts.
|
||||
The HSA architected queuing language (AQL) defines a user space memory interface
|
||||
that can be used to control the dispatch of kernels, in an agent independent
|
||||
way. An agent can have zero or more AQL queues created for it using the ROCm
|
||||
runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
|
||||
*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
|
||||
mechanics and packet layouts.
|
||||
|
||||
The packet processor of a kernel agent is responsible for detecting and
|
||||
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
|
||||
|
@ -2965,6 +3002,86 @@ CPU host program, or from an HSA kernel executing on a GPU.
|
|||
10. When the kernel dispatch has completed execution, CP signals the completion
|
||||
signal specified in the kernel dispatch packet if not 0.
|
||||
|
||||
.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================
|
||||
|
||||
The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX10.
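
The private-to-physical mapping above can be illustrated with a small sketch.
It is not backend code: the function name, the base address and the wavefront
size are hypothetical, and ``private_address`` is assumed to count dwords,
because the formula as written multiplies it by 4 and has no byte offset term.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size):
    """Apply the mapping exactly as written in the text.

    Assumption: private_address is a dword index, not a byte address.
    """
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


BASE = 0x1000        # hypothetical wavefront scratch base
WAVEFRONT_SIZE = 64  # hypothetical; pass the actual wavefront size in use

# Every lane reads the same private address 0: the physical addresses are
# adjacent dwords, which is why fewer cache lines need to be fetched.
same_address = [scratch_physical_address(BASE, 0, lane, WAVEFRONT_SIZE)
                for lane in range(4)]

# One lane stepping through consecutive private addresses strides by
# wavefront_size * 4 bytes.
lane0_stride = (scratch_physical_address(BASE, 1, 0, WAVEFRONT_SIZE)
                - scratch_physical_address(BASE, 0, 0, WAVEFRONT_SIZE))
```

Under these assumptions, lanes of one wavefront that use the same private
address touch consecutive dwords, while a single lane walking its own private
addresses jumps by a full wavefront row each step.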
|
||||
|
||||
The generic address space uses the hardware flat address support available in
GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX10 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
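
A minimal sketch of why the 2^32 size and alignment simplify the conversion in
64 bit address mode. The helper names and the aperture base value are
hypothetical and do not come from the backend, and NULL pointer translation
(for example local NULL ``0xFFFFFFFF`` versus generic NULL ``0``) is
deliberately left out.

```python
APERTURE_SIZE = 1 << 32  # aperture size stated for 64 bit address mode


def in_aperture(aperture_base, flat_address):
    """True if flat_address lies inside the aperture starting at aperture_base."""
    # With a 2^32-aligned base, membership is a comparison of the upper bits.
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside the aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR equals addition.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address in the aperture."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside this aperture")
    return flat_address & (APERTURE_SIZE - 1)


# Hypothetical aperture base, chosen only because it is aligned to 2^32.
EXAMPLE_BASE = 0x0000_1000_0000_0000
flat = segment_to_flat(EXAMPLE_BASE, 0x1234)
```

Because the base has zero low bits, both directions reduce to masking or
combining the upper and lower 32-bit halves, with no subtraction or range
arithmetic needed.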
|
||||
|
||||
Image and Samplers
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
@ -3635,7 +3752,7 @@ SGPR register initial state is defined in
|
|||
First Private Segment Buffer 4 V# that can be used, together
|
||||
(enable_sgpr_private with Scratch Wavefront Offset
|
||||
_segment_buffer) as an offset, to access the
|
||||
private address space using a
|
||||
private memory space using a
|
||||
segment address.
|
||||
|
||||
CP uses the value provided by
|
||||
|
@ -3835,13 +3952,13 @@ VGPR register initial state is defined in
|
|||
(kernel descriptor enable of
|
||||
field) VGPRs
|
||||
========== ========================== ====== ==============================
|
||||
First Work-Item Id X 1 32-bit work item id in X
|
||||
First Work-Item Id X 1 32-bit work-item id in X
|
||||
(Always initialized) dimension of work-group for
|
||||
wavefront lane.
|
||||
then Work-Item Id Y 1 32-bit work item id in Y
|
||||
then Work-Item Id Y 1 32-bit work-item id in Y
|
||||
(enable_vgpr_workitem_id dimension of work-group for
|
||||
> 0) wavefront lane.
|
||||
then Work-Item Id Z 1 32-bit work item id in Z
|
||||
then Work-Item Id Z 1 32-bit work-item id in Z
|
||||
(enable_vgpr_workitem_id dimension of work-group for
|
||||
> 1) wavefront lane.
|
||||
========== ========================== ====== ==============================
|
||||
|
@ -4100,7 +4217,7 @@ For GFX6-GFX9:
|
|||
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
|
||||
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
|
||||
scalar operations are used in a restricted way so do not impact the memory
|
||||
model. See :ref:`amdgpu-address-spaces`.
|
||||
model. See :ref:`amdgpu-amdhsa-memory-spaces`.
|
||||
* The vector and scalar memory operations use an L2 cache shared by all CUs on
|
||||
the same agent.
|
||||
* The L2 cache has independent channels to service disjoint ranges of virtual
|
||||
|
@ -4155,7 +4272,7 @@ For GFX10:
|
|||
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
|
||||
on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
|
||||
operations are used in a restricted way so do not impact the memory model. See
|
||||
:ref:`amdgpu-address-spaces`.
|
||||
:ref:`amdgpu-amdhsa-memory-spaces`.
|
||||
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
|
||||
the same SA. Therefore, no special action is required for coherence between
|
||||
the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
|
||||
|
@ -4220,7 +4337,7 @@ variables. Therefore, the kernel machine code does not have to maintain the
|
|||
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
|
||||
and vector L1 caches are invalidated between kernel dispatches by CP since
|
||||
constant address space data may change between kernel dispatch executions. See
|
||||
:ref:`amdgpu-address-spaces`.
|
||||
:ref:`amdgpu-amdhsa-memory-spaces`.
|
||||
|
||||
The one exception is if scalar writes are used to spill SGPR registers. In this
|
||||
case the AMDGPU backend ensures the memory location used to spill is never
|
||||
|
|