[AMDGPU] Update gfx90a memory model support

Update AMDGPU gfx90a memory model to make coarse grain memory allocations
consistent when fine grained system scope atomic acquire and release is
performed.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105137
This commit is contained in:
Tony Tye 2021-05-07 20:55:23 +00:00
parent 801c2b9bba
commit 7f19aa73c2
7 changed files with 625 additions and 51 deletions

View File

@ -6093,10 +6093,10 @@ For GFX90A:
ensures a previous vector memory operation has completed before executing a
subsequent vector memory or LDS operation and so can be used to meet the
requirements of acquire and release.
* The L2 cache of one agent can be kept coherent with other agents by using
the MTYPE CC (cache-coherent) with the PTE C-bit for memory local to the L2,
and MTYPE UC (uncached) with the PTE C-bit set for memory not local to the
L2.
* The L2 cache of one agent can be kept coherent with other agents by:
using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
* Any local memory cache lines will be automatically invalidated by writes
from CUs associated with other L2 caches, or writes from the CPU, due to
@ -6108,13 +6108,21 @@ For GFX90A:
the CPU cache due to the L2 probe filter and and the PTE C-bit being set.
* Since all work-groups on the same agent share the same L2, no L2
invalidation or writeback is required for coherence.
* Since local memory reads and writes of work-groups in different agents
access memory using MTYPE CC, no L2 invalidate or writeback is required
for coherence. MTYPE CC causes write through to DRAM and local reads to be
invalidated by remote writes with with the PTE C-bit.
* Since remote memory reads and writes of work-groups in different agents
access memory using MTYPE UC, no L2 invalidate or writeback is required
for coherence. MTYPE UC causes direct accesses to DRAM.
* To ensure coherence of local and remote memory writes of work-groups in
different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
()used for remote coarse grain memory). Note that MTYPE CC (used for local
fine grain memory) causes write through to DRAM, and MTYPE UC (used for
remote fine grain memory) bypasses the L2, so both will never result in
dirty L2 cache lines.
* To ensure coherence of local and remote memory reads of work-groups in
different agents a ``buffer_invl2`` is required. It will invalidate L2
cache lines with MTYPE NC (used for remote coarse grain memory). Note that
MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
coarse memory) cause local reads to be invalidated by remote writes with
with the PTE C-bit so these cache lines are not invalidated. Note that
MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
never result in L2 cache lines that need to be invalidated.
* PCIe access from the GPU to the CPU memory is kept coherent by using the
MTYPE UC (uncached) which bypasses the L2.
@ -6384,14 +6392,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
2. s_waitcnt vmcnt(0)
- Must happen before
following
following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the load
has completed
before invalidating
the cache.
3. buffer_wbinvl1_vol
3. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -6401,7 +6410,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -6444,13 +6455,15 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
lgkmcnt(0).
- Must happen before
following
buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the flat_load
has completed
before invalidating
the caches.
3. buffer_wbinvl1_vol
3. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -6459,8 +6472,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
atomic/atomicrmw.
- Ensures that
following
L1 loads will not see
stale global data.
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -6579,7 +6594,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
2. s_waitcnt vmcnt(0)
- Must happen before
following
following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the
atomicrmw has
@ -6587,7 +6602,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
invalidating the
caches.
3. buffer_wbinvl1_vol
3. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -6597,8 +6613,10 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
MTYPE RW and CC L2 memory
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -6641,6 +6659,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
lgkmcnt(0).
- Must happen before
following
buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the
atomicrmw has
@ -6648,7 +6667,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
invalidating the
caches.
3. buffer_wbinvl1_vol
3. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -6658,7 +6678,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -6734,7 +6756,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
value read by the
fence-paired-atomic.
3. buffer_wbinvl1_vol
2. buffer_wbinvl1_vol
- If not TgSplit execution
mode, omit.
@ -6872,7 +6894,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
termed the
fence-paired-atomic).
- Must happen before
the following
the following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures that the
fence-paired atomic
@ -6887,7 +6909,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
the
fence-paired-atomic.
2. buffer_wbinvl1_vol
2. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before any
following global/generic
@ -6897,7 +6920,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -6991,8 +7016,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
released.
2. buffer/global/flat_store
store atomic release - system - global 1. s_waitcnt lgkmcnt(0) &
- generic vmcnt(0)
store atomic release - system - global 1. buffer_wbl2
- generic
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
omit lgkmcnt(0).
@ -7035,7 +7070,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
store that is being
released.
2. buffer/global/flat_store
3. buffer/global/flat_store
atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
- wavefront - generic
atomicrmw release - singlethread - local *If TgSplit execution mode,
@ -7123,8 +7158,18 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
is being released.
2. buffer/global/flat_atomic
atomicrmw release - system - global 1. s_waitcnt lgkmcnt(0) &
- generic vmcnt(0)
atomicrmw release - system - global 1. buffer_wbl2
- generic
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
omit lgkmcnt(0).
@ -7165,7 +7210,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
store that is being
released.
2. buffer/global/flat_atomic
3. buffer/global/flat_atomic
fence release - singlethread *none* *none*
- wavefront
fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
@ -7298,7 +7343,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
following
fence-paired-atomic.
fence release - system *none* 1. s_waitcnt lgkmcnt(0) &
fence release - system *none* 1. buffer_wbl2
- If OpenCL and
address space is
local, omit.
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
@ -7588,7 +7646,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
will not see stale
global data.
atomicrmw acq_rel - system - global 1. s_waitcnt lgkmcnt(0) &
atomicrmw acq_rel - system - global 1. buffer_wbl2
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
@ -7629,11 +7697,11 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
atomicrmw that is
being released.
2. buffer/global_atomic
3. s_waitcnt vmcnt(0)
3. buffer/global_atomic
4. s_waitcnt vmcnt(0)
- Must happen before
following
following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the
atomicrmw has
@ -7641,7 +7709,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
invalidating the
caches.
4. buffer_wbinvl1_vol
5. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -7651,7 +7720,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -7726,7 +7797,17 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
will not see stale
global data.
atomicrmw acq_rel - system - generic 1. s_waitcnt lgkmcnt(0) &
atomicrmw acq_rel - system - generic 1. buffer_wbl2
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
@ -7767,8 +7848,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
atomicrmw that is
being released.
2. flat_atomic
3. s_waitcnt vmcnt(0) &
3. flat_atomic
4. s_waitcnt vmcnt(0) &
lgkmcnt(0)
- If TgSplit execution mode,
@ -7776,7 +7857,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- If OpenCL, omit
lgkmcnt(0).
- Must happen before
following
following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures the
atomicrmw has
@ -7784,7 +7865,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
invalidating the
caches.
4. buffer_wbinvl1_vol
5. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -7794,7 +7876,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
@ -7902,7 +7986,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
the
acquire-fence-paired-atomic.
3. buffer_wbinvl1_vol
2. buffer_wbinvl1_vol
- If not TgSplit execution
mode, omit.
@ -8007,7 +8091,20 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
requirements of
acquire.
fence acq_rel - system *none* 1. s_waitcnt lgkmcnt(0) &
fence acq_rel - system *none* 1. buffer_wbl2
- If OpenCL and
address space is
local, omit.
- Must happen before
following s_waitcnt.
- Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
2. s_waitcnt lgkmcnt(0) &
vmcnt(0)
- If TgSplit execution mode,
@ -8048,7 +8145,7 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
atomic/store
atomic/atomicrmw.
- Must happen before
the following
the following buffer_invl2 and
buffer_wbinvl1_vol.
- Ensures that the
preceding
@ -8087,7 +8184,8 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
requirements of
release.
2. buffer_wbinvl1_vol
3. buffer_invl2;
buffer_wbinvl1_vol
- Must happen before
any following
@ -8098,7 +8196,9 @@ in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
- Ensures that
following
loads will not see
stale L1 global data.
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.

View File

@ -452,6 +452,12 @@ public:
SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace,
Position Pos) const override;
bool insertRelease(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace,
bool IsCrossAddrSpaceOrdering,
Position Pos) const override;
};
class SIGfx10CacheControl : public SIGfx7CacheControl {
@ -1265,9 +1271,26 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
bool Changed = false;
MachineBasicBlock &MBB = *MI->getParent();
DebugLoc DL = MI->getDebugLoc();
if (Pos == Position::AFTER)
++MI;
if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
switch (Scope) {
case SIAtomicScope::SYSTEM:
// Ensures that following loads will not see stale remote VMEM data or
// stale local VMEM data with MTYPE NC. Local VMEM data with MTYPE RW and
// CC will never be stale due to the local memory probes.
BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_INVL2));
// Inserting a "S_WAITCNT vmcnt(0)" after is not required because the
// hardware does not reorder memory operations by the same wave with
// respect to a preceding "BUFFER_INVL2". The invalidate is guaranteed to
// remove any cache lines of earlier writes by the same wave and ensures
// later reads by the same wave will refetch the cache lines.
Changed = true;
break;
case SIAtomicScope::AGENT:
// Same as GFX7.
break;
@ -1297,11 +1320,62 @@ bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
/// Other address spaces do not have a cache.
if (Pos == Position::AFTER)
--MI;
Changed |= SIGfx7CacheControl::insertAcquire(MI, Scope, AddrSpace, Pos);
return Changed;
}
bool SIGfx90ACacheControl::insertRelease(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace,
bool IsCrossAddrSpaceOrdering,
Position Pos) const {
bool Changed = false;
MachineBasicBlock &MBB = *MI->getParent();
DebugLoc DL = MI->getDebugLoc();
if (Pos == Position::AFTER)
++MI;
if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
switch (Scope) {
case SIAtomicScope::SYSTEM:
// Inserting a "S_WAITCNT vmcnt(0)" before is not required because the
// hardware does not reorder memory operations by the same wave with
// respect to a following "BUFFER_WBL2". The "BUFFER_WBL2" is guaranteed
// to initiate writeback of any dirty cache lines of earlier writes by the
// same wave. A "S_WAITCNT vmcnt(0)" is needed after to ensure the
// writeback has completed.
BuildMI(MBB, MI, DL, TII->get(AMDGPU::BUFFER_WBL2));
// Followed by same as GFX7, which will ensure the necessary "S_WAITCNT
// vmcnt(0)" needed by the "BUFFER_WBL2".
Changed = true;
break;
case SIAtomicScope::AGENT:
case SIAtomicScope::WORKGROUP:
case SIAtomicScope::WAVEFRONT:
case SIAtomicScope::SINGLETHREAD:
// Same as GFX7.
break;
default:
llvm_unreachable("Unsupported synchronization scope");
}
}
if (Pos == Position::AFTER)
--MI;
Changed |=
SIGfx7CacheControl::insertRelease(MI, Scope, AddrSpace,
IsCrossAddrSpaceOrdering, Pos);
return Changed;
}
bool SIGfx10CacheControl::enableLoadCacheBypass(
const MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,

View File

@ -424,9 +424,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat(double addrspace(1)*
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: v_mov_b32_e32 v4, 0
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
@ -470,9 +472,11 @@ define amdgpu_kernel void @global_atomic_fadd_f64_noret_pat_system(double addrsp
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: v_mov_b32_e32 v4, 0
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap_x2 v[0:1], v4, v[0:3], s[0:1] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
@ -526,9 +530,11 @@ define double @global_atomic_fadd_f64_rtn_pat(double addrspace(1)* %ptr, double
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
; GFX90A-NEXT: v_add_f64 v[2:3], v[4:5], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
@ -571,9 +577,11 @@ define double @global_atomic_fadd_f64_rtn_pat_system(double addrspace(1)* %ptr,
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
; GFX90A-NEXT: v_add_f64 v[2:3], v[4:5], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5], off glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
@ -655,9 +663,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat(double* %ptr) #1 {
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
@ -702,9 +712,11 @@ define amdgpu_kernel void @flat_atomic_fadd_f64_noret_pat_system(double* %ptr) #
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_add_f64 v[0:1], v[2:3], 4.0
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], s[0:1], s[0:1] op_sel:[0,1]
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
@ -730,9 +742,11 @@ define double @flat_atomic_fadd_f64_rtn_pat(double* %ptr) #1 {
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
; GFX90A-NEXT: v_add_f64 v[2:3], v[4:5], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]
; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
@ -775,9 +789,11 @@ define double @flat_atomic_fadd_f64_rtn_pat_system(double* %ptr) #1 {
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_pk_mov_b32 v[4:5], v[2:3], v[2:3] op_sel:[0,1]
; GFX90A-NEXT: v_add_f64 v[2:3], v[4:5], 4.0
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[2:3], v[0:1], v[2:5] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[2:3], v[4:5]

View File

@ -70,9 +70,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32(float addrspace(1)* %ptr)
; GFX90A-NEXT: v_mov_b32_e32 v1, v0
; GFX90A-NEXT: v_mov_b32_e32 v2, 0
; GFX90A-NEXT: v_add_f32_e32 v0, 4.0, v1
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]
@ -527,9 +529,11 @@ define amdgpu_kernel void @global_atomic_fadd_ret_f32_system(float addrspace(1)*
; GFX90A-NEXT: v_mov_b32_e32 v1, v0
; GFX90A-NEXT: v_mov_b32_e32 v2, 0
; GFX90A-NEXT: v_add_f32_e32 v0, 4.0, v1
; GFX90A-NEXT: buffer_wbl2
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: global_atomic_cmpswap v0, v2, v[0:1], s[0:1] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v0, v1
; GFX90A-NEXT: s_or_b64 s[2:3], vcc, s[2:3]

View File

@ -1275,13 +1275,17 @@ define amdgpu_kernel void @system_acquire_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_acquire_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_acquire_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1319,11 +1323,13 @@ define amdgpu_kernel void @system_release_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_release_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_release_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1367,13 +1373,17 @@ define amdgpu_kernel void @system_acq_rel_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_acq_rel_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_acq_rel_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1417,13 +1427,17 @@ define amdgpu_kernel void @system_seq_cst_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_seq_cst_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_seq_cst_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1467,13 +1481,17 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acquire_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_one_as_acquire_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1511,11 +1529,13 @@ define amdgpu_kernel void @system_one_as_release_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_one_as_release_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_one_as_release_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1559,13 +1579,17 @@ define amdgpu_kernel void @system_one_as_acq_rel_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_one_as_acq_rel_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_one_as_acq_rel_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:
@ -1609,13 +1633,17 @@ define amdgpu_kernel void @system_one_as_seq_cst_fence() {
;
; GFX90A-NOTTGSPLIT-LABEL: system_one_as_seq_cst_fence:
; GFX90A-NOTTGSPLIT: ; %bb.0: ; %entry
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbl2
; GFX90A-NOTTGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-NOTTGSPLIT-NEXT: buffer_invl2
; GFX90A-NOTTGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-NOTTGSPLIT-NEXT: s_endpgm
;
; GFX90A-TGSPLIT-LABEL: system_one_as_seq_cst_fence:
; GFX90A-TGSPLIT: ; %bb.0: ; %entry
; GFX90A-TGSPLIT-NEXT: buffer_wbl2
; GFX90A-TGSPLIT-NEXT: s_waitcnt vmcnt(0)
; GFX90A-TGSPLIT-NEXT: buffer_invl2
; GFX90A-TGSPLIT-NEXT: buffer_wbinvl1_vol
; GFX90A-TGSPLIT-NEXT: s_endpgm
entry:

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff