2016-02-13 07:45:29 +08:00
|
|
|
; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
|
|
|
|
; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
|
[AMDGPU] Simplify VCCZ bug handling
Summary:
VCCZBugHandledSet was used to make sure we don't apply the same
workaround more than once to a single cbranch instruction, but it's not
necessary because the workaround involves inserting an s_waitcnt
instruction, which is enough for subsequent iterations to detect that no
further workaround is necessary.
Also beef up the test case to check that the workaround was only applied
once. I have also manually verified that the test still passes even if I
hack the big do-while loop in runOnMachineFunction to run a minimum of
five iterations.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69621
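The idempotence argument in the summary above can be illustrated with a toy model. This is only a sketch, not the code of the waitcnt insertion pass: it assumes a straight-line list of instruction strings, treats any s_load* as a pending SMEM access, and uses a hypothetical helper name, apply_vccz_workaround. The inserted sequence mirrors what the VCCZ-BUG checks in this test expect.

```python
# Toy model (assumption): instructions are plain strings, control flow is ignored.
VCCZ_BRANCHES = ("s_cbranch_vccz", "s_cbranch_vccnz")


def apply_vccz_workaround(insts):
    """Run one pass; insert the workaround before a vccz branch if SMEM is pending."""
    out = []
    smem_pending = False
    for inst in insts:
        opcode = inst.split()[0]
        if opcode.startswith("s_load"):
            # A scalar load is now in flight.
            smem_pending = True
        elif opcode == "s_waitcnt" and "lgkmcnt(0)" in inst:
            # An existing wait retires every outstanding scalar load.
            smem_pending = False
        elif opcode in VCCZ_BRANCHES and smem_pending:
            # Workaround: wait for SMEM, then rewrite vcc to refresh vccz.
            out.append("s_waitcnt lgkmcnt(0)")
            out.append("s_mov_b64 vcc, vcc")
            smem_pending = False
        out.append(inst)
    return out


program = [
    "s_load_dword s0, s[4:5], 0x0",
    "v_cmp_neq_f32_e64 vcc, s1, 0",
    "s_cbranch_vccnz BB0_2",
]
once = apply_vccz_workaround(program)
twice = apply_vccz_workaround(once)  # second iteration sees the s_waitcnt
```

In this model the second pass changes nothing, because the s_waitcnt inserted by the first pass clears the pending state before the branch is reached. No separate "already handled" set is needed.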
2019-10-30 21:47:32 +08:00
|
|
|
; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
|
2016-02-13 07:45:29 +08:00
|
|
|
|
|
|
|
; GCN-LABEL: {{^}}vccz_workaround:
|
2020-02-25 22:38:57 +08:00
|
|
|
; GCN: s_load_dword [[REG:s[0-9]+]], s[{{[0-9]+:[0-9]+}}],
|
|
|
|
; GCN: v_cmp_neq_f32_e64 {{[^,]*}}, [[REG]], 0{{$}}
|
AMDGPU/InsertWaitcnts: Untangle some semi-global state
Summary:
Reduce the statefulness of the algorithm in two ways:
1. More clearly split generateWaitcntInstBefore into two phases: the
first one which determines the required wait, if any, without changing
the ScoreBrackets, and the second one which actually inserts the wait
and updates the brackets.
2. Communicate pre-existing s_waitcnt instructions using an argument to
generateWaitcntInstBefore instead of through the ScoreBrackets.
To simplify these changes, a Waitcnt structure is introduced which carries
the counts of an s_waitcnt instruction in decoded form.
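As a rough illustration of what "decoded form" can mean here, the sketch below models a wait as one optional value per counter instead of a packed immediate. The field names, the use of None for "no wait on this counter", and the combined helper are assumptions made for this example. They are not taken from the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Waitcnt:
    """Decoded s_waitcnt operands; None means no wait on that counter (assumption)."""
    vmcnt: Optional[int] = None
    expcnt: Optional[int] = None
    lgkmcnt: Optional[int] = None

    def has_wait(self):
        # True if at least one counter is actually waited on.
        return any(c is not None for c in (self.vmcnt, self.expcnt, self.lgkmcnt))

    def combined(self, other):
        # The stricter wait per counter is the smaller count.
        def stricter(a, b):
            if a is None:
                return b
            if b is None:
                return a
            return min(a, b)

        return Waitcnt(
            stricter(self.vmcnt, other.vmcnt),
            stricter(self.expcnt, other.expcnt),
            stricter(self.lgkmcnt, other.lgkmcnt),
        )


pre_existing = Waitcnt(lgkmcnt=0)         # e.g. an s_waitcnt already in the code
required = Waitcnt(vmcnt=1, lgkmcnt=2)    # what the next instruction needs
merged = pre_existing.combined(required)  # Waitcnt(vmcnt=1, expcnt=None, lgkmcnt=0)
```

Passing a value like pre_existing as an argument, rather than recording it in shared state, is the kind of interface change that item 2 above describes.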
There are some functional changes:
1. The FIXME for the VCCZ bug workaround was implemented: we only wait for
SMEM instructions as required instead of waiting on all counters.
2. We now properly track pre-existing waitcnt's in all cases, which leads
to less conservative waitcnts being emitted in some cases.
     s_load_dword ...
     s_waitcnt lgkmcnt(0)    <-- pre-existing wait count
     ds_read_b32 v0, ...
     ds_read_b32 v1, ...
     s_waitcnt lgkmcnt(0)    <-- this is too conservative
     use(v0)
     more code
     use(v1)
This increases code size a bit, but the reduced latency should still be a
win in basically all cases. The worst code size regressions in my shader-db
are:
WORST REGRESSIONS - Code Size
Before  After  Delta  Percentage
  1724   1736     12      0.70 %  shaders/private/f1-2015/1334.shader_test [0]
  2276   2284      8      0.35 %  shaders/private/f1-2015/1306.shader_test [0]
  4632   4640      8      0.17 %  shaders/private/ue4_elemental/62.shader_test [0]
  2376   2384      8      0.34 %  shaders/private/f1-2015/1308.shader_test [0]
  3284   3292      8      0.24 %  shaders/private/talos_principle/1955.shader_test [0]
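To make the ds_read_b32 example above concrete, the following sketch computes the loosest lgkmcnt that still guarantees a given result has arrived. It assumes the outstanding operations complete in issue order, which is the premise that makes a non-zero count safe. The function name required_lgkmcnt and the list-of-registers representation are hypothetical.

```python
def required_lgkmcnt(pending, reg):
    """Return the largest lgkmcnt that still guarantees `reg` is ready.

    `pending` lists destination registers of outstanding operations,
    oldest first, and is assumed to complete in that order.
    Returns None when `reg` has no outstanding operation.
    """
    if reg not in pending:
        return None
    # Operations issued after `reg` may remain outstanding.
    return len(pending) - 1 - pending.index(reg)


# After the pre-existing s_waitcnt lgkmcnt(0), only the two LDS reads are pending.
pending = ["v0", "v1"]
wait_before_use_v0 = required_lgkmcnt(pending, "v0")  # 1
wait_before_use_v1 = required_lgkmcnt(pending, "v1")  # 0
```

Under that assumption, use(v0) needs only lgkmcnt(1) and the full lgkmcnt(0) can be deferred until use(v1). That adds a second s_waitcnt, which is consistent with the small code size increases listed above.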
Reviewers: msearles, rampitec, scott.linder, kanarayan
Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits, hakzsam
Differential Revision: https://reviews.llvm.org/D54226
llvm-svn: 347848
2018-11-29 19:06:06 +08:00
|
|
|
; VCCZ-BUG: s_waitcnt lgkmcnt(0)
|
2016-02-13 07:45:29 +08:00
|
|
|
; VCCZ-BUG: s_mov_b64 vcc, vcc
|
2019-10-30 21:47:32 +08:00
|
|
|
; GCN-NOT: s_mov_b64 vcc, vcc
|
2016-02-13 07:45:29 +08:00
|
|
|
; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
|
|
|
|
; GCN: buffer_store_dword
|
|
|
|
; GCN: [[EXIT]]:
|
|
|
|
; GCN: s_endpgm
|
2018-02-14 02:00:25 +08:00
|
|
|
define amdgpu_kernel void @vccz_workaround(i32 addrspace(4)* %in, i32 addrspace(1)* %out, float %cond) {
|
2016-02-13 07:45:29 +08:00
|
|
|
entry:
|
|
|
|
%cnd = fcmp oeq float 0.0, %cond
|
2018-02-14 02:00:25 +08:00
|
|
|
%sgpr = load volatile i32, i32 addrspace(4)* %in
|
2016-02-13 07:45:29 +08:00
|
|
|
br i1 %cnd, label %if, label %endif
|
|
|
|
|
|
|
|
if:
|
|
|
|
store i32 %sgpr, i32 addrspace(1)* %out
|
|
|
|
br label %endif
|
|
|
|
|
|
|
|
endif:
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
|
|
|
; GCN-LABEL: {{^}}vccz_noworkaround:
|
|
|
|
; GCN: v_cmp_neq_f32_e32 vcc, 0, v{{[0-9]+}}
|
2019-10-30 21:47:32 +08:00
|
|
|
; GCN-NOT: s_waitcnt lgkmcnt(0)
|
|
|
|
; GCN-NOT: s_mov_b64 vcc, vcc
|
2016-02-13 07:45:29 +08:00
|
|
|
; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
|
|
|
|
; GCN: buffer_store_dword
|
|
|
|
; GCN: [[EXIT]]:
|
|
|
|
; GCN: s_endpgm
|
2017-03-22 05:39:51 +08:00
|
|
|
define amdgpu_kernel void @vccz_noworkaround(float addrspace(1)* %in, float addrspace(1)* %out) {
|
2016-02-13 07:45:29 +08:00
|
|
|
entry:
|
|
|
|
%vgpr = load volatile float, float addrspace(1)* %in
|
|
|
|
%cnd = fcmp oeq float 0.0, %vgpr
|
|
|
|
br i1 %cnd, label %if, label %endif
|
|
|
|
|
|
|
|
if:
|
|
|
|
store float %vgpr, float addrspace(1)* %out
|
|
|
|
br label %endif
|
|
|
|
|
|
|
|
endif:
|
|
|
|
ret void
|
|
|
|
}
|