; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VCCZ-BUG %s
; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
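
; The VCCZ-BUG runs (verde and bonaire) are affected by the hardware VCCZ bug:
; if the branch condition is written to vcc while an SMEM load is still
; outstanding, VCCZ may be stale when the branch reads it. The expected
; workaround is an s_waitcnt lgkmcnt(0) followed by an s_mov_b64 vcc, vcc
; immediately before the branch, and the GCN-NOT line below checks that the
; workaround is applied at most once. Tonga does not need the workaround, so
; it only runs the common GCN checks.
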
; GCN-LABEL: {{^}}vccz_workaround:
; GCN: s_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x0
; GCN: v_cmp_neq_f32_e64 {{[^,]*}}, s{{[0-9]+}}, 0{{$}}
; VCCZ-BUG: s_waitcnt lgkmcnt(0)
; VCCZ-BUG: s_mov_b64 vcc, vcc
; GCN-NOT: s_mov_b64 vcc, vcc
; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
; GCN: buffer_store_dword
; GCN: [[EXIT]]:
; GCN: s_endpgm
define amdgpu_kernel void @vccz_workaround(i32 addrspace(4)* %in, i32 addrspace(1)* %out, float %cond) {
entry:
  %cnd = fcmp oeq float 0.0, %cond
  %sgpr = load volatile i32, i32 addrspace(4)* %in
  br i1 %cnd, label %if, label %endif

if:
  store i32 %sgpr, i32 addrspace(1)* %out
  br label %endif

endif:
  ret void
}
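
; In vccz_noworkaround the branch condition comes from a VALU compare into vcc
; of a value loaded into a VGPR (tracked by vmcnt rather than lgkmcnt), so no
; SMEM load should be outstanding when the branch reads VCCZ and no workaround
; is expected on any target; the GCN-NOT lines verify that neither the extra
; s_waitcnt lgkmcnt(0) nor the s_mov_b64 vcc, vcc is emitted.
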
; GCN-LABEL: {{^}}vccz_noworkaround:
; GCN: v_cmp_neq_f32_e32 vcc, 0, v{{[0-9]+}}
; GCN-NOT: s_waitcnt lgkmcnt(0)
; GCN-NOT: s_mov_b64 vcc, vcc
; GCN: s_cbranch_vccnz [[EXIT:[0-9A-Za-z_]+]]
; GCN: buffer_store_dword
; GCN: [[EXIT]]:
; GCN: s_endpgm
define amdgpu_kernel void @vccz_noworkaround(float addrspace(1)* %in, float addrspace(1)* %out) {
entry:
  %vgpr = load volatile float, float addrspace(1)* %in
  %cnd = fcmp oeq float 0.0, %vgpr
  br i1 %cnd, label %if, label %endif

if:
  store float %vgpr, float addrspace(1)* %out
  br label %endif

endif:
  ret void
}