; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck -check-prefixes=TONGA %s
; RUN: llc < %s -march=amdgcn -mcpu=gfx810 -verify-machineinstrs | FileCheck -check-prefixes=GFX81 %s
; RUN: llc < %s -march=amdgcn -mcpu=gfx900 -verify-machineinstrs | FileCheck -check-prefixes=GFX9 %s
; RUN: llc < %s -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | FileCheck -check-prefixes=GFX10 %s

AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
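The commit message above notes that the fail-policy and cache-control operands are passed as i32 bit fields. A minimal Python sketch of how such fields could be decoded follows. It is an illustration only, not part of the test: the helper names `ImageFlags` and `decode_image_flags` are hypothetical, and the bit layout (bit 0 = TFE, bit 1 = LWE for the fail-policy operand; bit 0 = GLC, bit 1 = SLC for the cache operand) is an assumption based on the operand comments in LLVM's intrinsic definitions. It is consistent with this test, where a fail-policy operand of `i32 1` yields `tfe` on the emitted instruction.

```python
from typing import NamedTuple


class ImageFlags(NamedTuple):
    """Decoded modifier bits (hypothetical helper type, not an LLVM API)."""
    tfe: bool
    lwe: bool
    glc: bool
    slc: bool


def decode_image_flags(texfail: int, cache: int) -> ImageFlags:
    """Split the two trailing i32 operands into individual flags.

    Assumed layout: texfail bit 0 = TFE, bit 1 = LWE;
    cache bit 0 = GLC, bit 1 = SLC.
    """
    return ImageFlags(
        tfe=bool(texfail & 0x1),
        lwe=bool(texfail & 0x2),
        glc=bool(cache & 0x1),
        slc=bool(cache & 0x2),
    )


if __name__ == "__main__":
    # The *_tfe functions in this test pass "i32 1, i32 0" as the last two operands.
    print(decode_image_flags(1, 0))
```

Packing the flags into one integer per category keeps the intrinsic signature stable if further bits are defined later, which is the extensibility argument the commit message makes.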
define amdgpu_ps half @image_sample_2d_f16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %s, float %t) {
; TONGA-LABEL: image_sample_2d_f16:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: s_mov_b64 s[12:13], exec
; TONGA-NEXT: s_wqm_b64 exec, exec
; TONGA-NEXT: s_and_b64 exec, exec, s[12:13]
; TONGA-NEXT: image_sample v0, v[0:1], s[0:7], s[8:11] dmask:0x1 d16
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_2d_f16:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: s_mov_b64 s[12:13], exec
; GFX81-NEXT: s_wqm_b64 exec, exec
; GFX81-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX81-NEXT: image_sample v0, v[0:1], s[0:7], s[8:11] dmask:0x1 d16
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_2d_f16:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: s_mov_b64 s[12:13], exec
; GFX9-NEXT: s_wqm_b64 exec, exec
; GFX9-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX9-NEXT: image_sample v0, v[0:1], s[0:7], s[8:11] dmask:0x1 d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_2d_f16:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: s_mov_b32 s12, exec_lo
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s12
; GFX10-NEXT: image_sample v0, v[0:1], s[0:7], s[8:11] dmask:0x1 dim:SQ_RSRC_IMG_2D d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %tex = call half @llvm.amdgcn.image.sample.2d.f16.f32(i32 1, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  ret half %tex
}

[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
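The register accounting described in the commit message above can be sketched in a few lines of Python. This is a hedged illustration, not the backend's implementation: the function name `result_vgprs` and its parameters are hypothetical. It assumes d16 (half) results, that each set dmask bit selects one channel, that packed-d16 targets hold two halves per VGPR while unpacked targets hold one, and that TFE appends one extra dword. The packed/unpacked split is inferred from the CHECK lines in this test, where TONGA uses one VGPR per half and GFX81, GFX9 and GFX10 use packed pairs.

```python
def result_vgprs(dmask: int, tfe: bool, packed_d16: bool) -> int:
    """Estimate the number of result VGPRs for a d16 image instruction.

    Hypothetical helper for illustration only.
    """
    if tfe and dmask == 0:
        # Mirrors the dmask=0 fix in the commit message: with TFE the
        # hardware expects at least one data VGPR, so dmask is forced to 1.
        dmask = 1
    channels = bin(dmask).count("1")  # one channel per set dmask bit
    # Packed d16 stores two halves per VGPR; unpacked stores one.
    data = (channels + 1) // 2 if packed_d16 else channels
    return data + (1 if tfe else 0)


if __name__ == "__main__":
    # dmask:0x7 with tfe, as in image_sample_b_2d_v3f16_tfe
    print("unpacked:", result_vgprs(0x7, tfe=True, packed_d16=False))
    print("packed:", result_vgprs(0x7, tfe=True, packed_d16=True))
```

These counts line up with the register ranges in the CHECK lines: for `dmask:0x7 tfe d16`, TONGA writes `v[3:6]` (four registers) while GFX81 and GFX9 write `v[3:5]` (three).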
define amdgpu_ps half @image_sample_2d_f16_tfe(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %s, float %t, i32 addrspace(1)* inreg %out) {
; TONGA-LABEL: image_sample_2d_f16_tfe:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: s_mov_b64 s[14:15], exec
; TONGA-NEXT: s_wqm_b64 exec, exec
; TONGA-NEXT: v_mov_b32_e32 v2, 0
; TONGA-NEXT: v_mov_b32_e32 v3, v2
; TONGA-NEXT: s_and_b64 exec, exec, s[14:15]
; TONGA-NEXT: image_sample v[2:3], v[0:1], s[0:7], s[8:11] dmask:0x1 tfe d16
; TONGA-NEXT: v_mov_b32_e32 v0, s12
; TONGA-NEXT: v_mov_b32_e32 v1, s13
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: flat_store_dword v[0:1], v3
; TONGA-NEXT: v_mov_b32_e32 v0, v2
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_2d_f16_tfe:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: s_mov_b64 s[14:15], exec
; GFX81-NEXT: s_wqm_b64 exec, exec
; GFX81-NEXT: v_mov_b32_e32 v2, 0
; GFX81-NEXT: v_mov_b32_e32 v3, v2
; GFX81-NEXT: s_and_b64 exec, exec, s[14:15]
; GFX81-NEXT: image_sample v[2:3], v[0:1], s[0:7], s[8:11] dmask:0x1 tfe d16
; GFX81-NEXT: v_mov_b32_e32 v0, s12
; GFX81-NEXT: v_mov_b32_e32 v1, s13
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: flat_store_dword v[0:1], v3
; GFX81-NEXT: v_mov_b32_e32 v0, v2
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_2d_f16_tfe:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: s_mov_b64 s[14:15], exec
; GFX9-NEXT: s_wqm_b64 exec, exec
; GFX9-NEXT: v_mov_b32_e32 v4, 0
; GFX9-NEXT: v_mov_b32_e32 v5, v4
; GFX9-NEXT: v_mov_b32_e32 v2, v4
; GFX9-NEXT: v_mov_b32_e32 v3, v5
; GFX9-NEXT: s_and_b64 exec, exec, s[14:15]
; GFX9-NEXT: image_sample v[2:3], v[0:1], s[0:7], s[8:11] dmask:0x1 tfe d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v0, v2
; GFX9-NEXT: global_store_dword v4, v3, s[12:13]
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_2d_f16_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: s_mov_b32 s14, exec_lo
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
; GFX10-NEXT: v_mov_b32_e32 v4, 0
; GFX10-NEXT: v_mov_b32_e32 v5, v4
; GFX10-NEXT: v_mov_b32_e32 v2, v4
; GFX10-NEXT: v_mov_b32_e32 v3, v5
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s14
; GFX10-NEXT: image_sample v[2:3], v[0:1], s[0:7], s[8:11] dmask:0x1 dim:SQ_RSRC_IMG_2D tfe d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: v_mov_b32_e32 v0, v2
; GFX10-NEXT: global_store_dword v4, v3, s[12:13]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %tex = call {half,i32} @llvm.amdgcn.image.sample.2d.f16i32.f32(i32 1, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 1, i32 0)
  %tex.vec = extractvalue {half, i32} %tex, 0
  %tex.err = extractvalue {half, i32} %tex, 1
  store i32 %tex.err, i32 addrspace(1)* %out, align 4
  ret half %tex.vec
}

define amdgpu_ps float @image_sample_c_d_1d_v2f16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %zcompare, float %dsdh, float %dsdv, float %s) {
; TONGA-LABEL: image_sample_c_d_1d_v2f16:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: image_sample_c_d v[0:1], v[0:3], s[0:7], s[8:11] dmask:0x3 d16
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; TONGA-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_c_d_1d_v2f16:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: image_sample_c_d v0, v[0:3], s[0:7], s[8:11] dmask:0x3 d16
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_c_d_1d_v2f16:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: image_sample_c_d v0, v[0:3], s[0:7], s[8:11] dmask:0x3 d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_c_d_1d_v2f16:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_sample_c_d v0, v[0:3], s[0:7], s[8:11] dmask:0x3 dim:SQ_RSRC_IMG_1D d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %tex = call <2 x half> @llvm.amdgcn.image.sample.c.d.1d.v2f16.f32.f32(i32 3, float %zcompare, float %dsdh, float %dsdv, float %s, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %r = bitcast <2 x half> %tex to float
  ret float %r
}

define amdgpu_ps <2 x float> @image_sample_c_d_1d_v2f16_tfe(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %zcompare, float %dsdh, float %dsdv, float %s) {
; TONGA-LABEL: image_sample_c_d_1d_v2f16_tfe:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: v_mov_b32_e32 v4, 0
; TONGA-NEXT: v_mov_b32_e32 v5, v4
; TONGA-NEXT: v_mov_b32_e32 v6, v4
; TONGA-NEXT: image_sample_c_d v[4:6], v[0:3], s[0:7], s[8:11] dmask:0x3 tfe d16
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: v_lshlrev_b32_e32 v0, 16, v5
; TONGA-NEXT: v_or_b32_sdwa v0, v4, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; TONGA-NEXT: v_mov_b32_e32 v1, v6
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_c_d_1d_v2f16_tfe:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: v_mov_b32_e32 v4, 0
; GFX81-NEXT: v_mov_b32_e32 v5, v4
; GFX81-NEXT: image_sample_c_d v[4:5], v[0:3], s[0:7], s[8:11] dmask:0x3 tfe d16
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: v_mov_b32_e32 v0, v4
; GFX81-NEXT: v_mov_b32_e32 v1, v5
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_c_d_1d_v2f16_tfe:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: v_mov_b32_e32 v4, 0
; GFX9-NEXT: v_mov_b32_e32 v5, v4
; GFX9-NEXT: image_sample_c_d v[4:5], v[0:3], s[0:7], s[8:11] dmask:0x3 tfe d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v0, v4
; GFX9-NEXT: v_mov_b32_e32 v1, v5
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_c_d_1d_v2f16_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0
; GFX10-NEXT: v_mov_b32_e32 v0, 0
; GFX10-NEXT: v_mov_b32_e32 v4, v1
; GFX10-NEXT: v_mov_b32_e32 v1, v0
; GFX10-NEXT: image_sample_c_d v[0:1], [v5, v4, v2, v3], s[0:7], s[8:11] dmask:0x3 dim:SQ_RSRC_IMG_1D tfe d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %tex = call {<2 x half>,i32} @llvm.amdgcn.image.sample.c.d.1d.v2f16i32.f32.f32(i32 3, float %zcompare, float %dsdh, float %dsdv, float %s, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 1, i32 0)
  %tex.vec = extractvalue {<2 x half>, i32} %tex, 0
  %tex.err = extractvalue {<2 x half>, i32} %tex, 1
  %tex.vecf = bitcast <2 x half> %tex.vec to float
  %r.0 = insertelement <2 x float> undef, float %tex.vecf, i32 0
  %tex.errf = bitcast i32 %tex.err to float
  %r = insertelement <2 x float> %r.0, float %tex.errf, i32 1
  ret <2 x float> %r
}

define amdgpu_ps <2 x float> @image_sample_b_2d_v3f16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %bias, float %s, float %t) {
; TONGA-LABEL: image_sample_b_2d_v3f16:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: s_mov_b64 s[12:13], exec
; TONGA-NEXT: s_wqm_b64 exec, exec
; TONGA-NEXT: s_and_b64 exec, exec, s[12:13]
; TONGA-NEXT: image_sample_b v[0:2], v[0:2], s[0:7], s[8:11] dmask:0x7 d16
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; TONGA-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; TONGA-NEXT: v_mov_b32_e32 v1, v2
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_b_2d_v3f16:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: s_mov_b64 s[12:13], exec
; GFX81-NEXT: s_wqm_b64 exec, exec
; GFX81-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX81-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0x7 d16
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_b_2d_v3f16:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: s_mov_b64 s[12:13], exec
; GFX9-NEXT: s_wqm_b64 exec, exec
; GFX9-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX9-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0x7 d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_b_2d_v3f16:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: s_mov_b32 s12, exec_lo
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s12
; GFX10-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0x7 dim:SQ_RSRC_IMG_2D d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %tex = call <3 x half> @llvm.amdgcn.image.sample.b.2d.v3f16.f32.f32(i32 7, float %bias, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
  %tex_wide = shufflevector <3 x half> %tex, <3 x half> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %r = bitcast <4 x half> %tex_wide to <2 x float>
  ret <2 x float> %r
}

define amdgpu_ps <4 x float> @image_sample_b_2d_v3f16_tfe(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %bias, float %s, float %t) {
; TONGA-LABEL: image_sample_b_2d_v3f16_tfe:
; TONGA: ; %bb.0: ; %main_body
; TONGA-NEXT: s_mov_b64 s[12:13], exec
; TONGA-NEXT: s_wqm_b64 exec, exec
; TONGA-NEXT: v_mov_b32_e32 v3, 0
; TONGA-NEXT: v_mov_b32_e32 v4, v3
; TONGA-NEXT: v_mov_b32_e32 v5, v3
; TONGA-NEXT: v_mov_b32_e32 v6, v3
; TONGA-NEXT: s_and_b64 exec, exec, s[12:13]
; TONGA-NEXT: image_sample_b v[3:6], v[0:2], s[0:7], s[8:11] dmask:0x7 tfe d16
; TONGA-NEXT: s_waitcnt vmcnt(0)
; TONGA-NEXT: v_lshlrev_b32_e32 v0, 16, v4
; TONGA-NEXT: v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; TONGA-NEXT: v_mov_b32_e32 v1, v5
; TONGA-NEXT: v_mov_b32_e32 v2, v6
; TONGA-NEXT: ; return to shader part epilog
;
; GFX81-LABEL: image_sample_b_2d_v3f16_tfe:
; GFX81: ; %bb.0: ; %main_body
; GFX81-NEXT: s_mov_b64 s[12:13], exec
; GFX81-NEXT: s_wqm_b64 exec, exec
; GFX81-NEXT: v_mov_b32_e32 v3, 0
; GFX81-NEXT: v_mov_b32_e32 v4, v3
; GFX81-NEXT: v_mov_b32_e32 v5, v3
; GFX81-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX81-NEXT: image_sample_b v[3:5], v[0:2], s[0:7], s[8:11] dmask:0x7 tfe d16
; GFX81-NEXT: s_waitcnt vmcnt(0)
; GFX81-NEXT: v_mov_b32_e32 v0, v3
; GFX81-NEXT: v_mov_b32_e32 v1, v4
; GFX81-NEXT: v_mov_b32_e32 v2, v5
; GFX81-NEXT: ; return to shader part epilog
;
; GFX9-LABEL: image_sample_b_2d_v3f16_tfe:
; GFX9: ; %bb.0: ; %main_body
; GFX9-NEXT: s_mov_b64 s[12:13], exec
; GFX9-NEXT: s_wqm_b64 exec, exec
; GFX9-NEXT: v_mov_b32_e32 v3, 0
; GFX9-NEXT: v_mov_b32_e32 v4, v3
; GFX9-NEXT: v_mov_b32_e32 v5, v3
; GFX9-NEXT: s_and_b64 exec, exec, s[12:13]
; GFX9-NEXT: image_sample_b v[3:5], v[0:2], s[0:7], s[8:11] dmask:0x7 tfe d16
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v0, v3
; GFX9-NEXT: v_mov_b32_e32 v1, v4
; GFX9-NEXT: v_mov_b32_e32 v2, v5
; GFX9-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_sample_b_2d_v3f16_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: s_mov_b32 s12, exec_lo
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
; GFX10-NEXT: v_mov_b32_e32 v3, v0
; GFX10-NEXT: v_mov_b32_e32 v0, 0
; GFX10-NEXT: v_mov_b32_e32 v5, v2
; GFX10-NEXT: v_mov_b32_e32 v4, v1
; GFX10-NEXT: v_mov_b32_e32 v1, v0
; GFX10-NEXT: v_mov_b32_e32 v2, v0
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s12
; GFX10-NEXT: image_sample_b v[0:2], v[3:5], s[0:7], s[8:11] dmask:0x7 dim:SQ_RSRC_IMG_2D tfe d16
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: ; return to shader part epilog
main_body:
|
|
|
|
%tex = call {<3 x half>,i32} @llvm.amdgcn.image.sample.b.2d.v3f16i32.f32.f32(i32 7, float %bias, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 1, i32 0)
|
|
|
|
%tex.vec = extractvalue {<3 x half>, i32} %tex, 0
|
|
|
|
%tex.vec_wide = shufflevector <3 x half> %tex.vec, <3 x half> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
|
|
|
|
%tex.err = extractvalue {<3 x half>, i32} %tex, 1
|
|
|
|
%tex.vecf = bitcast <4 x half> %tex.vec_wide to <2 x float>
|
|
|
|
%tex.vecf.0 = extractelement <2 x float> %tex.vecf, i32 0
|
|
|
|
%tex.vecf.1 = extractelement <2 x float> %tex.vecf, i32 1
|
|
|
|
%r.0 = insertelement <4 x float> undef, float %tex.vecf.0, i32 0
|
|
|
|
%r.1 = insertelement <4 x float> %r.0, float %tex.vecf.1, i32 1
|
|
|
|
%tex.errf = bitcast i32 %tex.err to float
|
|
|
|
%r = insertelement <4 x float> %r.1, float %tex.errf, i32 2
|
|
|
|
ret <4 x float> %r
|
|
|
|
}
|
|
|
|
|
2018-04-04 18:58:54 +08:00
|
|
|
define amdgpu_ps <2 x float> @image_sample_b_2d_v4f16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %bias, float %s, float %t) {
|
2020-01-29 21:04:56 +08:00
|
|
|
; TONGA-LABEL: image_sample_b_2d_v4f16:
|
|
|
|
; TONGA: ; %bb.0: ; %main_body
|
|
|
|
; TONGA-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; TONGA-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; TONGA-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; TONGA-NEXT: image_sample_b v[0:3], v[0:2], s[0:7], s[8:11] dmask:0xf d16
|
|
|
|
; TONGA-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; TONGA-NEXT: v_lshlrev_b32_e32 v1, 16, v1
|
|
|
|
; TONGA-NEXT: v_lshlrev_b32_e32 v3, 16, v3
|
|
|
|
; TONGA-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
|
|
|
|
; TONGA-NEXT: v_or_b32_sdwa v1, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
|
|
|
|
; TONGA-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX81-LABEL: image_sample_b_2d_v4f16:
|
|
|
|
; GFX81: ; %bb.0: ; %main_body
|
|
|
|
; GFX81-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; GFX81-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; GFX81-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; GFX81-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0xf d16
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX81-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX81-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX9-LABEL: image_sample_b_2d_v4f16:
|
|
|
|
; GFX9: ; %bb.0: ; %main_body
|
|
|
|
; GFX9-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; GFX9-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; GFX9-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; GFX9-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0xf d16
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX9-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX9-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX10-LABEL: image_sample_b_2d_v4f16:
|
|
|
|
; GFX10: ; %bb.0: ; %main_body
|
|
|
|
; GFX10-NEXT: s_mov_b32 s12, exec_lo
|
|
|
|
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
|
|
|
|
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s12
|
|
|
|
; GFX10-NEXT: image_sample_b v[0:1], v[0:2], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D d16
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
2018-04-04 18:58:54 +08:00
|
|
|
main_body:
|
|
|
|
%tex = call <4 x half> @llvm.amdgcn.image.sample.b.2d.v4f16.f32.f32(i32 15, float %bias, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
|
|
|
|
%r = bitcast <4 x half> %tex to <2 x float>
|
|
|
|
ret <2 x float> %r
|
|
|
|
}
|
|
|
|
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
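The commit message above says TFE needs an extra result register and that a dmask of 0 is forced to 1 when TFE is set. As a rough illustration only (this is not code from LLVM), the Python sketch below estimates how many result VGPRs a d16 image instruction occupies. The function name `result_vgprs` and the `packed_d16` flag are hypothetical, and the packed/unpacked split is an assumption read off the CHECK lines in this test: the TONGA lines use one dword per half, while the GFX81, GFX9 and GFX10 lines pack two halves per dword.

```python
def result_vgprs(dmask: int, tfe: bool, packed_d16: bool) -> int:
    """Estimate the result VGPR count of a d16 image instruction (illustrative only)."""
    # Per the commit message, dmask=0 with TFE is forced to dmask=1 so that
    # there is at least one data register ahead of the TFE status register.
    if tfe and dmask == 0:
        dmask = 1
    # Each set bit in the 4-bit dmask enables one result channel.
    channels = bin(dmask & 0xF).count("1")
    # Assumption: packed targets hold two halves per dword, unpacked targets one.
    data_regs = (channels + 1) // 2 if packed_d16 else channels
    # TFE appends one extra dword for the status value.
    return data_regs + (1 if tfe else 0)


# dmask:0xf tfe d16 -> v[3:7] on TONGA (5 registers), v[3:5] on GFX81/GFX9 (3 registers)
print(result_vgprs(0xF, tfe=True, packed_d16=False))
print(result_vgprs(0xF, tfe=True, packed_d16=True))
```

The counts agree with the register ranges checked in this file, for example `v[3:6]` for `dmask:0x7 tfe d16` on TONGA and `v[0:1]` for `dmask:0xf d16` on GFX81; other targets or instruction forms may not follow this model.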
|
|
|
define amdgpu_ps <4 x float> @image_sample_b_2d_v4f16_tfe(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %bias, float %s, float %t) {
|
2020-01-29 21:04:56 +08:00
|
|
|
; TONGA-LABEL: image_sample_b_2d_v4f16_tfe:
|
|
|
|
; TONGA: ; %bb.0: ; %main_body
|
|
|
|
; TONGA-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; TONGA-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v3, 0
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v4, v3
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v5, v3
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v6, v3
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v7, v3
|
|
|
|
; TONGA-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; TONGA-NEXT: image_sample_b v[3:7], v[0:2], s[0:7], s[8:11] dmask:0xf tfe d16
|
|
|
|
; TONGA-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; TONGA-NEXT: v_lshlrev_b32_e32 v0, 16, v4
|
|
|
|
; TONGA-NEXT: v_lshlrev_b32_e32 v1, 16, v6
|
|
|
|
; TONGA-NEXT: v_or_b32_sdwa v0, v3, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
|
|
|
|
; TONGA-NEXT: v_or_b32_sdwa v1, v5, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
|
|
|
|
; TONGA-NEXT: v_mov_b32_e32 v2, v7
|
|
|
|
; TONGA-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX81-LABEL: image_sample_b_2d_v4f16_tfe:
|
|
|
|
; GFX81: ; %bb.0: ; %main_body
|
|
|
|
; GFX81-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; GFX81-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v3, 0
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v4, v3
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v5, v3
|
|
|
|
; GFX81-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; GFX81-NEXT: image_sample_b v[3:5], v[0:2], s[0:7], s[8:11] dmask:0xf tfe d16
|
|
|
|
; GFX81-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v0, v3
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v1, v4
|
|
|
|
; GFX81-NEXT: v_mov_b32_e32 v2, v5
|
|
|
|
; GFX81-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX9-LABEL: image_sample_b_2d_v4f16_tfe:
|
|
|
|
; GFX9: ; %bb.0: ; %main_body
|
|
|
|
; GFX9-NEXT: s_mov_b64 s[12:13], exec
|
|
|
|
; GFX9-NEXT: s_wqm_b64 exec, exec
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v3, 0
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v4, v3
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v5, v3
|
|
|
|
; GFX9-NEXT: s_and_b64 exec, exec, s[12:13]
|
|
|
|
; GFX9-NEXT: image_sample_b v[3:5], v[0:2], s[0:7], s[8:11] dmask:0xf tfe d16
|
|
|
|
; GFX9-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v0, v3
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v1, v4
|
|
|
|
; GFX9-NEXT: v_mov_b32_e32 v2, v5
|
|
|
|
; GFX9-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX10-LABEL: image_sample_b_2d_v4f16_tfe:
|
|
|
|
; GFX10: ; %bb.0: ; %main_body
|
|
|
|
; GFX10-NEXT: s_mov_b32 s12, exec_lo
|
|
|
|
; GFX10-NEXT: s_wqm_b32 exec_lo, exec_lo
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v3, v0
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v0, 0
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v5, v2
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v4, v1
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v1, v0
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v2, v0
|
|
|
|
; GFX10-NEXT: s_and_b32 exec_lo, exec_lo, s12
|
|
|
|
; GFX10-NEXT: image_sample_b v[0:2], v[3:5], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D tfe d16
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
2019-01-14 19:55:24 +08:00
|
|
|
main_body:
|
|
|
|
%tex = call {<4 x half>,i32} @llvm.amdgcn.image.sample.b.2d.v4f16i32.f32.f32(i32 15, float %bias, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 1, i32 0)
|
|
|
|
%tex.vec = extractvalue {<4 x half>, i32} %tex, 0
|
|
|
|
%tex.err = extractvalue {<4 x half>, i32} %tex, 1
|
|
|
|
%tex.vecf = bitcast <4 x half> %tex.vec to <2 x float>
|
|
|
|
%tex.vecf.0 = extractelement <2 x float> %tex.vecf, i32 0
|
|
|
|
%tex.vecf.1 = extractelement <2 x float> %tex.vecf, i32 1
|
|
|
|
%r.0 = insertelement <4 x float> undef, float %tex.vecf.0, i32 0
|
|
|
|
%r.1 = insertelement <4 x float> %r.0, float %tex.vecf.1, i32 1
|
|
|
|
%tex.errf = bitcast i32 %tex.err to float
|
|
|
|
%r = insertelement <4 x float> %r.1, float %tex.errf, i32 2
|
|
|
|
ret <4 x float> %r
|
|
|
|
}
|
|
|
|
|
2018-04-04 18:58:54 +08:00
|
|
|
declare half @llvm.amdgcn.image.sample.2d.f16.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2019-01-14 19:55:24 +08:00
|
|
|
declare {half,i32} @llvm.amdgcn.image.sample.2d.f16i32.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2020-07-23 22:59:00 +08:00
|
|
|
declare <3 x half> @llvm.amdgcn.image.sample.2d.v3f16.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2019-01-14 19:55:24 +08:00
|
|
|
declare <4 x half> @llvm.amdgcn.image.sample.2d.v4f16.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
|
|
|
declare {<2 x half>,i32} @llvm.amdgcn.image.sample.2d.v2f16i32.f32(i32, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2018-04-04 18:58:54 +08:00
|
|
|
declare <2 x half> @llvm.amdgcn.image.sample.c.d.1d.v2f16.f32.f32(i32, float, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2019-01-14 19:55:24 +08:00
|
|
|
declare {<2 x half>,i32} @llvm.amdgcn.image.sample.c.d.1d.v2f16i32.f32.f32(i32, float, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2020-07-23 22:59:00 +08:00
|
|
|
declare <3 x half> @llvm.amdgcn.image.sample.b.2d.v3f16.f32.f32(i32, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
|
|
|
declare {<3 x half>,i32} @llvm.amdgcn.image.sample.b.2d.v3f16i32.f32.f32(i32, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2018-04-04 18:58:54 +08:00
|
|
|
declare <4 x half> @llvm.amdgcn.image.sample.b.2d.v4f16.f32.f32(i32, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2019-01-14 19:55:24 +08:00
|
|
|
declare {<4 x half>,i32} @llvm.amdgcn.image.sample.b.2d.v4f16i32.f32.f32(i32, float, float, float, <8 x i32>, <4 x i32>, i1, i32, i32) #1
|
2018-04-04 18:58:54 +08:00
|
|
|
|
|
|
|
attributes #0 = { nounwind }
|
|
|
|
attributes #1 = { nounwind readonly }
|
|
|
|
attributes #2 = { nounwind readnone }
|