2020-01-29 21:04:56 +08:00
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefixes=VERDE %s
; RUN: llc -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=FIJI %s
; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX6789 %s
; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=-enable-prt-strict-null -verify-machineinstrs < %s | FileCheck -check-prefixes=NOPRT %s
; RUN: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs -show-mc-encoding < %s | FileCheck -check-prefixes=GFX10 %s

AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
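The Python sketch below decodes the 'texfailpolicy' bit field into the modifiers it enables. The bit assignment is an assumption inferred from the tests in this file, not taken from the intrinsic definitions: `i32 1` produces `tfe` in the check lines and `i32 2` produces `lwe`. The helper name `decode_texfailpolicy` is hypothetical. 'cachectrl' is not modeled.

```python
# Hypothetical helper (not LLVM code): decodes the texfailpolicy operand of
# the dimension-aware image intrinsics into the modifiers it enables.
# Assumption, inferred from the tests rather than the intrinsic definitions:
#   bit 0 (value 1) -> tfe, bit 1 (value 2) -> lwe.
TFE_BIT = 1 << 0
LWE_BIT = 1 << 1


def decode_texfailpolicy(policy: int) -> dict:
    """Return which texture-fail modifiers a texfailpolicy value enables."""
    unknown = policy & ~(TFE_BIT | LWE_BIT)
    if unknown:
        # Any other bits are outside what this sketch models.
        raise ValueError(f"unmodeled texfailpolicy bits: {unknown:#x}")
    return {"tfe": bool(policy & TFE_BIT), "lwe": bool(policy & LWE_BIT)}


# The three policies used in this file: load_1d, load_1d_tfe, load_1d_lwe.
for policy in (0, 1, 2):
    print(policy, decode_texfailpolicy(policy))
```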
define amdgpu_ps <4 x float> @load_1d(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm
2020-09-23 23:16:39 +08:00
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers, since the vector size must be a
power of 2.
This change consists of roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Update lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(this is where the bulk of the work for these instruction types is now done).
3. Add extra verification code to catch cases where an intrinsic is used with
insufficient return registers.
4. Modify the adjustWritemask optimisation to account for TFE/LWE being
enabled (extra registers must be kept for the error return value).
5. Add an extra pass to zero-initialize the error return value. If the error
does not occur, the register is not written, so it must be zeroed before use.
Also add a new (on by default) option, required for sparse texture support,
that zero-initializes ALL return values.
6. Disable the inst_combine optimization in the presence of TFE/LWE (a later
TODO is to re-enable it and handle these cases correctly).
There's now an additional fix to avoid dmask=0:
For an image intrinsic with TFE where all result channels except TFE
were unused, I was getting an image instruction with dmask=0 and only a
single VGPR result for TFE. That is incorrect because the hardware
assumes there is at least one VGPR result, plus the one for TFE.
Fixed by forcing dmask to 1, which gives the desired two-VGPR result
with TFE in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work around a compiler bug on the powerpc_le
platform that caused a buildbot failure on the previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
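The register accounting described above can be sketched as a small Python model. It makes the following assumptions:
- Results are 32-bit (non-D16), so each enabled dmask channel takes one VGPR.
- TFE/LWE adds one more VGPR.
- The dmask=0 fix applies, so dmask is forced to 1 when only the error value is used.
- The `prt_strict_null` flag corresponds to the on-by-default zero-initialization option, which the NOPRT RUN line disables with `-mattr=-enable-prt-strict-null`.

The function `image_result_layout` is hypothetical and is not part of LLVM.

```python
# Hypothetical model (not LLVM code) of the result registers an image load
# needs when TFE or LWE is enabled, following the commit message above.
# Assumes 32-bit (non-D16) results: one VGPR per enabled dmask channel.


def image_result_layout(dmask: int, tfe_or_lwe: bool, prt_strict_null: bool = True):
    """Return (dmask, number of result VGPRs, indices of VGPRs zeroed before the load)."""
    if tfe_or_lwe and dmask == 0:
        # The hardware assumes at least one data VGPR plus the TFE/LWE one,
        # so dmask is forced to 1.
        dmask = 1
    data_regs = bin(dmask & 0xF).count("1")
    num_regs = data_regs + (1 if tfe_or_lwe else 0)
    if not tfe_or_lwe:
        zeroed = []  # no error register, nothing to pre-initialize
    elif prt_strict_null:
        zeroed = list(range(num_regs))  # zero ALL return values
    else:
        zeroed = [num_regs - 1]  # zero only the error register
    return dmask, num_regs, zeroed


print(image_result_layout(0xF, False))
print(image_result_layout(0xF, True))
print(image_result_layout(0xF, True, prt_strict_null=False))
print(image_result_layout(0x0, True))
```

With dmask 0xf, the model matches the check lines in this file:
- Without TFE it gives 4 registers (v[0:3] in load_1d).
- With TFE it gives 5 registers, all zeroed (v0 to v4 in load_1d_tfe).
- Under NOPRT only the last register is zeroed (v4).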
define amdgpu_ps <4 x float> @load_1d_tfe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_tfe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_tfe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_tfe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_tfe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v0, s[0:7] dmask:0xf unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x1f,0x01,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_1d_lwe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_lwe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm lwe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_lwe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm lwe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_lwe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf unorm lwe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_lwe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v0, s[0:7] dmask:0xf unorm lwe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_lwe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v5, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm lwe ; encoding: [0x00,0x1f,0x02,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>, i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 2, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_2d(<8 x i32> inreg %rsrc, i32 %s, i32 %t) {
; VERDE-LABEL: load_2d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.2d.v4f32.i32(i32 15, i32 %s, i32 %t, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_2d_tfe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t) {
; VERDE-LABEL: load_2d_tfe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2d_tfe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2d_tfe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2d_tfe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:1], s[0:7] dmask:0xf unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2d_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x1f,0x01,0xf0,0x05,0x00,0x00,0x00]
[AMDGPU] Remove dubious logic in bidirectional list scheduler
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
2019-10-07 22:33:59 +08:00
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
|
[AMDGPU] Remove dubious logic in bidirectional list scheduler
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
2019-10-07 22:33:59 +08:00
|
|
|
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
|
|
|
main_body:
|
|
|
|
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.2d.v4f32i32.i32(i32 15, i32 %s, i32 %t, <8 x i32> %rsrc, i32 1, i32 0)
|
|
|
|
%v.vec = extractvalue {<4 x float>, i32} %v, 0
|
|
|
|
%v.err = extractvalue {<4 x float>, i32} %v, 1
|
|
|
|
store i32 %v.err, i32 addrspace(1)* %out, align 4
|
|
|
|
ret <4 x float> %v.vec
|
|
|
|
}
|
|
|
|
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
define amdgpu_ps <4 x float> @load_3d(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %r) {
; VERDE-LABEL: load_3d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_3d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_3d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_3d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_3d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm ; encoding: [0x10,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.3d.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %r, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_3d_tfe_lwe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %r) {
; VERDE-LABEL: load_3d_tfe_lwe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_3d_tfe_lwe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_3d_tfe_lwe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_3d_tfe_lwe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:2], s[0:7] dmask:0xf unorm tfe lwe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_3d_tfe_lwe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm tfe lwe ; encoding: [0x10,0x1f,0x03,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.3d.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %r, <8 x i32> %rsrc, i32 3, i32 0)
  %v.vec = extractvalue {<4 x float>, i32} %v, 0
  %v.err = extractvalue {<4 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_cube(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %slice) {
|
2020-01-29 21:04:56 +08:00
|
|
|
; VERDE-LABEL: load_cube:
|
|
|
|
; VERDE: ; %bb.0: ; %main_body
|
|
|
|
; VERDE-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; VERDE-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; VERDE-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; FIJI-LABEL: load_cube:
|
|
|
|
; FIJI: ; %bb.0: ; %main_body
|
|
|
|
; FIJI-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; FIJI-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; FIJI-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX6789-LABEL: load_cube:
|
|
|
|
; GFX6789: ; %bb.0: ; %main_body
|
|
|
|
; GFX6789-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX6789-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX6789-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; NOPRT-LABEL: load_cube:
|
|
|
|
; NOPRT: ; %bb.0: ; %main_body
|
|
|
|
; NOPRT-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; NOPRT-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; NOPRT-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX10-LABEL: load_cube:
|
|
|
|
; GFX10: ; %bb.0: ; %main_body
|
|
|
|
; GFX10-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm ; encoding: [0x18,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
|
|
|
|
; GFX10-NEXT: ; implicit-def: $vcc_hi
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
main_body:
|
|
|
|
%v = call <4 x float> @llvm.amdgcn.image.load.cube.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
|
|
|
|
ret <4 x float> %v
|
|
|
|
}
|
|
|
|
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work around a compiler bug on the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
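The register-count rule described above can be sketched as a small Python toy model. It is not LLVM code. It assumes 32-bit results with no d16 packing: one VGPR per channel enabled in `dmask`, plus one extra VGPR for the TFE/LWE value, and `dmask` forced to 1 when only the TFE/LWE value is used. `image_load_result_vgprs` is a hypothetical helper. Its counts match the `image_load v[0:3]` and `v[0:4]` destination ranges checked in this file.

```python
def image_load_result_vgprs(dmask: int, tfe: bool = False, lwe: bool = False) -> int:
    """Toy model: VGPRs written by an image_load with 32-bit channels (no d16)."""
    if not 0 <= dmask <= 0xF:
        raise ValueError("dmask is a 4-bit channel mask")
    data_vgprs = bin(dmask).count("1")  # one VGPR per enabled channel
    if tfe or lwe:
        # Hardware assumes at least one data VGPR, so the compiler forces
        # dmask=0 to dmask=1; the TFE/LWE value lands in the VGPR after the data.
        return max(data_vgprs, 1) + 1
    return data_vgprs


print(image_load_result_vgprs(0xF))            # 4, e.g. image_load v[0:3]
print(image_load_result_vgprs(0xF, lwe=True))  # 5, e.g. image_load v[0:4] ... lwe
print(image_load_result_vgprs(0x0, tfe=True))  # 2, dmask forced to 1
```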
define amdgpu_ps <4 x float> @load_cube_lwe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %slice) {
; VERDE-LABEL: load_cube_lwe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_cube_lwe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_cube_lwe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_cube_lwe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:2], s[0:7] dmask:0xf unorm lwe da
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_cube_lwe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
[AMDGPU] Remove dubious logic in bidirectional list scheduler
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been a low-priority tie-breaker.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
2019-10-07 22:33:59 +08:00
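The problem with comparing TopCand.Reason against BotCand.Reason can be reproduced in a small Python toy model. Everything here is an assumption for illustration:

- Three made-up criteria stand in for the real tryCandidate heuristics.
- `pick_old` is a simplified stand-in for the removed AMDGPU logic: it trusts whichever side recorded the higher-ranked reason.
- `pick_new` compares the two candidates directly, as the fix does.

Because each queue remembers only the reason for its most recent win, a good top candidate that last won on ORDER loses to a mediocre bottom candidate under the old rule.

```python
from dataclasses import dataclass
from typing import List, Optional

# Made-up criteria, most important first (not LLVM's real CandReason values).
CRITERIA = [("REG_PRESSURE", "pressure"), ("LATENCY", "latency"), ("ORDER", "order")]
RANK = {name: i for i, (name, _) in enumerate(CRITERIA)}  # 0 = most important


@dataclass
class Cand:
    name: str
    pressure: int  # lower is better
    latency: int   # lower is better
    order: int     # lower is better; last-resort tie-breaker
    reason: Optional[str] = None  # why this candidate won its latest comparison


def try_candidate(cand: Cand, best: Cand) -> Optional[str]:
    """Return the criterion on which cand beats best, or None if it does not."""
    for name, attr in CRITERIA:
        a, b = getattr(cand, attr), getattr(best, attr)
        if a != b:
            return name if a < b else None
    return None


def pick_from_queue(queue: List[Cand]) -> Cand:
    best = queue[0]
    for cand in queue[1:]:
        reason = try_candidate(cand, best)
        if reason is not None:
            cand.reason = reason  # only the most recent win is remembered
            best = cand
    return best


def pick_old(top: Cand, bot: Cand) -> Cand:
    """Simplified removed logic: trust whichever side recorded the higher-ranked reason."""
    def rank(c: Cand) -> int:
        return RANK.get(c.reason, len(CRITERIA))
    return top if rank(top) <= rank(bot) else bot


def pick_new(top: Cand, bot: Cand) -> Cand:
    """The fix: compare the two candidates directly."""
    return bot if try_candidate(bot, top) is not None else top


top = pick_from_queue([Cand("A", 1, 1, 5), Cand("B", 1, 1, 4)])  # B wins on ORDER
bot = pick_from_queue([Cand("D", 3, 3, 0), Cand("C", 3, 2, 1)])  # C wins on LATENCY
print(pick_old(top, bot).name)  # C, despite its worse register pressure
print(pick_new(top, bot).name)  # B
```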
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm lwe ; encoding: [0x18,0x1f,0x02,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.cube.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 2, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}
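In these tests, the second-to-last `i32` argument of each image intrinsic is the texfail bit field:

- `load_cube_lwe` passes 2, and its checks show `lwe`.
- `load_1darray_tfe` passes 1, and its checks show `tfe`.
- The plain loads pass 0 and return a bare `<4 x float>`.

Below is a Python sketch of that decoding. It assumes bit 0 selects TFE and bit 1 selects LWE, as these checks suggest. `decode_texfailctrl` and `result_type` are hypothetical helpers, not LLVM functions.

```python
def decode_texfailctrl(texfailctrl: int) -> dict:
    """Split the texfail operand into flags (assumed: bit 0 = TFE, bit 1 = LWE)."""
    tfe = bool(texfailctrl & 0x1)
    lwe = bool(texfailctrl & 0x2)
    return {"tfe": tfe, "lwe": lwe}


def result_type(texfailctrl: int, data_type: str = "<4 x float>") -> str:
    """IR return type the tests in this file use for a given texfail value."""
    flags = decode_texfailctrl(texfailctrl)
    if flags["tfe"] or flags["lwe"]:
        # The fail/residency value comes back as an extra i32 in an aggregate.
        return "{" + data_type + ", i32}"
    return data_type


print(decode_texfailctrl(2))  # {'tfe': False, 'lwe': True}
print(result_type(1))         # {<4 x float>, i32}
print(result_type(0))         # <4 x float>
```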
define amdgpu_ps <4 x float> @load_1darray(<8 x i32> inreg %rsrc, i32 %s, i32 %slice) {
; VERDE-LABEL: load_1darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:1], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm ; encoding: [0x20,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.1darray.v4f32.i32(i32 15, i32 %s, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}
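The dimension-aware intrinsics name the texture type, and the checks in this file show how each type lowers. GFX10 encodes it in an explicit `dim:` operand, while VERDE, FIJI and GFX9 mark these array and cube images with `da`. The address VGPR count follows from the coordinate operands: `v[5:7]` for cube, `v[0:1]` for 1darray and `v[0:2]` for 2darray.

Below is a Python sketch that tabulates only what these checks show. `DIM_INFO` and `dim_modifier` are hypothetical names, and other dimensions are deliberately left out.

```python
from typing import NamedTuple


class DimInfo(NamedTuple):
    gfx10_dim: str    # explicit dim: operand printed on GFX10
    coord_vgprs: int  # address operands passed before <8 x i32> %rsrc


# Only the dimensions exercised by these tests; all are array or cube types.
DIM_INFO = {
    "cube": DimInfo("SQ_RSRC_IMG_CUBE", 3),         # %s, %t, %slice
    "1darray": DimInfo("SQ_RSRC_IMG_1D_ARRAY", 2),  # %s, %slice
    "2darray": DimInfo("SQ_RSRC_IMG_2D_ARRAY", 3),  # %s, %t, %slice
}


def dim_modifier(intrinsic: str, gfx10: bool) -> str:
    """Dimension-related modifier expected on the image_load for an intrinsic."""
    # Names look like llvm.amdgcn.image.load.<dim>.<overload types>
    info = DIM_INFO[intrinsic.split(".")[4]]
    # Pre-GFX10 targets have no dim operand and flag array/cube images with `da`.
    return f"dim:{info.gfx10_dim}" if gfx10 else "da"


print(dim_modifier("llvm.amdgcn.image.load.1darray.v4f32.i32", gfx10=True))   # dim:SQ_RSRC_IMG_1D_ARRAY
print(dim_modifier("llvm.amdgcn.image.load.cube.v4f32i32.i32", gfx10=False))  # da
```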
define amdgpu_ps <4 x float> @load_1darray_tfe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %slice) {
; VERDE-LABEL: load_1darray_tfe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe da
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1darray_tfe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe da
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1darray_tfe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf unorm tfe da
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1darray_tfe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:1], s[0:7] dmask:0xf unorm tfe da
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1darray_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm tfe ; encoding: [0x20,0x1f,0x01,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1darray.v4f32i32.i32(i32 15, i32 %s, i32 %slice, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}
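The GFX6789 and NOPRT checks for `load_1darray_tfe` (and likewise `load_cube_lwe`) differ only in zero-initialization, which matches the behaviour described in the TFE/LWE commit:

- By default, the coordinates are first copied into v5/v6, and then all five result VGPRs v0..v4 are set to 0 before the `image_load`.
- With `-mattr=-enable-prt-strict-null`, only v4 (the TFE/LWE slot) is zeroed.

Below is a Python toy model of that rule. It assumes that `enable-prt-strict-null` is the on-by-default option the commit mentions. `zero_init_regs` is a hypothetical helper.

```python
from typing import List


def zero_init_regs(data_vgprs: int, prt_strict_null: bool = True) -> List[int]:
    """Result-VGPR offsets zeroed before a TFE/LWE image load (toy model).

    The TFE/LWE value follows the data VGPRs and is only written on failure,
    so it is always zeroed; prt-strict-null additionally zeroes every data VGPR.
    """
    fail_slot = data_vgprs
    if prt_strict_null:
        return list(range(data_vgprs + 1))
    return [fail_slot]


print(zero_init_regs(4))                         # [0, 1, 2, 3, 4] -> v0..v4
print(zero_init_regs(4, prt_strict_null=False))  # [4] -> only v4
```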
define amdgpu_ps <4 x float> @load_2darray(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %slice) {
|
2020-01-29 21:04:56 +08:00
|
|
|
; VERDE-LABEL: load_2darray:
|
|
|
|
; VERDE: ; %bb.0: ; %main_body
|
|
|
|
; VERDE-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; VERDE-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; VERDE-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; FIJI-LABEL: load_2darray:
|
|
|
|
; FIJI: ; %bb.0: ; %main_body
|
|
|
|
; FIJI-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
|
2020-09-23 23:16:39 +08:00
|
|
|
; FIJI-NEXT: s_waitcnt vmcnt(0)
|
2020-01-29 21:04:56 +08:00
|
|
|
; FIJI-NEXT: ; return to shader part epilog
|
|
|
|
;
; GFX6789-LABEL: load_2darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm ; encoding: [0x28,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.2darray.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_2darray_lwe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %slice) {
; VERDE-LABEL: load_2darray_lwe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2darray_lwe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2darray_lwe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm lwe da
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2darray_lwe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:2], s[0:7] dmask:0xf unorm lwe da
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2darray_lwe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm lwe ; encoding: [0x28,0x1f,0x02,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.2darray.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 2, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_2dmsaa(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %fragid) {
; VERDE-LABEL: load_2dmsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2dmsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2dmsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2dmsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2dmsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA unorm ; encoding: [0x30,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.2dmsaa.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_2dmsaa_both(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %fragid) {
; VERDE-LABEL: load_2dmsaa_both:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2dmsaa_both:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2dmsaa_both:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe lwe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2dmsaa_both:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:2], s[0:7] dmask:0xf unorm tfe lwe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2dmsaa_both:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA unorm tfe lwe ; encoding: [0x30,0x1f,0x03,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
|
|
|
main_body:
|
|
|
|
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.2dmsaa.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %fragid, <8 x i32> %rsrc, i32 3, i32 0)
|
|
|
|
%v.vec = extractvalue {<4 x float>, i32} %v, 0
|
|
|
|
%v.err = extractvalue {<4 x float>, i32} %v, 1
|
|
|
|
store i32 %v.err, i32 addrspace(1)* %out, align 4
|
|
|
|
ret <4 x float> %v.vec
|
|
|
|
}
|
|
|
|
|
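The calls in this file pass the texfail control as the i32 immediate that follows `%rsrc`. It is 0 for the plain loads, 1 in `load_2darraymsaa_tfe` (whose checks carry `tfe`), and 3 in the `2dmsaa` call above. The following is a minimal sketch of decoding that immediate. It assumes bit 0 requests TFE and bit 1 requests LWE. `decode_texfailctrl` is a hypothetical helper for illustration, not an LLVM API.

```python
from typing import Tuple

# Assumed bit layout of the texfailctrl immediate (bit 0 = TFE, bit 1 = LWE).
TFE_BIT = 0x1
LWE_BIT = 0x2


def decode_texfailctrl(value: int) -> Tuple[bool, bool]:
    """Hypothetical helper: split a texfailctrl immediate into (tfe, lwe).

    Any bit outside the two assumed flags is rejected.
    """
    if value & ~(TFE_BIT | LWE_BIT):
        raise ValueError(f"unsupported texfailctrl bits: {value:#x}")
    return bool(value & TFE_BIT), bool(value & LWE_BIT)


for imm in (0, 1, 3):
    print(imm, decode_texfailctrl(imm))
```

When either flag is set, the intrinsic returns an aggregate `{<4 x float>, i32}` whose second member is the status dword. This is why the TFE/LWE tests take `extractvalue ... 1` and store it to `%out`.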
define amdgpu_ps <4 x float> @load_2darraymsaa(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %slice, i32 %fragid) {
; VERDE-LABEL: load_2darraymsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2darraymsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2darraymsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2darraymsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2darraymsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v[0:3], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm ; encoding: [0x38,0x1f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.2darraymsaa.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %slice, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

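The dimension-aware intrinsics put the texture dimension in the name and pass each coordinate as a separate i32 operand. The following sketch shows the operand order for the load intrinsics used in this file. It is built only from the calls shown here. `DIM_COORDS` covers just these dims and is not a complete list. The helper names are hypothetical.

```python
# Coordinate operands per dimension, taken from the calls in this test only.
DIM_COORDS = {
    "1d": ["s"],
    "2dmsaa": ["s", "t", "fragid"],
    "2darraymsaa": ["s", "t", "slice", "fragid"],
}


def load_operands(dim: str, mip: bool = False) -> list:
    """Operand names of llvm.amdgcn.image.load[.mip].<dim>, in call order."""
    coords = list(DIM_COORDS[dim])
    if mip:
        coords.append("mip")  # the .mip variant takes the mip level after the coordinates
    return ["dmask"] + coords + ["rsrc", "texfailctrl", "cachectrl"]


def address_dwords(dim: str, mip: bool = False) -> int:
    """Per-lane address dwords: one i32 per coordinate (plus mip, if present)."""
    return len(DIM_COORDS[dim]) + (1 if mip else 0)


print(load_operands("2darraymsaa"))
print(load_operands("1d", mip=True))
```

For the two plain-load functions checked in this section, this count matches the address tuple: `v[0:3]` for `load_2darraymsaa` and `v[0:1]` for `load_mip_1d`.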
define amdgpu_ps <4 x float> @load_2darraymsaa_tfe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %slice, i32 %fragid) {
; VERDE-LABEL: load_2darraymsaa_tfe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v8, v3
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load v[0:4], v[5:8], s[0:7] dmask:0xf unorm tfe da
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_2darraymsaa_tfe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v8, v3
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load v[0:4], v[5:8], s[0:7] dmask:0xf unorm tfe da
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_2darraymsaa_tfe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v8, v3
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load v[0:4], v[5:8], s[0:7] dmask:0xf unorm tfe da
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_2darraymsaa_tfe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load v[0:4], v[0:3], s[0:7] dmask:0xf unorm tfe da
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_2darraymsaa_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v8, v3 ; encoding: [0x03,0x03,0x10,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load v[0:4], v[5:8], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm tfe ; encoding: [0x38,0x1f,0x01,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.2darraymsaa.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %slice, i32 %fragid, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
store i32 %v.err, i32 addrspace(1)* %out, align 4
ret <4 x float> %v.vec
}

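The TFE checks above follow a fixed pattern. With `dmask:0xf`, the load writes `v[0:4]`: four data dwords plus a status dword in `v4`, which is the value stored to `%out`. The status dword is not written when no failure occurs, so it is zeroed before the load. With PRT strict-null enabled (the default, used by the VERDE, FIJI and GFX6789 runs), every result dword is zeroed. The NOPRT run (`-mattr=-enable-prt-strict-null`) zeroes only `v4`. The following sketch models that rule. The function names are hypothetical. The forced minimum of one data channel reflects the hardware expecting at least one data VGPR in addition to the status one.

```python
def tfe_result_dwords(dmask: int) -> int:
    """VGPRs written by an image load with TFE or LWE set (hypothetical model)."""
    channels = bin(dmask & 0xF).count("1")
    channels = max(channels, 1)  # dmask 0 is forced to 1: at least one data VGPR
    return channels + 1          # plus the trailing status dword


def zeroed_result_dwords(dmask: int, prt_strict_null: bool = True) -> list:
    """Indices in the result tuple that are zeroed before the load."""
    n = tfe_result_dwords(dmask)
    if prt_strict_null:
        return list(range(n))  # all dwords: VERDE, FIJI and GFX6789 runs
    return [n - 1]             # status dword only: NOPRT run


print(tfe_result_dwords(0xF))            # 5, i.e. v[0:4]
print(zeroed_result_dwords(0xF))         # [0, 1, 2, 3, 4]
print(zeroed_result_dwords(0xF, False))  # [4]
```

`load_mip_1d_lwe` produces the same layout with LWE alone: a `v[0:4]` result, with `v4` stored to `%out`.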
define amdgpu_ps <4 x float> @load_mip_1d(<8 x i32> inreg %rsrc, i32 %s, i32 %mip) {
; VERDE-LABEL: load_mip_1d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_1d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_1d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_1d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:1], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_1d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:1], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call <4 x float> @llvm.amdgcn.image.load.mip.1d.v4f32.i32(i32 15, i32 %s, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_mip_1d_lwe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %mip) {
; VERDE-LABEL: load_mip_1d_lwe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load_mip v[0:4], v[5:6], s[0:7] dmask:0xf unorm lwe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_1d_lwe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load_mip v[0:4], v[5:6], s[0:7] dmask:0xf unorm lwe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_1d_lwe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load_mip v[0:4], v[5:6], s[0:7] dmask:0xf unorm lwe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_1d_lwe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load_mip v[0:4], v[0:1], s[0:7] dmask:0xf unorm lwe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_1d_lwe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[0:4], v[5:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm lwe ; encoding: [0x00,0x1f,0x06,0xf0,0x05,0x00,0x00,0x00]
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
2019-10-07 22:33:59 +08:00
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
|
[AMDGPU] Remove dubious logic in bidirectional list scheduler
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
2019-10-07 22:33:59 +08:00
|
|
|
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
|
2020-09-23 23:16:39 +08:00
|
|
|
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
|
2020-01-29 21:04:56 +08:00
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
|
|
|
main_body:
|
|
|
|
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.mip.1d.v4f32i32.i32(i32 15, i32 %s, i32 %mip, <8 x i32> %rsrc, i32 2, i32 0)
|
|
|
|
%v.vec = extractvalue {<4 x float>, i32} %v, 0
|
|
|
|
%v.err = extractvalue {<4 x float>, i32} %v, 1
|
|
|
|
store i32 %v.err, i32 addrspace(1)* %out, align 4
|
|
|
|
ret <4 x float> %v.vec
|
|
|
|
}
|
|
|
|
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
define amdgpu_ps <4 x float> @load_mip_2d(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_mip_2d_tfe(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d_tfe:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v5, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v7, v2
; VERDE-NEXT: v_mov_b32_e32 v6, v1
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: image_load_mip v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v4, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d_tfe:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v5, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v7, v2
; FIJI-NEXT: v_mov_b32_e32 v6, v1
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: image_load_mip v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v4, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d_tfe:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v5, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v7, v2
; GFX6789-NEXT: v_mov_b32_e32 v6, v1
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: image_load_mip v[0:4], v[5:7], s[0:7] dmask:0xf unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v5, s8
; GFX6789-NEXT: v_mov_b32_e32 v6, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[5:6], v4, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d_tfe:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v4, 0
; NOPRT-NEXT: image_load_mip v[0:4], v[0:2], s[0:7] dmask:0xf unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v5, s8
; NOPRT-NEXT: v_mov_b32_e32 v6, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[5:6], v4, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d_tfe:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v5, v0 ; encoding: [0x00,0x03,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v7, v2 ; encoding: [0x02,0x03,0x0e,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, v1 ; encoding: [0x01,0x03,0x0c,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[0:4], v[5:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x1f,0x05,0xf0,0x05,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v5, s8 ; encoding: [0x08,0x02,0x0a,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v6, s9 ; encoding: [0x09,0x02,0x0c,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[5:6], v4, off ; encoding: [0x00,0x80,0x70,0xdc,0x05,0x04,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 1, i32 0)
  %v.vec = extractvalue {<4 x float>, i32} %v, 0
  %v.err = extractvalue {<4 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <4 x float> %v.vec
}

define amdgpu_ps float @load_1d_V2_tfe_dmask0(<8 x i32> inreg %rsrc, i32 %s) {
|
2020-01-29 21:04:56 +08:00
|
|
|
; VERDE-LABEL: load_1d_V2_tfe_dmask0:
|
|
|
|
; VERDE: ; %bb.0: ; %main_body
|
|
|
|
; VERDE-NEXT: v_mov_b32_e32 v1, 0
|
|
|
|
; VERDE-NEXT: v_mov_b32_e32 v2, v1
|
|
|
|
; VERDE-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
|
|
|
|
; VERDE-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; VERDE-NEXT: v_mov_b32_e32 v0, v2
|
|
|
|
; VERDE-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; FIJI-LABEL: load_1d_V2_tfe_dmask0:
|
|
|
|
; FIJI: ; %bb.0: ; %main_body
|
|
|
|
; FIJI-NEXT: v_mov_b32_e32 v1, 0
|
|
|
|
; FIJI-NEXT: v_mov_b32_e32 v2, v1
|
|
|
|
; FIJI-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
|
|
|
|
; FIJI-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; FIJI-NEXT: v_mov_b32_e32 v0, v2
|
|
|
|
; FIJI-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX6789-LABEL: load_1d_V2_tfe_dmask0:
|
|
|
|
; GFX6789: ; %bb.0: ; %main_body
|
|
|
|
; GFX6789-NEXT: v_mov_b32_e32 v1, 0
|
|
|
|
; GFX6789-NEXT: v_mov_b32_e32 v2, v1
|
|
|
|
; GFX6789-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
|
|
|
|
; GFX6789-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; GFX6789-NEXT: v_mov_b32_e32 v0, v2
|
|
|
|
; GFX6789-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; NOPRT-LABEL: load_1d_V2_tfe_dmask0:
|
|
|
|
; NOPRT: ; %bb.0: ; %main_body
|
|
|
|
; NOPRT-NEXT: v_mov_b32_e32 v1, 0
|
|
|
|
; NOPRT-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x1 unorm tfe
|
|
|
|
; NOPRT-NEXT: s_waitcnt vmcnt(0)
|
|
|
|
; NOPRT-NEXT: v_mov_b32_e32 v0, v1
|
|
|
|
; NOPRT-NEXT: ; return to shader part epilog
|
|
|
|
;
|
|
|
|
; GFX10-LABEL: load_1d_V2_tfe_dmask0:
|
|
|
|
; GFX10: ; %bb.0: ; %main_body
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v1, 0 ; encoding: [0x80,0x02,0x02,0x7e]
|
|
|
|
; GFX10-NEXT: ; implicit-def: $vcc_hi
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v2, v1 ; encoding: [0x01,0x03,0x04,0x7e]
|
|
|
|
; GFX10-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x11,0x01,0xf0,0x00,0x01,0x00,0x00]
|
|
|
|
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
|
|
|
|
; GFX10-NEXT: v_mov_b32_e32 v0, v2 ; encoding: [0x02,0x03,0x00,0x7e]
|
|
|
|
; GFX10-NEXT: ; return to shader part epilog
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. See the test code for details; in essence, IR code that invokes the
intrinsic looks like this:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
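The aggregate return can be pictured as the raw result dwords split into a data part and a trailing status dword. The tests in this file then `bitcast` that status dword to `float`. The Python below is a hypothetical illustration only; both helper names are invented for the sketch. It also assumes, without confirmation from the source, that a nonzero status value indicates a failed access. The commit text states only that the register is written on failure and zeroed beforehand.

```python
import struct


def split_tfe_result(dwords: list[int], num_data: int) -> tuple[list[int], int]:
    """Split raw result dwords into (data, status), mirroring the IR aggregate
    {<N x float>, i32}: element 0 is the data, element 1 the status dword."""
    if len(dwords) != num_data + 1:
        raise ValueError("expected num_data data dwords plus one status dword")
    return dwords[:num_data], dwords[num_data]


def bitcast_i32_to_float(value: int) -> float:
    """Reinterpret a 32-bit integer as an IEEE-754 single (like `bitcast i32 to float`)."""
    return struct.unpack("<f", struct.pack("<I", value & 0xFFFFFFFF))[0]


# Two data channels (1.0 and 2.0 as raw bits) followed by a zeroed status dword.
data, status = split_tfe_result([0x3F800000, 0x40000000, 0], num_data=2)
failed = status != 0  # assumption: nonzero status marks a failed access
print(data, status, failed, bitcast_i32_to_float(status))
```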
This resubmission of the change also includes a slight modification in
SIISelLowering.cpp to work around a compiler bug on the powerpc_le
platform that caused a buildbot failure on the previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
main_body:
%v = call {<2 x float>,i32} @llvm.amdgcn.image.load.1d.v2f32i32.i32(i32 0, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {<2 x float>, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps float @load_1d_V1_tfe_dmask0(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_V1_tfe_dmask0:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v1, 0
; VERDE-NEXT: v_mov_b32_e32 v2, v1
; VERDE-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v2
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_V1_tfe_dmask0:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v1, 0
; FIJI-NEXT: v_mov_b32_e32 v2, v1
; FIJI-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v2
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_V1_tfe_dmask0:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v1, 0
; GFX6789-NEXT: v_mov_b32_e32 v2, v1
; GFX6789-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 unorm tfe
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v2
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_V1_tfe_dmask0:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v1, 0
; NOPRT-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x1 unorm tfe
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v1
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_V1_tfe_dmask0:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v1, 0 ; encoding: [0x80,0x02,0x02,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v2, v1 ; encoding: [0x01,0x03,0x04,0x7e]
; GFX10-NEXT: image_load v[1:2], v0, s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x11,0x01,0xf0,0x00,0x01,0x00,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v2 ; encoding: [0x02,0x03,0x00,0x7e]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {float,i32} @llvm.amdgcn.image.load.1d.f32i32.i32(i32 0, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {float, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps float @load_mip_2d_tfe_dmask0(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d_tfe_dmask0:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v3, 0
; VERDE-NEXT: v_mov_b32_e32 v4, v3
; VERDE-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v4
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d_tfe_dmask0:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v3, 0
; FIJI-NEXT: v_mov_b32_e32 v4, v3
; FIJI-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v4
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d_tfe_dmask0:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v3, 0
; GFX6789-NEXT: v_mov_b32_e32 v4, v3
; GFX6789-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v4
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d_tfe_dmask0:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: image_load_mip v[2:3], v[0:2], s[0:7] dmask:0x1 unorm tfe
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v3
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d_tfe_dmask0:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v3, 0 ; encoding: [0x80,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v4, v3 ; encoding: [0x03,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x11,0x05,0xf0,0x00,0x03,0x00,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v4 ; encoding: [0x04,0x03,0x00,0x7e]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v4f32i32.i32(i32 0, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {<4 x float>, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps float @load_mip_2d_tfe_nouse(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d_tfe_nouse:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v3, 0
; VERDE-NEXT: v_mov_b32_e32 v4, v3
; VERDE-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v4
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d_tfe_nouse:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v3, 0
; FIJI-NEXT: v_mov_b32_e32 v4, v3
; FIJI-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v4
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d_tfe_nouse:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v3, 0
; GFX6789-NEXT: v_mov_b32_e32 v4, v3
; GFX6789-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v4
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d_tfe_nouse:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: image_load_mip v[2:3], v[0:2], s[0:7] dmask:0x1 unorm tfe
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v3
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d_tfe_nouse:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v3, 0 ; encoding: [0x80,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v4, v3 ; encoding: [0x03,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x11,0x05,0xf0,0x00,0x03,0x00,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v4 ; encoding: [0x04,0x03,0x00,0x7e]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v4f32i32.i32(i32 15, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {<4 x float>, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps float @load_mip_2d_tfe_nouse_V2(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d_tfe_nouse_V2:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v3, 0
; VERDE-NEXT: v_mov_b32_e32 v4, v3
; VERDE-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v4
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d_tfe_nouse_V2:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v3, 0
; FIJI-NEXT: v_mov_b32_e32 v4, v3
; FIJI-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v4
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d_tfe_nouse_V2:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v3, 0
; GFX6789-NEXT: v_mov_b32_e32 v4, v3
; GFX6789-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 unorm tfe
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v4
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d_tfe_nouse_V2:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: image_load_mip v[2:3], v[0:2], s[0:7] dmask:0x1 unorm tfe
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v3
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d_tfe_nouse_V2:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v3, 0 ; encoding: [0x80,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v4, v3 ; encoding: [0x03,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x11,0x05,0xf0,0x00,0x03,0x00,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v4 ; encoding: [0x04,0x03,0x00,0x7e]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {<2 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v2f32i32.i32(i32 6, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {<2 x float>, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps float @load_mip_2d_tfe_nouse_V1(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: load_mip_2d_tfe_nouse_V1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v3, 0
; VERDE-NEXT: v_mov_b32_e32 v4, v3
; VERDE-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x2 unorm tfe
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v4
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2d_tfe_nouse_V1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v3, 0
; FIJI-NEXT: v_mov_b32_e32 v4, v3
; FIJI-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x2 unorm tfe
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v4
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2d_tfe_nouse_V1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v3, 0
; GFX6789-NEXT: v_mov_b32_e32 v4, v3
; GFX6789-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x2 unorm tfe
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v4
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2d_tfe_nouse_V1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: image_load_mip v[2:3], v[0:2], s[0:7] dmask:0x2 unorm tfe
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v3
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2d_tfe_nouse_V1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v3, 0 ; encoding: [0x80,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v4, v3 ; encoding: [0x03,0x03,0x08,0x7e]
; GFX10-NEXT: image_load_mip v[3:4], v[0:2], s[0:7] dmask:0x2 dim:SQ_RSRC_IMG_2D unorm tfe ; encoding: [0x08,0x12,0x05,0xf0,0x00,0x03,0x00,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v4 ; encoding: [0x04,0x03,0x00,0x7e]
; GFX10-NEXT: ; return to shader part epilog
main_body:
%v = call {float, i32} @llvm.amdgcn.image.load.mip.2d.f32i32.i32(i32 2, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 1, i32 0)
%v.err = extractvalue {float, i32} %v, 1
%vv = bitcast i32 %v.err to float
ret float %vv
}

define amdgpu_ps <4 x float> @load_1d_tfe_V4_dmask3(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_tfe_V4_dmask3:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v4, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: image_load v[0:3], v4, s[0:7] dmask:0x7 unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v3, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_tfe_V4_dmask3:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v4, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: image_load v[0:3], v4, s[0:7] dmask:0x7 unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v3, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_tfe_V4_dmask3:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v4, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: image_load v[0:3], v4, s[0:7] dmask:0x7 unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v4, s8
; GFX6789-NEXT: v_mov_b32_e32 v5, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[4:5], v3, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_tfe_V4_dmask3:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: image_load v[0:3], v0, s[0:7] dmask:0x7 unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v4, s8
; NOPRT-NEXT: v_mov_b32_e32 v5, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[4:5], v3, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_tfe_V4_dmask3:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v4, v0 ; encoding: [0x00,0x03,0x08,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v5, s9 ; encoding: [0x09,0x02,0x0a,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: image_load v[0:3], v4, s[0:7] dmask:0x7 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x17,0x01,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v4, s8 ; encoding: [0x08,0x02,0x08,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[4:5], v3, off ; encoding: [0x00,0x80,0x70,0xdc,0x04,0x03,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accommodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accommodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
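The commit message above describes how TFE/LWE changes the result layout: one extra status VGPR, dmask 0 forced to 1, and zero-initialization of the results. The Python below is a minimal sketch of that arithmetic, not LLVM's implementation. The helper names are hypothetical, and it assumes that bit 0 of the texfailctrl operand (the `i32 1` in the calls in this file) enables TFE and bit 1 enables LWE. It computes how many VGPRs an image_load writes and which of them are cleared before the load: all of them by default, or only the status register when -enable-prt-strict-null is turned off.

```python
# Hypothetical model of the TFE/LWE result layout; not an LLVM API.

TFE_BIT = 0x1  # assumption: texfailctrl bit 0 enables TFE
LWE_BIT = 0x2  # assumption: texfailctrl bit 1 enables LWE


def texfail_enabled(texfailctrl: int) -> bool:
    """True if the texfailctrl operand requests TFE or LWE."""
    return bool(texfailctrl & (TFE_BIT | LWE_BIT))


def result_vgprs(dmask: int, texfailctrl: int) -> int:
    """Number of VGPRs written by an image_load with this dmask/texfailctrl."""
    tfe = texfail_enabled(texfailctrl)
    # With TFE/LWE on, dmask 0 is forced to 1: the hardware always expects
    # at least one data VGPR in front of the status VGPR.
    if dmask == 0 and tfe:
        dmask = 1
    data = bin(dmask & 0xF).count("1")  # one VGPR per enabled channel
    return data + (1 if tfe else 0)


def zeroed_vgprs(dmask: int, texfailctrl: int, prt_strict_null: bool = True) -> list:
    """Result-register indices zero-initialized before the load."""
    if not texfail_enabled(texfailctrl):
        return []
    n = result_vgprs(dmask, texfailctrl)
    if prt_strict_null:
        return list(range(n))  # default: every result register is cleared
    return [n - 1]             # otherwise only the TFE/LWE status register


# The dmask values used by the load_1d_tfe_V4_* tests (texfailctrl = 1).
for dmask in (0x7, 0x6, 0x8):
    print(hex(dmask), result_vgprs(dmask, 1),
          zeroed_vgprs(dmask, 1), zeroed_vgprs(dmask, 1, prt_strict_null=False))
# 0x7 4 [0, 1, 2, 3] [3]
# 0x6 3 [0, 1, 2] [2]
# 0x8 2 [0, 1] [1]
```

The two modes line up with the GFX6789 and NOPRT runs: for dmask 0x6, GFX6789 zeroes v0 through v2 before `image_load v[0:2]`, while NOPRT zeroes only v2, the register that receives the status value stored to %out.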
main_body:
  %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 7, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
  %v.vec = extractvalue {<4 x float>, i32} %v, 0
  %v.err = extractvalue {<4 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_1d_tfe_V4_dmask2(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_tfe_V4_dmask2:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v3, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: image_load v[0:2], v3, s[0:7] dmask:0x6 unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v2, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_tfe_V4_dmask2:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v3, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: image_load v[0:2], v3, s[0:7] dmask:0x6 unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v2, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_tfe_V4_dmask2:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v3, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: image_load v[0:2], v3, s[0:7] dmask:0x6 unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v3, s8
; GFX6789-NEXT: v_mov_b32_e32 v4, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[3:4], v2, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_tfe_V4_dmask2:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v2, 0
; NOPRT-NEXT: image_load v[0:2], v0, s[0:7] dmask:0x6 unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v3, s8
; NOPRT-NEXT: v_mov_b32_e32 v4, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[3:4], v2, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_tfe_V4_dmask2:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v3, v0 ; encoding: [0x00,0x03,0x06,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v4, s9 ; encoding: [0x09,0x02,0x08,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: image_load v[0:2], v3, s[0:7] dmask:0x6 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x16,0x01,0xf0,0x03,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v3, s8 ; encoding: [0x08,0x02,0x06,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[3:4], v2, off ; encoding: [0x00,0x80,0x70,0xdc,0x03,0x02,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 6, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
  %v.vec = extractvalue {<4 x float>, i32} %v, 0
  %v.err = extractvalue {<4 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <4 x float> %v.vec
}

define amdgpu_ps <4 x float> @load_1d_tfe_V4_dmask1(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_tfe_V4_dmask1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v1, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_tfe_V4_dmask1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v1, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_tfe_V4_dmask1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v2, s8
; GFX6789-NEXT: v_mov_b32_e32 v3, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[2:3], v1, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_tfe_V4_dmask1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v1, 0
; NOPRT-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x8 unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v2, s8
; NOPRT-NEXT: v_mov_b32_e32 v3, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[2:3], v1, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_tfe_V4_dmask1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, s9 ; encoding: [0x09,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x18,0x01,0xf0,0x02,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v2, s8 ; encoding: [0x08,0x02,0x04,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[2:3], v1, off ; encoding: [0x00,0x80,0x70,0xdc,0x02,0x01,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 8, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
  %v.vec = extractvalue {<4 x float>, i32} %v, 0
  %v.err = extractvalue {<4 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <4 x float> %v.vec
}

define amdgpu_ps <2 x float> @load_1d_tfe_V2_dmask1(<8 x i32> inreg %rsrc, i32 addrspace(1)* inreg %out, i32 %s) {
; VERDE-LABEL: load_1d_tfe_V2_dmask1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: v_mov_b32_e32 v2, v0
; VERDE-NEXT: v_mov_b32_e32 v0, 0
; VERDE-NEXT: v_mov_b32_e32 v1, v0
; VERDE-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; VERDE-NEXT: s_mov_b32 s11, 0xf000
; VERDE-NEXT: s_mov_b32 s10, -1
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: buffer_store_dword v1, off, s[8:11], 0
; VERDE-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_tfe_V2_dmask1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: v_mov_b32_e32 v2, v0
; FIJI-NEXT: v_mov_b32_e32 v0, 0
; FIJI-NEXT: v_mov_b32_e32 v1, v0
; FIJI-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; FIJI-NEXT: s_mov_b32 s11, 0xf000
; FIJI-NEXT: s_mov_b32 s10, -1
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: buffer_store_dword v1, off, s[8:11], 0
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_tfe_V2_dmask1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: v_mov_b32_e32 v2, v0
; GFX6789-NEXT: v_mov_b32_e32 v0, 0
; GFX6789-NEXT: v_mov_b32_e32 v1, v0
; GFX6789-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 unorm tfe
; GFX6789-NEXT: v_mov_b32_e32 v2, s8
; GFX6789-NEXT: v_mov_b32_e32 v3, s9
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: global_store_dword v[2:3], v1, off
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_tfe_V2_dmask1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: v_mov_b32_e32 v1, 0
; NOPRT-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x8 unorm tfe
; NOPRT-NEXT: v_mov_b32_e32 v2, s8
; NOPRT-NEXT: v_mov_b32_e32 v3, s9
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: global_store_dword v[2:3], v1, off
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_tfe_V2_dmask1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: v_mov_b32_e32 v2, v0 ; encoding: [0x00,0x03,0x04,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v0, 0 ; encoding: [0x80,0x02,0x00,0x7e]
; GFX10-NEXT: v_mov_b32_e32 v3, s9 ; encoding: [0x09,0x02,0x06,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: v_mov_b32_e32 v1, v0 ; encoding: [0x00,0x03,0x02,0x7e]
; GFX10-NEXT: image_load v[0:1], v2, s[0:7] dmask:0x8 dim:SQ_RSRC_IMG_1D unorm tfe ; encoding: [0x00,0x18,0x01,0xf0,0x02,0x00,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v2, s8 ; encoding: [0x08,0x02,0x04,0x7e]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: global_store_dword v[2:3], v1, off ; encoding: [0x00,0x80,0x70,0xdc,0x02,0x01,0x7d,0x00]
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0 ; encoding: [0x00,0x00,0xfd,0xbb]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call {<2 x float>,i32} @llvm.amdgcn.image.load.1d.v2f32i32.i32(i32 8, i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
  %v.vec = extractvalue {<2 x float>, i32} %v, 0
  %v.err = extractvalue {<2 x float>, i32} %v, 1
  store i32 %v.err, i32 addrspace(1)* %out, align 4
  ret <2 x float> %v.vec
}
define amdgpu_ps <4 x float> @load_mip_3d(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %r, i32 %mip) {
; VERDE-LABEL: load_mip_3d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_3d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_3d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_3d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_3d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm ; encoding: [0x10,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.mip.3d.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %r, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_mip_cube(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %slice, i32 %mip) {
; VERDE-LABEL: load_mip_cube:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_cube:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_cube:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_cube:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_cube:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm ; encoding: [0x18,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.mip.cube.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_mip_1darray(<8 x i32> inreg %rsrc, i32 %s, i32 %slice, i32 %mip) {
; VERDE-LABEL: load_mip_1darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_1darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_1darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_1darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_1darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:2], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm ; encoding: [0x20,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
main_body:
|
|
|
|
%v = call <4 x float> @llvm.amdgcn.image.load.mip.1darray.v4f32.i32(i32 15, i32 %s, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
|
|
|
|
ret <4 x float> %v
|
|
|
|
}
|
|
|
|
|
|
|
|
define amdgpu_ps <4 x float> @load_mip_2darray(<8 x i32> inreg %rsrc, i32 %s, i32 %t, i32 %slice, i32 %mip) {
; VERDE-LABEL: load_mip_2darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_mip_2darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_mip_2darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_mip_2darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_mip_2darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load_mip v[0:3], v[0:3], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm ; encoding: [0x28,0x1f,0x04,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.mip.2darray.v4f32.i32(i32 15, i32 %s, i32 %t, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps void @store_1d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s) {
; VERDE-LABEL: store_1d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_2d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t) {
; VERDE-LABEL: store_2d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_2d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_2d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_2d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_2d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.2d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_3d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %r) {
; VERDE-LABEL: store_3d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_3d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_3d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_3d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_3d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm ; encoding: [0x10,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.3d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %r, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_cube(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %slice) {
; VERDE-LABEL: store_cube:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_cube:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_cube:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_cube:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_cube:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm ; encoding: [0x18,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.cube.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_1darray(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %slice) {
; VERDE-LABEL: store_1darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:5], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm ; encoding: [0x20,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1darray.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_2darray(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %slice) {
; VERDE-LABEL: store_2darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_2darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_2darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_2darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_2darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm ; encoding: [0x28,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.2darray.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %slice, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_2dmsaa(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %fragid) {
; VERDE-LABEL: store_2dmsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_2dmsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_2dmsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_2dmsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_2dmsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA unorm ; encoding: [0x30,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.2dmsaa.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_2darraymsaa(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %slice, i32 %fragid) {
; VERDE-LABEL: store_2darraymsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_2darraymsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_2darraymsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_2darraymsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_2darraymsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v[4:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm ; encoding: [0x38,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.2darraymsaa.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %slice, i32 %fragid, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_1d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %mip) {
; VERDE-LABEL: store_mip_1d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_1d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_1d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_1d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:5], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_1d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:5], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.1d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_2d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %mip) {
; VERDE-LABEL: store_mip_2d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_2d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_2d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_2d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_2d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.2d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_3d(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %r, i32 %mip) {
; VERDE-LABEL: store_mip_3d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_3d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_3d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_3d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_3d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm ; encoding: [0x10,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.3d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %r, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_cube(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %slice, i32 %mip) {
; VERDE-LABEL: store_mip_cube:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_cube:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_cube:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_cube:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_cube:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm ; encoding: [0x18,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.cube.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_1darray(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %slice, i32 %mip) {
; VERDE-LABEL: store_mip_1darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_1darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_1darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_1darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_1darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:6], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm ; encoding: [0x20,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.1darray.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_mip_2darray(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s, i32 %t, i32 %slice, i32 %mip) {
; VERDE-LABEL: store_mip_2darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_mip_2darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_mip_2darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_mip_2darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_mip_2darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store_mip v[0:3], v[4:7], s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm ; encoding: [0x28,0x1f,0x24,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.mip.2darray.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, i32 %t, i32 %slice, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps <4 x float> @getresinfo_1d(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_1d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_1d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_1d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_1d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_1d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.1d.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_2d(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_2d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_2d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_2d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_2d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_2d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.2d.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_3d(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_3d:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_3d:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_3d:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_3d:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_3d:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_3D unorm ; encoding: [0x10,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.3d.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_cube(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_cube:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_cube:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_cube:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_cube:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_cube:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_CUBE unorm ; encoding: [0x18,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.cube.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_1darray(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_1darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_1darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_1darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_1darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_1darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D_ARRAY unorm ; encoding: [0x20,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.1darray.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_2darray(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_2darray:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_2darray:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_2darray:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_2darray:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_2darray:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_ARRAY unorm ; encoding: [0x28,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.2darray.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_2dmsaa(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_2dmsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_2dmsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_2dmsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_2dmsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_2dmsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA unorm ; encoding: [0x30,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.2dmsaa.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @getresinfo_2darraymsaa(<8 x i32> inreg %rsrc, i32 %mip) {
; VERDE-LABEL: getresinfo_2darraymsaa:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_2darraymsaa:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_2darraymsaa:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_2darraymsaa:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf unorm da
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_2darraymsaa:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_2D_MSAA_ARRAY unorm ; encoding: [0x38,0x1f,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.getresinfo.2darraymsaa.v4f32.i32(i32 15, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %v
}

define amdgpu_ps float @load_1d_V1(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_V1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v0, v0, s[0:7] dmask:0x8 unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_V1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v0, v0, s[0:7] dmask:0x8 unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_V1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v0, v0, s[0:7] dmask:0x8 unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_V1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v0, v0, s[0:7] dmask:0x8 unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_V1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v0, v0, s[0:7] dmask:0x8 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x18,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
main_body:
|
|
|
|
%v = call float @llvm.amdgcn.image.load.1d.f32.i32(i32 8, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
|
|
|
|
ret float %v
|
|
|
|
}

define amdgpu_ps <2 x float> @load_1d_V2(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_V2:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x9 unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_V2:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x9 unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_V2:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x9 unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_V2:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x9 unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_V2:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:1], v0, s[0:7] dmask:0x9 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x19,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <2 x float> @llvm.amdgcn.image.load.1d.v2f32.i32(i32 9, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
  ret <2 x float> %v
}

define amdgpu_ps void @store_1d_V1(<8 x i32> inreg %rsrc, float %vdata, i32 %s) {
; VERDE-LABEL: store_1d_V1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v0, v1, s[0:7] dmask:0x2 unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d_V1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v0, v1, s[0:7] dmask:0x2 unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d_V1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v0, v1, s[0:7] dmask:0x2 unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d_V1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v0, v1, s[0:7] dmask:0x2 unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d_V1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v0, v1, s[0:7] dmask:0x2 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x12,0x20,0xf0,0x01,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.f32.i32(float %vdata, i32 2, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}

define amdgpu_ps void @store_1d_V2(<8 x i32> inreg %rsrc, <2 x float> %vdata, i32 %s) {
; VERDE-LABEL: store_1d_V2:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:1], v2, s[0:7] dmask:0xc unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d_V2:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:1], v2, s[0:7] dmask:0xc unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d_V2:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:1], v2, s[0:7] dmask:0xc unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d_V2:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:1], v2, s[0:7] dmask:0xc unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d_V2:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:1], v2, s[0:7] dmask:0xc dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1c,0x20,0xf0,0x02,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v2f32.i32(<2 x float> %vdata, i32 12, i32 %s, <8 x i32> %rsrc, i32 0, i32 0)
  ret void
}
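
; The last operand of the image load/store intrinsics is the cache policy bit
; field: bit 0 requests glc and bit 1 requests slc. The tests below pass 1, 2
; and 3 and expect "glc", "slc" and "glc slc", respectively.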

define amdgpu_ps <4 x float> @load_1d_glc(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_glc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_glc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_glc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_glc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_glc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm glc ; encoding: [0x00,0x3f,0x00,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 1)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_1d_slc(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_slc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm slc
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_slc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm slc
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_slc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm slc
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_slc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm slc
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_slc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm slc ; encoding: [0x00,0x1f,0x00,0xf2,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 2)
  ret <4 x float> %v
}

define amdgpu_ps <4 x float> @load_1d_glc_slc(<8 x i32> inreg %rsrc, i32 %s) {
; VERDE-LABEL: load_1d_glc_slc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc slc
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: load_1d_glc_slc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc slc
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: load_1d_glc_slc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc slc
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: load_1d_glc_slc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf unorm glc slc
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: load_1d_glc_slc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_load v[0:3], v0, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm glc slc ; encoding: [0x00,0x3f,0x00,0xf2,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %v = call <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 3)
  ret <4 x float> %v
}

define amdgpu_ps void @store_1d_glc(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s) {
; VERDE-LABEL: store_1d_glc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d_glc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d_glc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d_glc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d_glc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm glc ; encoding: [0x00,0x3f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 1)
  ret void
}

define amdgpu_ps void @store_1d_slc(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s) {
; VERDE-LABEL: store_1d_slc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm slc
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d_slc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm slc
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d_slc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm slc
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d_slc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm slc
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d_slc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm slc ; encoding: [0x00,0x1f,0x20,0xf2,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 2)
  ret void
}

define amdgpu_ps void @store_1d_glc_slc(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %s) {
; VERDE-LABEL: store_1d_glc_slc:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc slc
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: store_1d_glc_slc:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc slc
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: store_1d_glc_slc:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc slc
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: store_1d_glc_slc:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm glc slc
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: store_1d_glc_slc:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm glc slc ; encoding: [0x00,0x3f,0x20,0xf2,0x04,0x00,0x00,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %vdata, i32 15, i32 %s, <8 x i32> %rsrc, i32 0, i32 3)
  ret void
}
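
; image_get_resinfo honors dmask in the same way: only the selected result
; dwords are written, so dmask 0x7, 0x3 and 0x1 return three, two and one
; dword in v[0:2], v[0:1] and v0, respectively.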

define amdgpu_ps <3 x float> @getresinfo_dmask7(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %mip) {
; GFX6-LABEL: getresinfo_dmask7:
; GFX6: ; %bb.0: ; %main_body
; GFX6-NEXT: s_mov_b32 s0, s2
; GFX6-NEXT: s_mov_b32 s1, s3
; GFX6-NEXT: s_mov_b32 s2, s4
; GFX6-NEXT: s_mov_b32 s3, s5
; GFX6-NEXT: s_mov_b32 s4, s6
; GFX6-NEXT: s_mov_b32 s5, s7
; GFX6-NEXT: s_mov_b32 s6, s8
; GFX6-NEXT: s_mov_b32 s7, s9
; GFX6-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; GFX6-NEXT: s_waitcnt vmcnt(0)
; GFX6-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: getresinfo_dmask7:
; GFX8: ; %bb.0: ; %main_body
; GFX8-NEXT: s_mov_b32 s0, s2
; GFX8-NEXT: s_mov_b32 s1, s3
; GFX8-NEXT: s_mov_b32 s2, s4
; GFX8-NEXT: s_mov_b32 s3, s5
; GFX8-NEXT: s_mov_b32 s4, s6
; GFX8-NEXT: s_mov_b32 s5, s7
; GFX8-NEXT: s_mov_b32 s6, s8
; GFX8-NEXT: s_mov_b32 s7, s9
; GFX8-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: ; return to shader part epilog
;
; VERDE-LABEL: getresinfo_dmask7:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_dmask7:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_dmask7:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_dmask7:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_dmask7:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:2], v0, s[0:7] dmask:0x7 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x17,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %r = call <3 x float> @llvm.amdgcn.image.getresinfo.1d.v3f32.i32(i32 7, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <3 x float> %r
}

define amdgpu_ps <2 x float> @getresinfo_dmask3(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %mip) {
; GFX6-LABEL: getresinfo_dmask3:
; GFX6: ; %bb.0: ; %main_body
; GFX6-NEXT: s_mov_b32 s0, s2
; GFX6-NEXT: s_mov_b32 s1, s3
; GFX6-NEXT: s_mov_b32 s2, s4
; GFX6-NEXT: s_mov_b32 s3, s5
; GFX6-NEXT: s_mov_b32 s4, s6
; GFX6-NEXT: s_mov_b32 s5, s7
; GFX6-NEXT: s_mov_b32 s6, s8
; GFX6-NEXT: s_mov_b32 s7, s9
; GFX6-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; GFX6-NEXT: s_waitcnt vmcnt(0)
; GFX6-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: getresinfo_dmask3:
; GFX8: ; %bb.0: ; %main_body
; GFX8-NEXT: s_mov_b32 s0, s2
; GFX8-NEXT: s_mov_b32 s1, s3
; GFX8-NEXT: s_mov_b32 s2, s4
; GFX8-NEXT: s_mov_b32 s3, s5
; GFX8-NEXT: s_mov_b32 s4, s6
; GFX8-NEXT: s_mov_b32 s5, s7
; GFX8-NEXT: s_mov_b32 s6, s8
; GFX8-NEXT: s_mov_b32 s7, s9
; GFX8-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: ; return to shader part epilog
;
; VERDE-LABEL: getresinfo_dmask3:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_dmask3:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_dmask3:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_dmask3:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_dmask3:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v[0:1], v0, s[0:7] dmask:0x3 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x13,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %r = call <2 x float> @llvm.amdgcn.image.getresinfo.1d.v2f32.i32(i32 3, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <2 x float> %r
}

define amdgpu_ps float @getresinfo_dmask1(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %mip) {
; GFX6-LABEL: getresinfo_dmask1:
; GFX6: ; %bb.0: ; %main_body
; GFX6-NEXT: s_mov_b32 s0, s2
; GFX6-NEXT: s_mov_b32 s1, s3
; GFX6-NEXT: s_mov_b32 s2, s4
; GFX6-NEXT: s_mov_b32 s3, s5
; GFX6-NEXT: s_mov_b32 s4, s6
; GFX6-NEXT: s_mov_b32 s5, s7
; GFX6-NEXT: s_mov_b32 s6, s8
; GFX6-NEXT: s_mov_b32 s7, s9
; GFX6-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; GFX6-NEXT: s_waitcnt vmcnt(0)
; GFX6-NEXT: ; return to shader part epilog
;
; GFX8-LABEL: getresinfo_dmask1:
; GFX8: ; %bb.0: ; %main_body
; GFX8-NEXT: s_mov_b32 s0, s2
; GFX8-NEXT: s_mov_b32 s1, s3
; GFX8-NEXT: s_mov_b32 s2, s4
; GFX8-NEXT: s_mov_b32 s3, s5
; GFX8-NEXT: s_mov_b32 s4, s6
; GFX8-NEXT: s_mov_b32 s5, s7
; GFX8-NEXT: s_mov_b32 s6, s8
; GFX8-NEXT: s_mov_b32 s7, s9
; GFX8-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; GFX8-NEXT: s_waitcnt vmcnt(0)
; GFX8-NEXT: ; return to shader part epilog
;
; VERDE-LABEL: getresinfo_dmask1:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_dmask1:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_dmask1:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_dmask1:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_dmask1:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_get_resinfo v0, v0, s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x11,0x38,0xf0,0x00,0x00,0x00,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %r = call float @llvm.amdgcn.image.getresinfo.1d.f32.i32(i32 1, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret float %r
}
|
|
|
|
|
define amdgpu_ps <4 x float> @getresinfo_dmask0(<8 x i32> inreg %rsrc, <4 x float> %vdata, i32 %mip) #0 {
; VERDE-LABEL: getresinfo_dmask0:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: getresinfo_dmask0:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: getresinfo_dmask0:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: getresinfo_dmask0:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: getresinfo_dmask0:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: ; return to shader part epilog
main_body:
  %r = call <4 x float> @llvm.amdgcn.image.getresinfo.1d.v4f32.i32(i32 0, i32 %mip, <8 x i32> %rsrc, i32 0, i32 0)
  ret <4 x float> %r
}

;
define amdgpu_ps void @image_store_wait(<8 x i32> inreg %arg, <8 x i32> inreg %arg1, <8 x i32> inreg %arg2, <4 x float> %arg3, i32 %arg4) #0 {
; VERDE-LABEL: image_store_wait:
; VERDE: ; %bb.0: ; %main_body
; VERDE-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt expcnt(0)
; VERDE-NEXT: image_load v[0:3], v4, s[8:15] dmask:0xf unorm
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
; VERDE-NEXT: s_endpgm
;
; FIJI-LABEL: image_store_wait:
; FIJI: ; %bb.0: ; %main_body
; FIJI-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; FIJI-NEXT: image_load v[0:3], v4, s[8:15] dmask:0xf unorm
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
; FIJI-NEXT: s_endpgm
;
; GFX6789-LABEL: image_store_wait:
; GFX6789: ; %bb.0: ; %main_body
; GFX6789-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; GFX6789-NEXT: image_load v[0:3], v4, s[8:15] dmask:0xf unorm
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
; GFX6789-NEXT: s_endpgm
;
; NOPRT-LABEL: image_store_wait:
; NOPRT: ; %bb.0: ; %main_body
; NOPRT-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf unorm
; NOPRT-NEXT: image_load v[0:3], v4, s[8:15] dmask:0xf unorm
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: image_store v[0:3], v4, s[16:23] dmask:0xf unorm
; NOPRT-NEXT: s_endpgm
;
; GFX10-LABEL: image_store_wait:
; GFX10: ; %bb.0: ; %main_body
; GFX10-NEXT: image_store v[0:3], v4, s[0:7] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x20,0xf0,0x04,0x00,0x00,0x00]
; GFX10-NEXT: image_load v[0:3], v4, s[8:15] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x00,0xf0,0x04,0x00,0x02,0x00]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: image_store v[0:3], v4, s[16:23] dmask:0xf dim:SQ_RSRC_IMG_1D unorm ; encoding: [0x00,0x1f,0x20,0xf0,0x04,0x00,0x04,0x00]
; GFX10-NEXT: s_endpgm ; encoding: [0x00,0x00,0x81,0xbf]
main_body:
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %arg3, i32 15, i32 %arg4, <8 x i32> %arg, i32 0, i32 0)
  %data = call <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32 15, i32 %arg4, <8 x i32> %arg1, i32 0, i32 0)
  call void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float> %data, i32 15, i32 %arg4, <8 x i32> %arg2, i32 0, i32 0)
  ret void
}

define amdgpu_ps float @image_load_mmo(<8 x i32> inreg %rsrc, float addrspace(3)* %lds, <2 x i32> %c) #0 {
; VERDE-LABEL: image_load_mmo:
; VERDE: ; %bb.0:
; VERDE-NEXT: image_load v1, v[1:2], s[0:7] dmask:0x1 unorm
; VERDE-NEXT: v_mov_b32_e32 v3, 0
; VERDE-NEXT: s_mov_b32 m0, -1
; VERDE-NEXT: ds_write_b32 v0, v3
; VERDE-NEXT: v_add_i32_e32 v0, vcc, 16, v0
; VERDE-NEXT: ds_write_b32 v0, v3
; VERDE-NEXT: s_waitcnt vmcnt(0)
; VERDE-NEXT: v_mov_b32_e32 v0, v1
; VERDE-NEXT: s_waitcnt lgkmcnt(0)
; VERDE-NEXT: ; return to shader part epilog
;
; FIJI-LABEL: image_load_mmo:
; FIJI: ; %bb.0:
; FIJI-NEXT: image_load v1, v[1:2], s[0:7] dmask:0x1 unorm
; FIJI-NEXT: v_mov_b32_e32 v3, 0
; FIJI-NEXT: s_mov_b32 m0, -1
; FIJI-NEXT: ds_write2_b32 v0, v3, v3 offset1:4
; FIJI-NEXT: s_waitcnt vmcnt(0)
; FIJI-NEXT: v_mov_b32_e32 v0, v1
; FIJI-NEXT: s_waitcnt lgkmcnt(0)
; FIJI-NEXT: ; return to shader part epilog
;
; GFX6789-LABEL: image_load_mmo:
; GFX6789: ; %bb.0:
; GFX6789-NEXT: image_load v1, v[1:2], s[0:7] dmask:0x1 unorm
; GFX6789-NEXT: v_mov_b32_e32 v3, 0
; GFX6789-NEXT: ds_write2_b32 v0, v3, v3 offset1:4
; GFX6789-NEXT: s_waitcnt vmcnt(0)
; GFX6789-NEXT: v_mov_b32_e32 v0, v1
; GFX6789-NEXT: s_waitcnt lgkmcnt(0)
; GFX6789-NEXT: ; return to shader part epilog
;
; NOPRT-LABEL: image_load_mmo:
; NOPRT: ; %bb.0:
; NOPRT-NEXT: image_load v1, v[1:2], s[0:7] dmask:0x1 unorm
; NOPRT-NEXT: v_mov_b32_e32 v3, 0
; NOPRT-NEXT: ds_write2_b32 v0, v3, v3 offset1:4
; NOPRT-NEXT: s_waitcnt vmcnt(0)
; NOPRT-NEXT: v_mov_b32_e32 v0, v1
; NOPRT-NEXT: s_waitcnt lgkmcnt(0)
; NOPRT-NEXT: ; return to shader part epilog
;
; GFX10-LABEL: image_load_mmo:
; GFX10: ; %bb.0:
; GFX10-NEXT: image_load v1, v[1:2], s[0:7] dmask:0x1 dim:SQ_RSRC_IMG_2D unorm ; encoding: [0x08,0x11,0x00,0xf0,0x01,0x01,0x00,0x00]
; GFX10-NEXT: v_mov_b32_e32 v2, 0 ; encoding: [0x80,0x02,0x04,0x7e]
; GFX10-NEXT: ; implicit-def: $vcc_hi
; GFX10-NEXT: ds_write2_b32 v0, v2, v2 offset1:4 ; encoding: [0x00,0x04,0x38,0xd8,0x00,0x02,0x02,0x00]
; GFX10-NEXT: s_waitcnt vmcnt(0) ; encoding: [0x70,0x3f,0x8c,0xbf]
; GFX10-NEXT: v_mov_b32_e32 v0, v1 ; encoding: [0x01,0x03,0x00,0x7e]
; GFX10-NEXT: s_waitcnt lgkmcnt(0) ; encoding: [0x7f,0xc0,0x8c,0xbf]
; GFX10-NEXT: ; return to shader part epilog
  store float 0.000000e+00, float addrspace(3)* %lds
  %c0 = extractelement <2 x i32> %c, i32 0
  %c1 = extractelement <2 x i32> %c, i32 1
  %tex = call float @llvm.amdgcn.image.load.2d.f32.i32(i32 1, i32 %c0, i32 %c1, <8 x i32> %rsrc, i32 0, i32 0)
  %tmp2 = getelementptr float, float addrspace(3)* %lds, i32 4
  store float 0.000000e+00, float addrspace(3)* %tmp2
  ret float %tex
}

AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
declare <4 x float> @llvm.amdgcn.image.load.1d.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #1
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
|
|
|
declare {float,i32} @llvm.amdgcn.image.load.1d.f32i32.i32(i32, i32, <8 x i32>, i32, i32) #1
|
|
|
|
declare {<2 x float>,i32} @llvm.amdgcn.image.load.1d.v2f32i32.i32(i32, i32, <8 x i32>, i32, i32) #1
|
|
|
|
declare {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32, i32, <8 x i32>, i32, i32) #1
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
declare <4 x float> @llvm.amdgcn.image.load.2d.v4f32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
|
[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try
TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.
This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.
This change takes roughly 6 parts:
1. Modify the instruction defs in tablegen to add new instruction variants that
can accomodate the extra return values.
2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE
(where the bulk of the work for these instruction types is now done)
3. Extra verification code to catch cases where intrinsics have been used but
insufficient return registers are used.
4. Modification to the adjustWritemask optimisation to account for TFE/LWE being
enabled (requires extra registers to be maintained for error return value).
5. An extra pass to zero initialize the error value return - this is because if
the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.
6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO
for this to re-enable and handle correctly).
There's an additional fix now to avoid a dmask=0
For an image intrinsic with tfe where all result channels except tfe
were unused, I was getting an image instruction with dmask=0 and only a
single vgpr result for tfe. That is incorrect because the hardware
assumes there is at least one vgpr result, plus the one for tfe.
Fixed by forcing dmask to 1, which gives the desired two vgpr result
with tfe in the second one.
The TFE or LWE result is returned from the intrinsics using an aggregate
type. Look in the test code provided to see how this works, but in essence IR
code to invoke the intrinsic looks as follows:
%v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15,
i32 %s, <8 x i32> %rsrc, i32 1, i32 0)
%v.vec = extractvalue {<4 x float>, i32} %v, 0
%v.err = extractvalue {<4 x float>, i32} %v, 1
This re-submit of the change also includes a slight modification in
SIISelLowering.cpp to work-around a compiler bug for the powerpc_le
platform that caused a buildbot failure on a previous submission.
Differential revision: https://reviews.llvm.org/D48826
Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda
Work around for ppcle compiler bug
Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b
llvm-svn: 351054
2019-01-14 19:55:24 +08:00
|
|
|
declare {<4 x float>,i32} @llvm.amdgcn.image.load.2d.v4f32i32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
|
AMDGPU: Dimension-aware image intrinsics
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
2018-04-04 18:58:54 +08:00
|
|
|
declare <4 x float> @llvm.amdgcn.image.load.3d.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.3d.v4f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.cube.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.cube.v4f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.1darray.v4f32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.1darray.v4f32i32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.2darray.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.2darray.v4f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.2dmsaa.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.2dmsaa.v4f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.2darraymsaa.v4f32.i32(i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.2darraymsaa.v4f32i32.i32(i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.1d.v4f32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.mip.1d.v4f32i32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<4 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v4f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {<2 x float>,i32} @llvm.amdgcn.image.load.mip.2d.v2f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare {float,i32} @llvm.amdgcn.image.load.mip.2d.f32i32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.3d.v4f32.i32(i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.cube.v4f32.i32(i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.1darray.v4f32.i32(i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare <4 x float> @llvm.amdgcn.image.load.mip.2darray.v4f32.i32(i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #1
declare void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float>, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.2d.v4f32.i32(<4 x float>, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.3d.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.cube.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.1darray.v4f32.i32(<4 x float>, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.2darray.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.2dmsaa.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.2darraymsaa.v4f32.i32(<4 x float>, i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.1d.v4f32.i32(<4 x float>, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.2d.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.3d.v4f32.i32(<4 x float>, i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.cube.v4f32.i32(<4 x float>, i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.1darray.v4f32.i32(<4 x float>, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.mip.2darray.v4f32.i32(<4 x float>, i32, i32, i32, i32, i32, <8 x i32>, i32, i32) #0
declare <4 x float> @llvm.amdgcn.image.getresinfo.1d.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <3 x float> @llvm.amdgcn.image.getresinfo.1d.v3f32.i32(i32 immarg, i32, <8 x i32>, i32 immarg, i32 immarg) #1
declare <2 x float> @llvm.amdgcn.image.getresinfo.1d.v2f32.i32(i32 immarg, i32, <8 x i32>, i32 immarg, i32 immarg) #1
declare float @llvm.amdgcn.image.getresinfo.1d.f32.i32(i32 immarg, i32, <8 x i32>, i32 immarg, i32 immarg) #1
declare <4 x float> @llvm.amdgcn.image.getresinfo.2d.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.3d.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.cube.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.1darray.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.2darray.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.2dmsaa.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2
declare <4 x float> @llvm.amdgcn.image.getresinfo.2darraymsaa.v4f32.i32(i32, i32, <8 x i32>, i32, i32) #2

declare float @llvm.amdgcn.image.load.1d.f32.i32(i32, i32, <8 x i32>, i32, i32) #1
declare float @llvm.amdgcn.image.load.2d.f32.i32(i32, i32, i32, <8 x i32>, i32, i32) #1
declare <2 x float> @llvm.amdgcn.image.load.1d.v2f32.i32(i32, i32, <8 x i32>, i32, i32) #1
declare void @llvm.amdgcn.image.store.1d.f32.i32(float, i32, i32, <8 x i32>, i32, i32) #0
declare void @llvm.amdgcn.image.store.1d.v2f32.i32(<2 x float>, i32, i32, <8 x i32>, i32, i32) #0

attributes #0 = { nounwind }
attributes #1 = { nounwind readonly }
attributes #2 = { nounwind readnone }