llvm-project/llvm/test/CodeGen/AMDGPU/token-factor-inline-limit-t...

; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GCN-TFILD %s
; RUN: llc -march=amdgcn -mcpu=gfx900 -combiner-tokenfactor-inline-limit=7 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GCN-TFIL7 %s


; GCN-LABEL: {{^}}token_factor_inline_limit_test:

; GCN-TFILD: v_mov_b32_e32 [[REG8:v[0-9]+]], 8
; GCN-TFILD: buffer_store_dword [[REG8]], {{.*$}}
; GCN-TFILD: v_mov_b32_e32 [[REG9:v[0-9]+]], 9
; GCN-TFILD: buffer_store_dword [[REG9]], {{.*}} offset:4
; GCN-TFILD: v_mov_b32_e32 [[REG10:v[0-9]+]], 10
; GCN-TFILD: buffer_store_dword [[REG10]], {{.*}} offset:8
; GCN-TFILD: v_mov_b32_e32 [[REG11:v[0-9]+]], 11
; GCN-TFILD: buffer_store_dword [[REG11]], {{.*}} offset:12
; GCN-TFILD: v_mov_b32_e32 [[REG12:v[0-9]+]], 12
; GCN-TFILD: buffer_store_dword [[REG12]], {{.*}} offset:16
; GCN-TFILD: v_mov_b32_e32 [[REG13:v[0-9]+]], 13
; GCN-TFILD: buffer_store_dword [[REG13]], {{.*}} offset:20
; GCN-TFILD: v_mov_b32_e32 [[REG14:v[0-9]+]], 14
; GCN-TFILD: buffer_store_dword [[REG14]], {{.*}} offset:24
; GCN-TFILD: v_mov_b32_e32 [[REG15:v[0-9]+]], 15
; GCN-TFILD: buffer_store_dword [[REG15]], {{.*}} offset:28

; GCN-TFIL7: v_mov_b32_e32 [[REG15:v[0-9]+]], 15
; GCN-TFIL7: buffer_store_dword [[REG15]], {{.*}} offset:28
; GCN-TFIL7: v_mov_b32_e32 [[REG14:v[0-9]+]], 14
; GCN-TFIL7: buffer_store_dword [[REG14]], {{.*}} offset:24
; GCN-TFIL7: v_mov_b32_e32 [[REG13:v[0-9]+]], 13
; GCN-TFIL7: buffer_store_dword [[REG13]], {{.*}} offset:20
; GCN-TFIL7: v_mov_b32_e32 [[REG12:v[0-9]+]], 12
; GCN-TFIL7: buffer_store_dword [[REG12]], {{.*}} offset:16
; GCN-TFIL7: v_mov_b32_e32 [[REG11:v[0-9]+]], 11
; GCN-TFIL7: buffer_store_dword [[REG11]], {{.*}} offset:12
; GCN-TFIL7: v_mov_b32_e32 [[REG10:v[0-9]+]], 10
; GCN-TFIL7: buffer_store_dword [[REG10]], {{.*}} offset:8
; GCN-TFIL7: v_mov_b32_e32 [[REG9:v[0-9]+]], 9
; GCN-TFIL7: buffer_store_dword [[REG9]], {{.*}} offset:4
; GCN-TFIL7: v_mov_b32_e32 [[REG8:v[0-9]+]], 8
; GCN-TFIL7: buffer_store_dword [[REG8]], {{.*$}}

; GCN: v_mov_b32_e32 v31, 7
; GCN: s_getpc
define void @token_factor_inline_limit_test() {
entry:
  call void @external_void_func_8xv5i32(
      <5 x i32><i32 0, i32 0, i32 0, i32 0, i32 0>,
      <5 x i32><i32 1, i32 1, i32 1, i32 1, i32 1>,
      <5 x i32><i32 2, i32 2, i32 2, i32 2, i32 2>,
      <5 x i32><i32 3, i32 3, i32 3, i32 3, i32 3>,
      <5 x i32><i32 4, i32 4, i32 4, i32 4, i32 4>,
      <5 x i32><i32 5, i32 5, i32 5, i32 5, i32 5>,
      <5 x i32><i32 6, i32 7, i32 8, i32 9, i32 10>,
      <5 x i32><i32 11, i32 12, i32 13, i32 14, i32 15>)
  ret void
}

declare hidden void @external_void_func_8xv5i32(<5 x i32>, <5 x i32>, <5 x i32>, <5 x i32>,
                                                <5 x i32>, <5 x i32>, <5 x i32>, <5 x i32>)
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN-TFILD %s`
			`; RUN: llc -march=amdgcn -mcpu=gfx900 -combiner-tokenfactor-inline-limit=7 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GCN-TFIL7 %s`


			`; GCN-LABEL: {{^}}token_factor_inline_limit_test:`

			`; GCN-TFILD: v_mov_b32_e32 [[REG8:v[0-9]+]], 8`
			`; GCN-TFILD: buffer_store_dword [[REG8]], {{.*$}}`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG9:v[0-9]+]], 9`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG9]], {{.*}} offset:4`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG10:v[0-9]+]], 10`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG10]], {{.*}} offset:8`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG11:v[0-9]+]], 11`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG11]], {{.*}} offset:12`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG12:v[0-9]+]], 12`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG12]], {{.*}} offset:16`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG13:v[0-9]+]], 13`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG13]], {{.*}} offset:20`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG14:v[0-9]+]], 14`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG14]], {{.*}} offset:24`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFILD: v_mov_b32_e32 [[REG15:v[0-9]+]], 15`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFILD: buffer_store_dword [[REG15]], {{.*}} offset:28`

			`; GCN-TFIL7: v_mov_b32_e32 [[REG15:v[0-9]+]], 15`
[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic Get rid of all fixmes and base heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic roughly summarizes as below: * Assume, all the mem ops instructions participating in the clustering process, loads/stores same num bytes * If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes * If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes * If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes So, we need to make sure that the new heuristic do not completey deviate away from the above one, and it properly handles both the sub-word loads and the wide loads. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D84354 2020-07-31 00:09:34 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG15]], {{.*}} offset:28`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG14:v[0-9]+]], 14`
[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic Get rid of all fixmes and base heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic roughly summarizes as below: * Assume, all the mem ops instructions participating in the clustering process, loads/stores same num bytes * If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes * If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes * If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes So, we need to make sure that the new heuristic do not completey deviate away from the above one, and it properly handles both the sub-word loads and the wide loads. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D84354 2020-07-31 00:09:34 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG14]], {{.*}} offset:24`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG13:v[0-9]+]], 13`
[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic Get rid of all fixmes and base heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic roughly summarizes as below: * Assume, all the mem ops instructions participating in the clustering process, loads/stores same num bytes * If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes * If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes * If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes So, we need to make sure that the new heuristic do not completey deviate away from the above one, and it properly handles both the sub-word loads and the wide loads. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D84354 2020-07-31 00:09:34 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG13]], {{.*}} offset:20`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG12:v[0-9]+]], 12`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG12]], {{.*}} offset:16`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG11:v[0-9]+]], 11`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG11]], {{.*}} offset:12`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG10:v[0-9]+]], 10`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG10]], {{.*}} offset:8`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG9:v[0-9]+]], 9`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG9]], {{.*}} offset:4`
[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530 2020-06-03 17:01:12 +08:00			`; GCN-TFIL7: v_mov_b32_e32 [[REG8:v[0-9]+]], 8`
DADCombiner: Don't simplify the token factor if the node's number of operands already exceeds TokenFactorInlineLimit Summary: In parallelizeChainedStores, a TokenFactor was created with the size greater than 3000. We found that DAGCombiner::visitTokenFactor will consume a huge amount of time on such nodes. Since the number of operands already exceeds TokenFactorInlineLimit, we propose to give up simplification with the consideration of compile time. Reviewers: @spatel, @arsenm Differential Revision: https://reviews.llvm.org/D84204 2020-07-26 12:20:59 +08:00			`; GCN-TFIL7: buffer_store_dword [[REG8]], {{.*$}}`

			`; GCN: v_mov_b32_e32 v31, 7`
			`; GCN: s_getpc`
			`define void @token_factor_inline_limit_test() {`
			`entry:`
			`call void @external_void_func_8xv5i32(`
			`<5 x i32><i32 0, i32 0, i32 0, i32 0, i32 0>,`
			`<5 x i32><i32 1, i32 1, i32 1, i32 1, i32 1>,`
			`<5 x i32><i32 2, i32 2, i32 2, i32 2, i32 2>,`
			`<5 x i32><i32 3, i32 3, i32 3, i32 3, i32 3>,`
			`<5 x i32><i32 4, i32 4, i32 4, i32 4, i32 4>,`
			`<5 x i32><i32 5, i32 5, i32 5, i32 5, i32 5>,`
			`<5 x i32><i32 6, i32 7, i32 8, i32 9, i32 10>,`
			`<5 x i32><i32 11, i32 12, i32 13, i32 14, i32 15>)`
			`ret void`
			`}`

			`declare hidden void @external_void_func_8xv5i32(<5 x i32>, <5 x i32>, <5 x i32>, <5 x i32>,`
			`<5 x i32>, <5 x i32>, <5 x i32>, <5 x i32>)`