; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type: a vector of
either unsigned i8 or unsigned i16 elements. In the IR, i8/i16 values are
promoted to i32 before any arithmetic operations. The following IR shows such
an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %4, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to an X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization it is much more difficult to recognize the pattern across the
many instructions that perform type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize the operand and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW, covering both i8 and i16 element
types at various vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
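As a quick illustration of the arithmetic the combine relies on (not part of the test; the function names below are made up for this sketch), the widened-IR form above computes the same value as the PAVGB-style rounding average on every unsigned 8-bit input:

```python
def pavgb(a: int, b: int) -> int:
    """One unsigned 8-bit lane of X86 PAVGB: the sum a + b + 1 is
    formed in a wider temporary, then shifted right by one."""
    assert 0 <= a <= 255 and 0 <= b <= 255
    return (a + b + 1) >> 1

def avg_ir_pattern(a: int, b: int) -> int:
    """The IR shape the combine matches: zero-extend to i32,
    add 1, add the other operand, lshr by 1, truncate to i8."""
    widened = (a + 1) + b          # promoted to i32, so no wraparound
    return (widened >> 1) & 0xFF   # trunc <N x i32> ... to <N x i8>

# The two agree on every pair of unsigned i8 inputs, which is why the
# six-instruction IR sequence can be replaced by a single AVG node.
assert all(pavgb(a, b) == avg_ir_pattern(a, b)
           for a in range(256) for b in range(256))
```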
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx512f | FileCheck %s --check-prefix=AVX512F
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx512bw | FileCheck %s --check-prefix=AVX512BW
define void @avg_v4i8(<4 x i8>* %a, <4 x i8>* %b) {
; SSE2-LABEL: avg_v4i8:
; SSE2: # BB#0:
; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE2-NEXT: pavgb %xmm0, %xmm1
; SSE2-NEXT: movd %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v4i8:
; AVX2: # BB#0:
; AVX2-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX2-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX2-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX2-NEXT: vmovd %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v4i8:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512F-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX512F-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX512F-NEXT: vmovd %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v4i8:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512BW-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX512BW-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX512BW-NEXT: vmovd %xmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <4 x i8>, <4 x i8>* %a
%2 = load <4 x i8>, <4 x i8>* %b
%3 = zext <4 x i8> %1 to <4 x i32>
%4 = zext <4 x i8> %2 to <4 x i32>
%5 = add nuw nsw <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <4 x i32> %5, %4
%7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
%8 = trunc <4 x i32> %7 to <4 x i8>
store <4 x i8> %8, <4 x i8>* undef, align 4
ret void
}
define void @avg_v8i8(<8 x i8>* %a, <8 x i8>* %b) {
; SSE2-LABEL: avg_v8i8:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT: pavgb %xmm0, %xmm1
; SSE2-NEXT: movq %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v8i8:
; AVX2: # BB#0:
; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX2-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX2-NEXT: vmovq %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v8i8:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512F-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512F-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX512F-NEXT: vmovq %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v8i8:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512BW-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512BW-NEXT: vpavgb %xmm0, %xmm1, %xmm0
; AVX512BW-NEXT: vmovq %xmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <8 x i8>, <8 x i8>* %a
%2 = load <8 x i8>, <8 x i8>* %b
%3 = zext <8 x i8> %1 to <8 x i32>
%4 = zext <8 x i8> %2 to <8 x i32>
%5 = add nuw nsw <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <8 x i32> %5, %4
%7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <8 x i32> %7 to <8 x i8>
store <8 x i8> %8, <8 x i8>* undef, align 4
ret void
}
define void @avg_v16i8(<16 x i8>* %a, <16 x i8>* %b) {
; SSE2-LABEL: avg_v16i8:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: pavgb (%rdi), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v16i8:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rsi), %xmm0
; AVX2-NEXT: vpavgb (%rdi), %xmm0, %xmm0
; AVX2-NEXT: vmovdqu %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v16i8:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rsi), %xmm0
; AVX512F-NEXT: vpavgb (%rdi), %xmm0, %xmm0
; AVX512F-NEXT: vmovdqu %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v16i8:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rsi), %xmm0
; AVX512BW-NEXT: vpavgb (%rdi), %xmm0, %xmm0
; AVX512BW-NEXT: vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <16 x i8>, <16 x i8>* %a
%2 = load <16 x i8>, <16 x i8>* %b
%3 = zext <16 x i8> %1 to <16 x i32>
%4 = zext <16 x i8> %2 to <16 x i32>
%5 = add nuw nsw <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <16 x i32> %5, %4
%7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <16 x i32> %7 to <16 x i8>
store <16 x i8> %8, <16 x i8>* undef, align 4
ret void
}
define void @avg_v32i8(<32 x i8>* %a, <32 x i8>* %b) {
; SSE2-LABEL: avg_v32i8:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm8
; SSE2-NEXT: movdqa 16(%rdi), %xmm11
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm4, %xmm4
; SSE2-NEXT: pshufd {{.*#+}} xmm10 = xmm8[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3],xmm8[4],xmm4[4],xmm8[5],xmm4[5],xmm8[6],xmm4[6],xmm8[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm8, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm10 = xmm10[0],xmm4[0],xmm10[1],xmm4[1],xmm10[2],xmm4[2],xmm10[3],xmm4[3],xmm10[4],xmm4[4],xmm10[5],xmm4[5],xmm10[6],xmm4[6],xmm10[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm10, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm4[4],xmm12[5],xmm4[5],xmm12[6],xmm4[6],xmm12[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm4[0],xmm10[1],xmm4[1],xmm10[2],xmm4[2],xmm10[3],xmm4[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm15 = xmm11[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm4[0],xmm11[1],xmm4[1],xmm11[2],xmm4[2],xmm11[3],xmm4[3],xmm11[4],xmm4[4],xmm11[5],xmm4[5],xmm11[6],xmm4[6],xmm11[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm11, %xmm14
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm4[4],xmm14[5],xmm4[5],xmm14[6],xmm4[6],xmm14[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm4[0],xmm11[1],xmm4[1],xmm11[2],xmm4[2],xmm11[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm15 = xmm15[0],xmm4[0],xmm15[1],xmm4[1],xmm15[2],xmm4[2],xmm15[3],xmm4[3],xmm15[4],xmm4[4],xmm15[5],xmm4[5],xmm15[6],xmm4[6],xmm15[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm15, %xmm9
; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm4[4],xmm9[5],xmm4[5],xmm9[6],xmm4[6],xmm9[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm4[0],xmm15[1],xmm4[1],xmm15[2],xmm4[2],xmm15[3],xmm4[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3],xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm0, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm3, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3],xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm1, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3],xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm2, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm4[4],xmm13[5],xmm4[5],xmm13[6],xmm4[6],xmm13[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT: paddd %xmm15, %xmm2
; SSE2-NEXT: paddd %xmm9, %xmm13
; SSE2-NEXT: paddd %xmm11, %xmm1
; SSE2-NEXT: paddd %xmm14, %xmm5
; SSE2-NEXT: paddd %xmm10, %xmm3
; SSE2-NEXT: paddd %xmm12, %xmm6
; SSE2-NEXT: paddd %xmm8, %xmm0
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm7 # 16-byte Folded Reload
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [1,1,1,1]
; SSE2-NEXT: paddd %xmm4, %xmm7
; SSE2-NEXT: paddd %xmm4, %xmm0
; SSE2-NEXT: paddd %xmm4, %xmm6
; SSE2-NEXT: paddd %xmm4, %xmm3
; SSE2-NEXT: paddd %xmm4, %xmm5
; SSE2-NEXT: paddd %xmm4, %xmm1
; SSE2-NEXT: paddd %xmm4, %xmm13
; SSE2-NEXT: paddd %xmm4, %xmm2
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm4, %xmm7
; SSE2-NEXT: pand %xmm4, %xmm0
; SSE2-NEXT: packuswb %xmm7, %xmm0
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: pand %xmm4, %xmm6
; SSE2-NEXT: pand %xmm4, %xmm3
; SSE2-NEXT: packuswb %xmm6, %xmm3
; SSE2-NEXT: packuswb %xmm3, %xmm0
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm4, %xmm5
; SSE2-NEXT: pand %xmm4, %xmm1
; SSE2-NEXT: packuswb %xmm5, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: pand %xmm4, %xmm13
; SSE2-NEXT: pand %xmm4, %xmm2
; SSE2-NEXT: packuswb %xmm13, %xmm2
; SSE2-NEXT: packuswb %xmm2, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v32i8:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rsi), %ymm0
|
|
|
|
; AVX2-NEXT: vpavgb (%rdi), %ymm0, %ymm0
|
|
|
|
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX2-NEXT: vzeroupper
|
|
|
|
; AVX2-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
;
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-LABEL: avg_v32i8:
|
|
|
|
; AVX512F: # BB#0:
|
|
|
|
; AVX512F-NEXT: vmovdqa (%rsi), %ymm0
|
|
|
|
; AVX512F-NEXT: vpavgb (%rdi), %ymm0, %ymm0
|
|
|
|
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX512F-NEXT: retq
|
|
|
|
;
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-LABEL: avg_v32i8:
|
|
|
|
; AVX512BW: # BB#0:
|
|
|
|
; AVX512BW-NEXT: vmovdqa (%rsi), %ymm0
|
|
|
|
; AVX512BW-NEXT: vpavgb (%rdi), %ymm0, %ymm0
|
|
|
|
; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX512BW-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <32 x i8>, <32 x i8>* %a
%2 = load <32 x i8>, <32 x i8>* %b
%3 = zext <32 x i8> %1 to <32 x i32>
%4 = zext <32 x i8> %2 to <32 x i32>
%5 = add nuw nsw <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <32 x i32> %5, %4
%7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <32 x i32> %7 to <32 x i8>
store <32 x i8> %8, <32 x i8>* undef, align 4
ret void
}

define void @avg_v64i8(<64 x i8>* %a, <64 x i8>* %b) {
; SSE2-LABEL: avg_v64i8:
; SSE2: # BB#0:
; SSE2-NEXT: subq $152, %rsp
; SSE2-NEXT: .Lcfi0:
; SSE2-NEXT: .cfi_def_cfa_offset 160
; SSE2-NEXT: movdqa (%rdi), %xmm1
; SSE2-NEXT: movdqa 16(%rdi), %xmm2
; SSE2-NEXT: movdqa 32(%rdi), %xmm5
; SSE2-NEXT: movdqa 48(%rdi), %xmm6
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm1[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3],xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm4, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm4, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa %xmm2, %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm3, {{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm5[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm5, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, (%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm5, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: pshufd {{.*#+}} xmm9 = xmm6[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3],xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm6, %xmm1
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklbw {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3],xmm9[4],xmm0[4],xmm9[5],xmm0[5],xmm9[6],xmm0[6],xmm9[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm9, %xmm1
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3]
; SSE2-NEXT: movdqa (%rsi), %xmm15
; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm15[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3],xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm15, %xmm10
; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm0[4],xmm10[5],xmm0[5],xmm10[6],xmm0[6],xmm10[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3],xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm7, %xmm14
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm1[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3],xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm6, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: pshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm11
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm5, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: movdqa 48(%rsi), %xmm3
; SSE2-NEXT: pshufd {{.*#+}} xmm8 = xmm3[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3],xmm8[4],xmm0[4],xmm8[5],xmm0[5],xmm8[6],xmm0[6],xmm8[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm8, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: paddd %xmm9, %xmm8
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Folded Reload
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Folded Reload
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Reload
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Folded Reload
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Folded Reload
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Folded Reload
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Folded Reload
; SSE2-NEXT: paddd (%rsp), %xmm11 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm6 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm12 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm13 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm7 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm14 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm15 # 16-byte Folded Reload
; SSE2-NEXT: paddd {{[0-9]+}}(%rsp), %xmm10 # 16-byte Folded Reload
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [1,1,1,1]
; SSE2-NEXT: paddd %xmm0, %xmm10
; SSE2-NEXT: paddd %xmm0, %xmm15
; SSE2-NEXT: paddd %xmm0, %xmm14
; SSE2-NEXT: paddd %xmm0, %xmm7
; SSE2-NEXT: paddd %xmm0, %xmm13
; SSE2-NEXT: paddd %xmm0, %xmm1
; SSE2-NEXT: paddd %xmm0, %xmm12
; SSE2-NEXT: paddd %xmm0, %xmm6
; SSE2-NEXT: paddd %xmm0, %xmm11
; SSE2-NEXT: paddd %xmm0, %xmm2
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd %xmm0, %xmm5
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: paddd %xmm0, %xmm3
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: paddd %xmm0, %xmm8
; SSE2-NEXT: psrld $1, %xmm15
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: pand %xmm0, %xmm15
; SSE2-NEXT: packuswb %xmm10, %xmm15
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm7
; SSE2-NEXT: packuswb %xmm14, %xmm7
; SSE2-NEXT: packuswb %xmm7, %xmm15
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: packuswb %xmm13, %xmm1
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: packuswb %xmm12, %xmm6
; SSE2-NEXT: packuswb %xmm6, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm11, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: packuswb %xmm6, %xmm5
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: movdqa %xmm9, %xmm5
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: packuswb %xmm5, %xmm3
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: movdqa %xmm4, %xmm5
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: packuswb %xmm5, %xmm8
; SSE2-NEXT: packuswb %xmm8, %xmm3
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm15, (%rax)
; SSE2-NEXT: addq $152, %rsp
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v64i8:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm10 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm11 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm12 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm13 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm14 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpaddd %ymm15, %ymm7, %ymm7
; AVX2-NEXT: vpaddd %ymm14, %ymm6, %ymm6
; AVX2-NEXT: vpaddd %ymm13, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm12, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm11, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm10, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm9, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm0
; AVX2-NEXT: vpbroadcastd {{.*}}(%rip), %ymm8
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm9
; AVX2-NEXT: vpaddd %ymm8, %ymm1, %ymm10
; AVX2-NEXT: vpaddd %ymm8, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm8, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm8, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm8, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm8, %ymm6, %ymm1
; AVX2-NEXT: vpaddd %ymm8, %ymm7, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm11
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm12
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm6
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm7
; AVX2-NEXT: vpsrld $1, %ymm10, %ymm8
; AVX2-NEXT: vpsrld $1, %ymm9, %ymm3
; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT: vpshufb %ymm2, %ymm3, %ymm3
; AVX2-NEXT: vpermq {{.*#+}} ymm9 = ymm3[0,2,2,3]
; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX2-NEXT: vpshufb %xmm3, %xmm9, %xmm0
; AVX2-NEXT: vpshufb %ymm2, %ymm8, %ymm8
; AVX2-NEXT: vpermq {{.*#+}} ymm8 = ymm8[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm8, %xmm1
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm7, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm6, %ymm6
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm6[0],xmm1[0]
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: vpshufb %ymm2, %ymm4, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm5, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm4[0],xmm1[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm12, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpshufb %ymm2, %ymm11, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm2, %xmm2
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpaddd %zmm7, %zmm3, %zmm3
; AVX512F-NEXT: vpaddd %zmm6, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm5, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpbroadcastd {{.*}}(%rip), %zmm4
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpaddd %zmm4, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm4, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm4, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX512F-NEXT: vpmovdb %zmm2, %xmm1
; AVX512F-NEXT: vpmovdb %zmm3, %xmm2
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX512F-NEXT: vmovdqu %ymm1, (%rax)
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqu8 (%rsi), %zmm0
; AVX512BW-NEXT: vpavgb (%rdi), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu8 %zmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <64 x i8>, <64 x i8>* %a
|
|
|
|
%2 = load <64 x i8>, <64 x i8>* %b
|
|
|
|
%3 = zext <64 x i8> %1 to <64 x i32>
|
|
|
|
%4 = zext <64 x i8> %2 to <64 x i32>
|
|
|
|
%5 = add nuw nsw <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%6 = add nuw nsw <64 x i32> %5, %4
|
|
|
|
%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%8 = trunc <64 x i32> %7 to <64 x i8>
|
|
|
|
store <64 x i8> %8, <64 x i8>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
|
|
|
define void @avg_v4i16(<4 x i16>* %a, <4 x i16>* %b) {
; SSE2-LABEL: avg_v4i16:
; SSE2:       # BB#0:
; SSE2-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT:    pavgw %xmm0, %xmm1
; SSE2-NEXT:    movq %xmm1, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v4i16:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX2-NEXT:    vpavgw %xmm0, %xmm1, %xmm0
; AVX2-NEXT:    vmovq %xmm0, (%rax)
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v4i16:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512F-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512F-NEXT:    vpavgw %xmm0, %xmm1, %xmm0
; AVX512F-NEXT:    vmovq %xmm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v4i16:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512BW-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512BW-NEXT:    vpavgw %xmm0, %xmm1, %xmm0
; AVX512BW-NEXT:    vmovq %xmm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <4 x i16>, <4 x i16>* %a
  %2 = load <4 x i16>, <4 x i16>* %b
  %3 = zext <4 x i16> %1 to <4 x i32>
  %4 = zext <4 x i16> %2 to <4 x i32>
  %5 = add nuw nsw <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <4 x i32> %5, %4
  %7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <4 x i32> %7 to <4 x i16>
  store <4 x i16> %8, <4 x i16>* undef, align 4
  ret void
}

define void @avg_v8i16(<8 x i16>* %a, <8 x i16>* %b) {
; SSE2-LABEL: avg_v8i16:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rsi), %xmm0
; SSE2-NEXT:    pavgw (%rdi), %xmm0
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v8i16:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovdqa (%rsi), %xmm0
; AVX2-NEXT:    vpavgw (%rdi), %xmm0, %xmm0
; AVX2-NEXT:    vmovdqu %xmm0, (%rax)
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v8i16:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovdqa (%rsi), %xmm0
; AVX512F-NEXT:    vpavgw (%rdi), %xmm0, %xmm0
; AVX512F-NEXT:    vmovdqu %xmm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v8i16:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovdqa (%rsi), %xmm0
; AVX512BW-NEXT:    vpavgw (%rdi), %xmm0, %xmm0
; AVX512BW-NEXT:    vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <8 x i16>, <8 x i16>* %a
  %2 = load <8 x i16>, <8 x i16>* %b
  %3 = zext <8 x i16> %1 to <8 x i32>
  %4 = zext <8 x i16> %2 to <8 x i32>
  %5 = add nuw nsw <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <8 x i32> %5, %4
  %7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <8 x i32> %7 to <8 x i16>
  store <8 x i16> %8, <8 x i16>* undef, align 4
  ret void
}

define void @avg_v16i16(<16 x i16>* %a, <16 x i16>* %b) {
; SSE2-LABEL: avg_v16i16:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rdi), %xmm4
; SSE2-NEXT:    movdqa 16(%rdi), %xmm5
; SSE2-NEXT:    movdqa (%rsi), %xmm0
; SSE2-NEXT:    movdqa 16(%rsi), %xmm1
; SSE2-NEXT:    pxor %xmm6, %xmm6
; SSE2-NEXT:    movdqa %xmm4, %xmm8
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm6[4],xmm8[5],xmm6[5],xmm8[6],xmm6[6],xmm8[7],xmm6[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm6[0],xmm4[1],xmm6[1],xmm4[2],xmm6[2],xmm4[3],xmm6[3]
; SSE2-NEXT:    movdqa %xmm5, %xmm7
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1],xmm5[2],xmm6[2],xmm5[3],xmm6[3]
; SSE2-NEXT:    movdqa %xmm0, %xmm3
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm6[4],xmm3[5],xmm6[5],xmm3[6],xmm6[6],xmm3[7],xmm6[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; SSE2-NEXT:    movdqa %xmm1, %xmm2
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm6[4],xmm2[5],xmm6[5],xmm2[6],xmm6[6],xmm2[7],xmm6[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm6[0],xmm1[1],xmm6[1],xmm1[2],xmm6[2],xmm1[3],xmm6[3]
; SSE2-NEXT:    paddd %xmm5, %xmm1
; SSE2-NEXT:    paddd %xmm7, %xmm2
; SSE2-NEXT:    paddd %xmm4, %xmm0
; SSE2-NEXT:    paddd %xmm8, %xmm3
; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [1,1,1,1]
; SSE2-NEXT:    paddd %xmm4, %xmm3
; SSE2-NEXT:    paddd %xmm4, %xmm0
; SSE2-NEXT:    paddd %xmm4, %xmm2
; SSE2-NEXT:    paddd %xmm4, %xmm1
; SSE2-NEXT:    psrld $1, %xmm1
; SSE2-NEXT:    psrld $1, %xmm2
; SSE2-NEXT:    psrld $1, %xmm0
; SSE2-NEXT:    psrld $1, %xmm3
; SSE2-NEXT:    pslld $16, %xmm3
; SSE2-NEXT:    psrad $16, %xmm3
; SSE2-NEXT:    pslld $16, %xmm0
; SSE2-NEXT:    psrad $16, %xmm0
; SSE2-NEXT:    packssdw %xmm3, %xmm0
; SSE2-NEXT:    pslld $16, %xmm2
; SSE2-NEXT:    psrad $16, %xmm2
; SSE2-NEXT:    pslld $16, %xmm1
; SSE2-NEXT:    psrad $16, %xmm1
; SSE2-NEXT:    packssdw %xmm2, %xmm1
; SSE2-NEXT:    movdqu %xmm1, (%rax)
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v16i16:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovdqa (%rsi), %ymm0
; AVX2-NEXT:    vpavgw (%rdi), %ymm0, %ymm0
; AVX2-NEXT:    vmovdqu %ymm0, (%rax)
; AVX2-NEXT:    vzeroupper
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v16i16:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovdqa (%rsi), %ymm0
; AVX512F-NEXT:    vpavgw (%rdi), %ymm0, %ymm0
; AVX512F-NEXT:    vmovdqu %ymm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v16i16:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovdqa (%rsi), %ymm0
; AVX512BW-NEXT:    vpavgw (%rdi), %ymm0, %ymm0
; AVX512BW-NEXT:    vmovdqu %ymm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <16 x i16>, <16 x i16>* %a
  %2 = load <16 x i16>, <16 x i16>* %b
  %3 = zext <16 x i16> %1 to <16 x i32>
  %4 = zext <16 x i16> %2 to <16 x i32>
  %5 = add nuw nsw <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <16 x i32> %5, %4
  %7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <16 x i32> %7 to <16 x i16>
  store <16 x i16> %8, <16 x i16>* undef, align 4
  ret void
}

define void @avg_v32i16(<32 x i16>* %a, <32 x i16>* %b) {
; SSE2-LABEL: avg_v32i16:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rdi), %xmm10
; SSE2-NEXT:    movdqa 16(%rdi), %xmm9
; SSE2-NEXT:    movdqa 32(%rdi), %xmm11
; SSE2-NEXT:    movdqa 48(%rdi), %xmm8
; SSE2-NEXT:    movdqa (%rsi), %xmm14
; SSE2-NEXT:    movdqa 16(%rsi), %xmm1
; SSE2-NEXT:    movdqa 32(%rsi), %xmm2
; SSE2-NEXT:    movdqa 48(%rsi), %xmm3
; SSE2-NEXT:    pxor %xmm0, %xmm0
; SSE2-NEXT:    movdqa %xmm10, %xmm4
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT:    movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm9, %xmm12
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm11, %xmm15
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm8, %xmm13
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm14, %xmm7
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm1, %xmm6
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm2, %xmm5
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT:    movdqa %xmm3, %xmm4
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT:    paddd %xmm8, %xmm3
; SSE2-NEXT:    paddd %xmm13, %xmm4
; SSE2-NEXT:    paddd %xmm11, %xmm2
; SSE2-NEXT:    paddd %xmm15, %xmm5
; SSE2-NEXT:    paddd %xmm9, %xmm1
; SSE2-NEXT:    paddd %xmm12, %xmm6
; SSE2-NEXT:    paddd %xmm10, %xmm14
; SSE2-NEXT:    paddd -{{[0-9]+}}(%rsp), %xmm7 # 16-byte Folded Reload
; SSE2-NEXT:    movdqa {{.*#+}} xmm0 = [1,1,1,1]
; SSE2-NEXT:    paddd %xmm0, %xmm7
; SSE2-NEXT:    paddd %xmm0, %xmm14
; SSE2-NEXT:    paddd %xmm0, %xmm6
; SSE2-NEXT:    paddd %xmm0, %xmm1
; SSE2-NEXT:    paddd %xmm0, %xmm5
; SSE2-NEXT:    paddd %xmm0, %xmm2
; SSE2-NEXT:    paddd %xmm0, %xmm4
; SSE2-NEXT:    paddd %xmm0, %xmm3
; SSE2-NEXT:    psrld $1, %xmm14
; SSE2-NEXT:    psrld $1, %xmm7
; SSE2-NEXT:    pslld $16, %xmm7
; SSE2-NEXT:    psrad $16, %xmm7
; SSE2-NEXT:    pslld $16, %xmm14
; SSE2-NEXT:    psrad $16, %xmm14
; SSE2-NEXT:    packssdw %xmm7, %xmm14
; SSE2-NEXT:    psrld $1, %xmm1
; SSE2-NEXT:    psrld $1, %xmm6
; SSE2-NEXT:    pslld $16, %xmm6
; SSE2-NEXT:    psrad $16, %xmm6
; SSE2-NEXT:    pslld $16, %xmm1
; SSE2-NEXT:    psrad $16, %xmm1
; SSE2-NEXT:    packssdw %xmm6, %xmm1
; SSE2-NEXT:    psrld $1, %xmm2
; SSE2-NEXT:    psrld $1, %xmm5
; SSE2-NEXT:    pslld $16, %xmm5
; SSE2-NEXT:    psrad $16, %xmm5
; SSE2-NEXT:    pslld $16, %xmm2
; SSE2-NEXT:    psrad $16, %xmm2
; SSE2-NEXT:    packssdw %xmm5, %xmm2
; SSE2-NEXT:    psrld $1, %xmm3
; SSE2-NEXT:    psrld $1, %xmm4
; SSE2-NEXT:    pslld $16, %xmm4
; SSE2-NEXT:    psrad $16, %xmm4
; SSE2-NEXT:    pslld $16, %xmm3
; SSE2-NEXT:    psrad $16, %xmm3
; SSE2-NEXT:    packssdw %xmm4, %xmm3
; SSE2-NEXT:    movdqu %xmm3, (%rax)
; SSE2-NEXT:    movdqu %xmm2, (%rax)
; SSE2-NEXT:    movdqu %xmm1, (%rax)
; SSE2-NEXT:    movdqu %xmm14, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v32i16:
; AVX2:       # BB#0:
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm6 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpmovzxwd {{.*#+}} ymm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT:    vpaddd %ymm7, %ymm3, %ymm3
; AVX2-NEXT:    vpaddd %ymm6, %ymm2, %ymm2
; AVX2-NEXT:    vpaddd %ymm5, %ymm1, %ymm1
; AVX2-NEXT:    vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT:    vpbroadcastd {{.*}}(%rip), %ymm4
; AVX2-NEXT:    vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT:    vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT:    vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT:    vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT:    vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT:    vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT:    vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT:    vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT:    vmovdqa {{.*#+}} ymm4 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT:    vpshufb %ymm4, %ymm0, %ymm0
; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT:    vpshufb %ymm4, %ymm1, %ymm1
; AVX2-NEXT:    vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT:    vpshufb %ymm4, %ymm2, %ymm1
; AVX2-NEXT:    vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT:    vpshufb %ymm4, %ymm3, %ymm2
; AVX2-NEXT:    vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT:    vmovdqu %ymm1, (%rax)
; AVX2-NEXT:    vmovdqu %ymm0, (%rax)
; AVX2-NEXT:    vzeroupper
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v32i16:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT:    vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT:    vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT:    vpmovzxwd {{.*#+}} zmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT:    vpaddd %zmm3, %zmm1, %zmm1
; AVX512F-NEXT:    vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT:    vpbroadcastd {{.*}}(%rip), %zmm2
; AVX512F-NEXT:    vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT:    vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT:    vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT:    vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT:    vpmovdw %zmm0, (%rax)
; AVX512F-NEXT:    vpmovdw %zmm1, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v32i16:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovdqu16 (%rsi), %zmm0
; AVX512BW-NEXT:    vpavgw (%rdi), %zmm0, %zmm0
; AVX512BW-NEXT:    vmovdqu16 %zmm0, (%rax)
; AVX512BW-NEXT:    retq
%1 = load <32 x i16>, <32 x i16>* %a
|
|
|
|
%2 = load <32 x i16>, <32 x i16>* %b
|
|
|
|
%3 = zext <32 x i16> %1 to <32 x i32>
|
|
|
|
%4 = zext <32 x i16> %2 to <32 x i32>
|
|
|
|
%5 = add nuw nsw <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%6 = add nuw nsw <32 x i32> %5, %4
|
|
|
|
%7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%8 = trunc <32 x i32> %7 to <32 x i16>
|
|
|
|
store <32 x i16> %8, <32 x i16>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}

define void @avg_v4i8_2(<4 x i8>* %a, <4 x i8>* %b) {
; SSE2-LABEL: avg_v4i8_2:
; SSE2:       # BB#0:
; SSE2-NEXT:    movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT:    movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE2-NEXT:    pavgb %xmm0, %xmm1
; SSE2-NEXT:    movd %xmm1, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v4i8_2:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX2-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX2-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX2-NEXT:    vmovd %xmm0, (%rax)
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v4i8_2:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512F-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX512F-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX512F-NEXT:    vmovd %xmm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v4i8_2:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512BW-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX512BW-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX512BW-NEXT:    vmovd %xmm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <4 x i8>, <4 x i8>* %a
  %2 = load <4 x i8>, <4 x i8>* %b
  %3 = zext <4 x i8> %1 to <4 x i32>
  %4 = zext <4 x i8> %2 to <4 x i32>
  %5 = add nuw nsw <4 x i32> %3, %4
  %6 = add nuw nsw <4 x i32> %5, <i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <4 x i32> %7 to <4 x i8>
  store <4 x i8> %8, <4 x i8>* undef, align 4
  ret void
}

define void @avg_v8i8_2(<8 x i8>* %a, <8 x i8>* %b) {
; SSE2-LABEL: avg_v8i8_2:
; SSE2:       # BB#0:
; SSE2-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT:    pavgb %xmm0, %xmm1
; SSE2-NEXT:    movq %xmm1, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v8i8_2:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX2-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX2-NEXT:    vmovq %xmm0, (%rax)
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v8i8_2:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512F-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512F-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX512F-NEXT:    vmovq %xmm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v8i8_2:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512BW-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX512BW-NEXT:    vpavgb %xmm1, %xmm0, %xmm0
; AVX512BW-NEXT:    vmovq %xmm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <8 x i8>, <8 x i8>* %a
  %2 = load <8 x i8>, <8 x i8>* %b
  %3 = zext <8 x i8> %1 to <8 x i32>
  %4 = zext <8 x i8> %2 to <8 x i32>
  %5 = add nuw nsw <8 x i32> %3, %4
  %6 = add nuw nsw <8 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <8 x i32> %7 to <8 x i8>
  store <8 x i8> %8, <8 x i8>* undef, align 4
  ret void
}

define void @avg_v16i8_2(<16 x i8>* %a, <16 x i8>* %b) {
; SSE2-LABEL: avg_v16i8_2:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rdi), %xmm0
; SSE2-NEXT:    pavgb (%rsi), %xmm0
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
; AVX2-LABEL: avg_v16i8_2:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovdqa (%rdi), %xmm0
; AVX2-NEXT:    vpavgb (%rsi), %xmm0, %xmm0
; AVX2-NEXT:    vmovdqu %xmm0, (%rax)
; AVX2-NEXT:    retq
;
; AVX512F-LABEL: avg_v16i8_2:
; AVX512F:       # BB#0:
; AVX512F-NEXT:    vmovdqa (%rdi), %xmm0
; AVX512F-NEXT:    vpavgb (%rsi), %xmm0, %xmm0
; AVX512F-NEXT:    vmovdqu %xmm0, (%rax)
; AVX512F-NEXT:    retq
;
; AVX512BW-LABEL: avg_v16i8_2:
; AVX512BW:       # BB#0:
; AVX512BW-NEXT:    vmovdqa (%rdi), %xmm0
; AVX512BW-NEXT:    vpavgb (%rsi), %xmm0, %xmm0
; AVX512BW-NEXT:    vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT:    retq
  %1 = load <16 x i8>, <16 x i8>* %a
  %2 = load <16 x i8>, <16 x i8>* %b
  %3 = zext <16 x i8> %1 to <16 x i32>
  %4 = zext <16 x i8> %2 to <16 x i32>
  %5 = add nuw nsw <16 x i32> %3, %4
  %6 = add nuw nsw <16 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <16 x i32> %7 to <16 x i8>
  store <16 x i8> %8, <16 x i8>* undef, align 4
  ret void
}

define void @avg_v32i8_2(<32 x i8>* %a, <32 x i8>* %b) {
; SSE2-LABEL: avg_v32i8_2:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rdi), %xmm8
; SSE2-NEXT:    movdqa 16(%rdi), %xmm11
; SSE2-NEXT:    movdqa (%rsi), %xmm0
; SSE2-NEXT:    movdqa 16(%rsi), %xmm1
; SSE2-NEXT:    pxor %xmm4, %xmm4
; SSE2-NEXT:    pshufd {{.*#+}} xmm10 = xmm8[2,3,0,1]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3],xmm8[4],xmm4[4],xmm8[5],xmm4[5],xmm8[6],xmm4[6],xmm8[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm8, %xmm2
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm10 = xmm10[0],xmm4[0],xmm10[1],xmm4[1],xmm10[2],xmm4[2],xmm10[3],xmm4[3],xmm10[4],xmm4[4],xmm10[5],xmm4[5],xmm10[6],xmm4[6],xmm10[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm10, %xmm12
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm4[4],xmm12[5],xmm4[5],xmm12[6],xmm4[6],xmm12[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm4[0],xmm10[1],xmm4[1],xmm10[2],xmm4[2],xmm10[3],xmm4[3]
; SSE2-NEXT:    pshufd {{.*#+}} xmm15 = xmm11[2,3,0,1]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm4[0],xmm11[1],xmm4[1],xmm11[2],xmm4[2],xmm11[3],xmm4[3],xmm11[4],xmm4[4],xmm11[5],xmm4[5],xmm11[6],xmm4[6],xmm11[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm11, %xmm14
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm4[4],xmm14[5],xmm4[5],xmm14[6],xmm4[6],xmm14[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm4[0],xmm11[1],xmm4[1],xmm11[2],xmm4[2],xmm11[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm15 = xmm15[0],xmm4[0],xmm15[1],xmm4[1],xmm15[2],xmm4[2],xmm15[3],xmm4[3],xmm15[4],xmm4[4],xmm15[5],xmm4[5],xmm15[6],xmm4[6],xmm15[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm15, %xmm9
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm4[4],xmm9[5],xmm4[5],xmm9[6],xmm4[6],xmm9[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm4[0],xmm15[1],xmm4[1],xmm15[2],xmm4[2],xmm15[3],xmm4[3]
; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3],xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm0, %xmm7
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm3, %xmm6
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm1[2,3,0,1]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3],xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm1, %xmm5
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3],xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm2, %xmm13
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm4[4],xmm13[5],xmm4[5],xmm13[6],xmm4[6],xmm13[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT:    paddd %xmm15, %xmm2
; SSE2-NEXT:    paddd %xmm9, %xmm13
; SSE2-NEXT:    paddd %xmm11, %xmm1
; SSE2-NEXT:    paddd %xmm14, %xmm5
; SSE2-NEXT:    paddd %xmm10, %xmm3
; SSE2-NEXT:    paddd %xmm12, %xmm6
; SSE2-NEXT:    paddd %xmm8, %xmm0
; SSE2-NEXT:    paddd -{{[0-9]+}}(%rsp), %xmm7 # 16-byte Folded Reload
; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [1,1,1,1]
; SSE2-NEXT:    paddd %xmm4, %xmm7
; SSE2-NEXT:    paddd %xmm4, %xmm0
; SSE2-NEXT:    paddd %xmm4, %xmm6
; SSE2-NEXT:    paddd %xmm4, %xmm3
; SSE2-NEXT:    paddd %xmm4, %xmm5
; SSE2-NEXT:    paddd %xmm4, %xmm1
; SSE2-NEXT:    paddd %xmm4, %xmm13
; SSE2-NEXT:    paddd %xmm4, %xmm2
; SSE2-NEXT:    psrld $1, %xmm0
; SSE2-NEXT:    psrld $1, %xmm7
; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT:    pand %xmm4, %xmm7
; SSE2-NEXT:    pand %xmm4, %xmm0
; SSE2-NEXT:    packuswb %xmm7, %xmm0
; SSE2-NEXT:    psrld $1, %xmm3
; SSE2-NEXT:    psrld $1, %xmm6
; SSE2-NEXT:    pand %xmm4, %xmm6
; SSE2-NEXT:    pand %xmm4, %xmm3
; SSE2-NEXT:    packuswb %xmm6, %xmm3
; SSE2-NEXT:    packuswb %xmm3, %xmm0
; SSE2-NEXT:    psrld $1, %xmm1
; SSE2-NEXT:    psrld $1, %xmm5
; SSE2-NEXT:    pand %xmm4, %xmm5
; SSE2-NEXT:    pand %xmm4, %xmm1
; SSE2-NEXT:    packuswb %xmm5, %xmm1
; SSE2-NEXT:    psrld $1, %xmm2
; SSE2-NEXT:    psrld $1, %xmm13
; SSE2-NEXT:    pand %xmm4, %xmm13
; SSE2-NEXT:    pand %xmm4, %xmm2
; SSE2-NEXT:    packuswb %xmm13, %xmm2
; SSE2-NEXT:    packuswb %xmm2, %xmm1
; SSE2-NEXT:    movdqu %xmm1, (%rax)
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX2-LABEL: avg_v32i8_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgb (%rsi), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i8_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %ymm0
; AVX512F-NEXT: vpavgb (%rsi), %ymm0, %ymm0
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i8_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %ymm0
; AVX512BW-NEXT: vpavgb (%rsi), %ymm0, %ymm0
; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <32 x i8>, <32 x i8>* %a
%2 = load <32 x i8>, <32 x i8>* %b
%3 = zext <32 x i8> %1 to <32 x i32>
%4 = zext <32 x i8> %2 to <32 x i32>
%5 = add nuw nsw <32 x i32> %3, %4
%6 = add nuw nsw <32 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <32 x i32> %7 to <32 x i8>
store <32 x i8> %8, <32 x i8>* undef, align 4
ret void
}
define void @avg_v64i8_2(<64 x i8>* %a, <64 x i8>* %b) {
; SSE2-LABEL: avg_v64i8_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rsi), %xmm15
; SSE2-NEXT: movdqa 16(%rsi), %xmm13
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: movdqa 48(%rsi), %xmm3
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: pshufd {{.*#+}} xmm7 = xmm15[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3],xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm15, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm0[4],xmm8[5],xmm0[5],xmm8[6],xmm0[6],xmm8[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3],xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm7, %xmm14
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm13[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3],xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm13, %xmm9
; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm0[4],xmm9[5],xmm0[5],xmm9[6],xmm0[6],xmm9[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3],xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm6, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm5 = xmm2[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm11
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm5, %xmm10
; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm0[4],xmm10[5],xmm0[5],xmm10[6],xmm0[6],xmm10[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm3[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: paddd %xmm1, %xmm1
; SSE2-NEXT: paddd %xmm4, %xmm4
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd %xmm3, %xmm3
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; SSE2-NEXT: paddd %xmm4, %xmm4
; SSE2-NEXT: paddd %xmm5, %xmm5
; SSE2-NEXT: paddd %xmm10, %xmm10
; SSE2-NEXT: paddd %xmm2, %xmm2
; SSE2-NEXT: paddd %xmm11, %xmm11
; SSE2-NEXT: paddd %xmm6, %xmm6
; SSE2-NEXT: paddd %xmm12, %xmm12
; SSE2-NEXT: paddd %xmm13, %xmm13
; SSE2-NEXT: paddd %xmm9, %xmm9
; SSE2-NEXT: paddd %xmm7, %xmm7
; SSE2-NEXT: paddd %xmm14, %xmm14
; SSE2-NEXT: paddd %xmm15, %xmm15
; SSE2-NEXT: paddd %xmm8, %xmm8
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [1,1,1,1]
; SSE2-NEXT: paddd %xmm0, %xmm8
; SSE2-NEXT: paddd %xmm0, %xmm15
; SSE2-NEXT: paddd %xmm0, %xmm14
; SSE2-NEXT: paddd %xmm0, %xmm7
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: paddd %xmm0, %xmm13
; SSE2-NEXT: paddd %xmm0, %xmm12
; SSE2-NEXT: paddd %xmm0, %xmm6
; SSE2-NEXT: paddd %xmm0, %xmm11
; SSE2-NEXT: paddd %xmm0, %xmm2
; SSE2-NEXT: paddd %xmm0, %xmm10
; SSE2-NEXT: paddd %xmm0, %xmm5
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd %xmm0, %xmm3
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: paddd %xmm0, %xmm1
; SSE2-NEXT: psrld $1, %xmm15
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: pand %xmm0, %xmm15
; SSE2-NEXT: packuswb %xmm8, %xmm15
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm7
; SSE2-NEXT: packuswb %xmm14, %xmm7
; SSE2-NEXT: packuswb %xmm7, %xmm15
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: packuswb %xmm9, %xmm13
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: packuswb %xmm12, %xmm6
; SSE2-NEXT: packuswb %xmm6, %xmm13
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm11, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: packuswb %xmm10, %xmm5
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: packuswb %xmm5, %xmm3
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: movdqa %xmm4, %xmm5
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: packuswb %xmm5, %xmm1
; SSE2-NEXT: packuswb %xmm1, %xmm3
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm13, (%rax)
; SSE2-NEXT: movdqu %xmm15, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v64i8_2:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpaddd %ymm7, %ymm7, %ymm7
; AVX2-NEXT: vpaddd %ymm6, %ymm6, %ymm6
; AVX2-NEXT: vpaddd %ymm5, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm4, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm3, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm2, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm1, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpbroadcastd {{.*}}(%rip), %ymm8
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm9
; AVX2-NEXT: vpaddd %ymm8, %ymm1, %ymm10
; AVX2-NEXT: vpaddd %ymm8, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm8, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm8, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm8, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm8, %ymm6, %ymm1
; AVX2-NEXT: vpaddd %ymm8, %ymm7, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm11
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm12
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm6
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm7
; AVX2-NEXT: vpsrld $1, %ymm10, %ymm8
; AVX2-NEXT: vpsrld $1, %ymm9, %ymm3
; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT: vpshufb %ymm2, %ymm3, %ymm3
; AVX2-NEXT: vpermq {{.*#+}} ymm9 = ymm3[0,2,2,3]
; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX2-NEXT: vpshufb %xmm3, %xmm9, %xmm0
; AVX2-NEXT: vpshufb %ymm2, %ymm8, %ymm8
; AVX2-NEXT: vpermq {{.*#+}} ymm8 = ymm8[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm8, %xmm1
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm7, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm6, %ymm6
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm6[0],xmm1[0]
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: vpshufb %ymm2, %ymm4, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm5, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm4[0],xmm1[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm12, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpshufb %ymm2, %ymm11, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm2, %xmm2
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpaddd %zmm3, %zmm3, %zmm3
; AVX512F-NEXT: vpaddd %zmm2, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm1, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm0, %zmm0, %zmm0
; AVX512F-NEXT: vpbroadcastd {{.*}}(%rip), %zmm4
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpaddd %zmm4, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm4, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm4, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX512F-NEXT: vpmovdb %zmm2, %xmm1
; AVX512F-NEXT: vpmovdb %zmm3, %xmm2
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX512F-NEXT: vmovdqu %ymm1, (%rax)
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqu8 (%rsi), %zmm0
; AVX512BW-NEXT: vpavgb %zmm0, %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu8 %zmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <64 x i8>, <64 x i8>* %a
%2 = load <64 x i8>, <64 x i8>* %b
%3 = zext <64 x i8> %1 to <64 x i32>
%4 = zext <64 x i8> %2 to <64 x i32>
%5 = add nuw nsw <64 x i32> %4, %4
%6 = add nuw nsw <64 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <64 x i32> %7 to <64 x i8>
store <64 x i8> %8, <64 x i8>* undef, align 4
ret void
}
define void @avg_v4i16_2(<4 x i16>* %a, <4 x i16>* %b) {
; SSE2-LABEL: avg_v4i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT: pavgw %xmm0, %xmm1
; SSE2-NEXT: movq %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v4i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX2-NEXT: vpavgw %xmm1, %xmm0, %xmm0
; AVX2-NEXT: vmovq %xmm0, (%rax)
; AVX2-NEXT: retq
;
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-LABEL: avg_v4i16_2:
|
|
|
|
; AVX512F: # BB#0:
|
|
|
|
; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
|
|
|
|
; AVX512F-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
|
|
|
|
; AVX512F-NEXT: vpavgw %xmm1, %xmm0, %xmm0
|
|
|
|
; AVX512F-NEXT: vmovq %xmm0, (%rax)
|
|
|
|
; AVX512F-NEXT: retq
|
|
|
|
;
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-LABEL: avg_v4i16_2:
|
|
|
|
; AVX512BW: # BB#0:
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
|
|
|
|
; AVX512BW-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-NEXT: vpavgw %xmm1, %xmm0, %xmm0
|
|
|
|
; AVX512BW-NEXT: vmovq %xmm0, (%rax)
|
|
|
|
; AVX512BW-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <4 x i16>, <4 x i16>* %a
|
|
|
|
%2 = load <4 x i16>, <4 x i16>* %b
|
|
|
|
%3 = zext <4 x i16> %1 to <4 x i32>
|
|
|
|
%4 = zext <4 x i16> %2 to <4 x i32>
|
|
|
|
%5 = add nuw nsw <4 x i32> %3, %4
|
|
|
|
%6 = add nuw nsw <4 x i32> %5, <i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%8 = trunc <4 x i32> %7 to <4 x i16>
|
|
|
|
store <4 x i16> %8, <4 x i16>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
|
|
|
define void @avg_v8i16_2(<8 x i16>* %a, <8 x i16>* %b) {
; SSE2-LABEL: avg_v8i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgw (%rsi), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v8i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %xmm0
; AVX2-NEXT: vpavgw (%rsi), %xmm0, %xmm0
; AVX2-NEXT: vmovdqu %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v8i16_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %xmm0
; AVX512F-NEXT: vpavgw (%rsi), %xmm0, %xmm0
; AVX512F-NEXT: vmovdqu %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v8i16_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %xmm0
; AVX512BW-NEXT: vpavgw (%rsi), %xmm0, %xmm0
; AVX512BW-NEXT: vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <8 x i16>, <8 x i16>* %a
  %2 = load <8 x i16>, <8 x i16>* %b
  %3 = zext <8 x i16> %1 to <8 x i32>
  %4 = zext <8 x i16> %2 to <8 x i32>
  %5 = add nuw nsw <8 x i32> %3, %4
  %6 = add nuw nsw <8 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <8 x i32> %7 to <8 x i16>
  store <8 x i16> %8, <8 x i16>* undef, align 4
  ret void
}

define void @avg_v16i16_2(<16 x i16>* %a, <16 x i16>* %b) {
; SSE2-LABEL: avg_v16i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm4
; SSE2-NEXT: movdqa 16(%rdi), %xmm5
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm6, %xmm6
; SSE2-NEXT: movdqa %xmm4, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm6[4],xmm8[5],xmm6[5],xmm8[6],xmm6[6],xmm8[7],xmm6[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm6[0],xmm4[1],xmm6[1],xmm4[2],xmm6[2],xmm4[3],xmm6[3]
; SSE2-NEXT: movdqa %xmm5, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1],xmm5[2],xmm6[2],xmm5[3],xmm6[3]
; SSE2-NEXT: movdqa %xmm0, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm6[4],xmm3[5],xmm6[5],xmm3[6],xmm6[6],xmm3[7],xmm6[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; SSE2-NEXT: movdqa %xmm1, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm6[4],xmm2[5],xmm6[5],xmm2[6],xmm6[6],xmm2[7],xmm6[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm6[0],xmm1[1],xmm6[1],xmm1[2],xmm6[2],xmm1[3],xmm6[3]
; SSE2-NEXT: paddd %xmm5, %xmm1
; SSE2-NEXT: paddd %xmm7, %xmm2
; SSE2-NEXT: paddd %xmm4, %xmm0
; SSE2-NEXT: paddd %xmm8, %xmm3
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [1,1,1,1]
; SSE2-NEXT: paddd %xmm4, %xmm3
; SSE2-NEXT: paddd %xmm4, %xmm0
; SSE2-NEXT: paddd %xmm4, %xmm2
; SSE2-NEXT: paddd %xmm4, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: packssdw %xmm3, %xmm0
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm2, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v16i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgw (%rsi), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v16i16_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %ymm0
; AVX512F-NEXT: vpavgw (%rsi), %ymm0, %ymm0
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v16i16_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %ymm0
; AVX512BW-NEXT: vpavgw (%rsi), %ymm0, %ymm0
; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <16 x i16>, <16 x i16>* %a
  %2 = load <16 x i16>, <16 x i16>* %b
  %3 = zext <16 x i16> %1 to <16 x i32>
  %4 = zext <16 x i16> %2 to <16 x i32>
  %5 = add nuw nsw <16 x i32> %3, %4
  %6 = add nuw nsw <16 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <16 x i32> %7 to <16 x i16>
  store <16 x i16> %8, <16 x i16>* undef, align 4
  ret void
}

define void @avg_v32i16_2(<32 x i16>* %a, <32 x i16>* %b) {
; SSE2-LABEL: avg_v32i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm10
; SSE2-NEXT: movdqa 16(%rdi), %xmm9
; SSE2-NEXT: movdqa 32(%rdi), %xmm11
; SSE2-NEXT: movdqa 48(%rdi), %xmm8
; SSE2-NEXT: movdqa (%rsi), %xmm14
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: movdqa 48(%rsi), %xmm3
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm10, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm4, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm9, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm11, %xmm15
; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm8, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm14, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm1, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm2, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm3, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: paddd %xmm8, %xmm3
; SSE2-NEXT: paddd %xmm13, %xmm4
; SSE2-NEXT: paddd %xmm11, %xmm2
; SSE2-NEXT: paddd %xmm15, %xmm5
; SSE2-NEXT: paddd %xmm9, %xmm1
; SSE2-NEXT: paddd %xmm12, %xmm6
; SSE2-NEXT: paddd %xmm10, %xmm14
; SSE2-NEXT: paddd -{{[0-9]+}}(%rsp), %xmm7 # 16-byte Folded Reload
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [1,1,1,1]
; SSE2-NEXT: paddd %xmm0, %xmm7
; SSE2-NEXT: paddd %xmm0, %xmm14
; SSE2-NEXT: paddd %xmm0, %xmm6
; SSE2-NEXT: paddd %xmm0, %xmm1
; SSE2-NEXT: paddd %xmm0, %xmm5
; SSE2-NEXT: paddd %xmm0, %xmm2
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: paddd %xmm0, %xmm3
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: pslld $16, %xmm7
; SSE2-NEXT: psrad $16, %xmm7
; SSE2-NEXT: pslld $16, %xmm14
; SSE2-NEXT: psrad $16, %xmm14
; SSE2-NEXT: packssdw %xmm7, %xmm14
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: pslld $16, %xmm6
; SSE2-NEXT: psrad $16, %xmm6
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm6, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pslld $16, %xmm5
; SSE2-NEXT: psrad $16, %xmm5
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: pslld $16, %xmm4
; SSE2-NEXT: psrad $16, %xmm4
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: packssdw %xmm4, %xmm3
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm14, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v32i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm6 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm7, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm6, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm5, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpbroadcastd {{.*}}(%rip), %ymm4
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT: vpshufb %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: vpshufb %ymm4, %ymm2, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm3, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i16_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpbroadcastd {{.*}}(%rip), %zmm2
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdw %zmm0, (%rax)
; AVX512F-NEXT: vpmovdw %zmm1, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i16_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqu16 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgw (%rsi), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu16 %zmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <32 x i16>, <32 x i16>* %a
  %2 = load <32 x i16>, <32 x i16>* %b
  %3 = zext <32 x i16> %1 to <32 x i32>
  %4 = zext <32 x i16> %2 to <32 x i32>
  %5 = add nuw nsw <32 x i32> %3, %4
  %6 = add nuw nsw <32 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <32 x i32> %7 to <32 x i16>
  store <32 x i16> %8, <32 x i16>* undef, align 4
  ret void
}

define void @avg_v4i8_const(<4 x i8>* %a) {
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-LABEL: avg_v4i8_const:
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
; SSE2: # BB#0:
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
|
|
|
|
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
|
|
|
|
; SSE2-NEXT: movd %xmm0, (%rax)
|
|
|
|
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v4i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX2-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: vmovd %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v4i8_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512F-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512F-NEXT: vmovd %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v4i8_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512BW-NEXT: vmovd %xmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <4 x i8>, <4 x i8>* %a
%2 = zext <4 x i8> %1 to <4 x i32>
%3 = add nuw nsw <4 x i32> %2, <i32 1, i32 2, i32 3, i32 4>
%4 = lshr <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
%5 = trunc <4 x i32> %4 to <4 x i8>
store <4 x i8> %5, <4 x i8>* undef, align 4
ret void
}

define void @avg_v8i8_const(<8 x i8>* %a) {
; SSE2-LABEL: avg_v8i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
; SSE2-NEXT: movq %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v8i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: vmovq %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v8i8_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512F-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512F-NEXT: vmovq %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v8i8_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512BW-NEXT: vmovq %xmm0, (%rax)
; AVX512BW-NEXT: retq
%1 = load <8 x i8>, <8 x i8>* %a
%2 = zext <8 x i8> %1 to <8 x i32>
%3 = add nuw nsw <8 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
%4 = lshr <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%5 = trunc <8 x i32> %4 to <8 x i8>
store <8 x i8> %5, <8 x i8>* undef, align 4
ret void
}

define void @avg_v16i8_const(<16 x i8>* %a) {
; SSE2-LABEL: avg_v16i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v16i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %xmm0
; AVX2-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: vmovdqu %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v16i8_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %xmm0
; AVX512F-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512F-NEXT: vmovdqu %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v16i8_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %xmm0
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX512BW-NEXT: vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <16 x i8>, <16 x i8>* %a
|
|
|
|
%2 = zext <16 x i8> %1 to <16 x i32>
|
|
|
|
%3 = add nuw nsw <16 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
|
|
|
|
%4 = lshr <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%5 = trunc <16 x i32> %4 to <16 x i8>
|
|
|
|
store <16 x i8> %5, <16 x i8>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
|
|
|
define void @avg_v32i8_const(<32 x i8>* %a) {
; SSE2-LABEL: avg_v32i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm4
; SSE2-NEXT: movdqa 16(%rdi), %xmm2
; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm2[2,3,0,1]
; SSE2-NEXT: pxor %xmm1, %xmm1
; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; SSE2-NEXT: movdqa %xmm0, %xmm8
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm1[0],xmm8[1],xmm1[1],xmm8[2],xmm1[2],xmm8[3],xmm1[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
; SSE2-NEXT: movdqa %xmm2, %xmm3
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
; SSE2-NEXT: pshufd {{.*#+}} xmm6 = xmm4[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm1[0],xmm6[1],xmm1[1],xmm6[2],xmm1[2],xmm6[3],xmm1[3],xmm6[4],xmm1[4],xmm6[5],xmm1[5],xmm6[6],xmm1[6],xmm6[7],xmm1[7]
; SSE2-NEXT: movdqa %xmm6, %xmm7
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm1[0],xmm7[1],xmm1[1],xmm7[2],xmm1[2],xmm7[3],xmm1[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm1[4],xmm6[5],xmm1[5],xmm6[6],xmm1[6],xmm6[7],xmm1[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm1[0],xmm4[1],xmm1[1],xmm4[2],xmm1[2],xmm4[3],xmm1[3],xmm4[4],xmm1[4],xmm4[5],xmm1[5],xmm4[6],xmm1[6],xmm4[7],xmm1[7]
; SSE2-NEXT: movdqa %xmm4, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm1[4],xmm4[5],xmm1[5],xmm4[6],xmm1[6],xmm4[7],xmm1[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm9 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm9, %xmm4
; SSE2-NEXT: movdqa {{.*#+}} xmm1 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm1, %xmm5
; SSE2-NEXT: paddd %xmm9, %xmm6
; SSE2-NEXT: paddd %xmm1, %xmm7
; SSE2-NEXT: paddd %xmm9, %xmm2
; SSE2-NEXT: paddd %xmm1, %xmm3
; SSE2-NEXT: paddd %xmm9, %xmm0
; SSE2-NEXT: paddd %xmm1, %xmm8
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: movdqa {{.*#+}} xmm1 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm1, %xmm4
; SSE2-NEXT: pand %xmm1, %xmm5
; SSE2-NEXT: packuswb %xmm4, %xmm5
; SSE2-NEXT: pand %xmm1, %xmm6
; SSE2-NEXT: pand %xmm1, %xmm7
; SSE2-NEXT: packuswb %xmm6, %xmm7
; SSE2-NEXT: packuswb %xmm7, %xmm5
; SSE2-NEXT: pand %xmm1, %xmm2
; SSE2-NEXT: pand %xmm1, %xmm3
; SSE2-NEXT: packuswb %xmm2, %xmm3
; SSE2-NEXT: pand %xmm1, %xmm0
; SSE2-NEXT: pand %xmm1, %xmm8
; SSE2-NEXT: packuswb %xmm0, %xmm8
; SSE2-NEXT: packuswb %xmm8, %xmm3
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm5, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v32i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgb {{.*}}(%rip), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
;
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-LABEL: avg_v32i8_const:
|
|
|
|
; AVX512F: # BB#0:
|
|
|
|
; AVX512F-NEXT: vmovdqa (%rdi), %ymm0
|
|
|
|
; AVX512F-NEXT: vpavgb {{.*}}(%rip), %ymm0, %ymm0
|
|
|
|
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX512F-NEXT: retq
|
|
|
|
;
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-LABEL: avg_v32i8_const:
|
|
|
|
; AVX512BW: # BB#0:
|
|
|
|
; AVX512BW-NEXT: vmovdqa (%rdi), %ymm0
|
|
|
|
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %ymm0, %ymm0
|
|
|
|
; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX512BW-NEXT: retq
%1 = load <32 x i8>, <32 x i8>* %a
%2 = zext <32 x i8> %1 to <32 x i32>
%3 = add nuw nsw <32 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
%4 = lshr <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%5 = trunc <32 x i32> %4 to <32 x i8>
store <32 x i8> %5, <32 x i8>* undef, align 4
ret void
}

define void @avg_v64i8_const(<64 x i8>* %a) {
; SSE2-LABEL: avg_v64i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm7
; SSE2-NEXT: movdqa 16(%rdi), %xmm1
; SSE2-NEXT: movdqa 32(%rdi), %xmm14
; SSE2-NEXT: movdqa 48(%rdi), %xmm11
; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm11[2,3,0,1]
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm9
; SSE2-NEXT: punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3],xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm11, %xmm12
; SSE2-NEXT: punpcklwd {{.*#+}} xmm12 = xmm12[0],xmm0[0],xmm12[1],xmm0[1],xmm12[2],xmm0[2],xmm12[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: pshufd {{.*#+}} xmm13 = xmm14[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3],xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm13, %xmm10
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3],xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm14, %xmm15
; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm1[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm6
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm8
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm7[2,3,0,1]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3],xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm4, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3],xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm7, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm0, %xmm7
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: paddd %xmm0, %xmm1
; SSE2-NEXT: paddd %xmm0, %xmm3
; SSE2-NEXT: paddd %xmm0, %xmm14
; SSE2-NEXT: paddd %xmm0, %xmm13
; SSE2-NEXT: paddd %xmm0, %xmm11
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: movdqa %xmm9, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm0, %xmm2
; SSE2-NEXT: paddd %xmm0, %xmm5
; SSE2-NEXT: paddd %xmm0, %xmm8
; SSE2-NEXT: paddd %xmm0, %xmm6
; SSE2-NEXT: paddd %xmm0, %xmm15
; SSE2-NEXT: paddd %xmm0, %xmm10
; SSE2-NEXT: paddd %xmm0, %xmm12
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Reload
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: movdqa %xmm9, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm0, %xmm7
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm7, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: pand %xmm0, %xmm4
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: packuswb %xmm4, %xmm5
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: packuswb %xmm1, %xmm8
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: packuswb %xmm3, %xmm6
; SSE2-NEXT: packuswb %xmm6, %xmm8
; SSE2-NEXT: psrld $1, %xmm15
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm15
; SSE2-NEXT: packuswb %xmm14, %xmm15
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: packuswb %xmm13, %xmm10
; SSE2-NEXT: packuswb %xmm10, %xmm15
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: packuswb %xmm11, %xmm12
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: packuswb %xmm3, %xmm1
; SSE2-NEXT: packuswb %xmm1, %xmm12
; SSE2-NEXT: movdqu %xmm12, (%rax)
; SSE2-NEXT: movdqu %xmm15, (%rax)
; SSE2-NEXT: movdqu %xmm8, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v64i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vmovdqa {{.*#+}} ymm8 = [1,2,3,4,5,6,7,8]
; AVX2-NEXT: vpaddd %ymm8, %ymm7, %ymm7
; AVX2-NEXT: vpaddd %ymm8, %ymm6, %ymm6
; AVX2-NEXT: vpaddd %ymm8, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm8, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm8, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm8, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm8, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm8
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm9
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
; AVX2-NEXT: vpsrld $1, %ymm6, %ymm6
; AVX2-NEXT: vpsrld $1, %ymm7, %ymm3
; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT: vpshufb %ymm2, %ymm3, %ymm3
; AVX2-NEXT: vpermq {{.*#+}} ymm7 = ymm3[0,2,2,3]
; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX2-NEXT: vpshufb %xmm3, %xmm7, %xmm7
; AVX2-NEXT: vpshufb %ymm2, %ymm6, %ymm6
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm6 = xmm6[0],xmm7[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm5, %ymm5
; AVX2-NEXT: vpermq {{.*#+}} ymm5 = ymm5[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm5, %xmm5
; AVX2-NEXT: vpshufb %ymm2, %ymm4, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm5[0]
; AVX2-NEXT: vinserti128 $1, %xmm6, %ymm4, %ymm4
; AVX2-NEXT: vpshufb %ymm2, %ymm9, %ymm5
; AVX2-NEXT: vpermq {{.*#+}} ymm5 = ymm5[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm5, %xmm5
; AVX2-NEXT: vpshufb %ymm2, %ymm8, %ymm6
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm5 = xmm6[0],xmm5[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm1, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm0, %xmm0
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vmovdqu %ymm4, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vmovdqa32 {{.*#+}} zmm4 = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
; AVX512F-NEXT: vpaddd %zmm4, %zmm3, %zmm3
; AVX512F-NEXT: vpaddd %zmm4, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm4, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
; AVX512F-NEXT: vpmovdb %zmm3, %xmm3
; AVX512F-NEXT: vpmovdb %zmm2, %xmm2
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm3, %ymm2
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: vmovdqu %ymm2, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqu8 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu8 %zmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <64 x i8>, <64 x i8>* %a
  %2 = zext <64 x i8> %1 to <64 x i32>
  %3 = add nuw nsw <64 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <64 x i32> %4 to <64 x i8>
  store <64 x i8> %5, <64 x i8>* undef, align 4
  ret void
}

define void @avg_v4i16_const(<4 x i16>* %a) {
; SSE2-LABEL: avg_v4i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: pavgw {{.*}}(%rip), %xmm0
; SSE2-NEXT: movq %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v4i16_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX2-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: vmovq %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v4i16_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512F-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX512F-NEXT: vmovq %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v4i16_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX512BW-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX512BW-NEXT: vmovq %xmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <4 x i16>, <4 x i16>* %a
  %2 = zext <4 x i16> %1 to <4 x i32>
  %3 = add nuw nsw <4 x i32> %2, <i32 1, i32 2, i32 3, i32 4>
  %4 = lshr <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <4 x i32> %4 to <4 x i16>
  store <4 x i16> %5, <4 x i16>* undef, align 4
  ret void
}

define void @avg_v8i16_const(<8 x i16>* %a) {
; SSE2-LABEL: avg_v8i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgw {{.*}}(%rip), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v8i16_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %xmm0
; AVX2-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX2-NEXT: vmovdqu %xmm0, (%rax)
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v8i16_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %xmm0
; AVX512F-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX512F-NEXT: vmovdqu %xmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v8i16_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %xmm0
; AVX512BW-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX512BW-NEXT: vmovdqu %xmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <8 x i16>, <8 x i16>* %a
  %2 = zext <8 x i16> %1 to <8 x i32>
  %3 = add nuw nsw <8 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <8 x i32> %4 to <8 x i16>
  store <8 x i16> %5, <8 x i16>* undef, align 4
  ret void
}

define void @avg_v16i16_const(<16 x i16>* %a) {
; SSE2-LABEL: avg_v16i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm3
; SSE2-NEXT: movdqa 16(%rdi), %xmm0
; SSE2-NEXT: pxor %xmm4, %xmm4
; SSE2-NEXT: movdqa %xmm0, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm4, %xmm3
; SSE2-NEXT: movdqa {{.*#+}} xmm5 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm5, %xmm2
; SSE2-NEXT: paddd %xmm4, %xmm0
; SSE2-NEXT: paddd %xmm5, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm3, %xmm2
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm0, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v16i16_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgw {{.*}}(%rip), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v16i16_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vmovdqa (%rdi), %ymm0
; AVX512F-NEXT: vpavgw {{.*}}(%rip), %ymm0, %ymm0
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v16i16_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa (%rdi), %ymm0
; AVX512BW-NEXT: vpavgw {{.*}}(%rip), %ymm0, %ymm0
; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <16 x i16>, <16 x i16>* %a
  %2 = zext <16 x i16> %1 to <16 x i32>
  %3 = add nuw nsw <16 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <16 x i32> %4 to <16 x i16>
  store <16 x i16> %5, <16 x i16>* undef, align 4
  ret void
}

define void @avg_v32i16_const(<32 x i16>* %a) {
; SSE2-LABEL: avg_v32i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm7
; SSE2-NEXT: movdqa 16(%rdi), %xmm6
; SSE2-NEXT: movdqa 32(%rdi), %xmm4
; SSE2-NEXT: movdqa 48(%rdi), %xmm0
; SSE2-NEXT: pxor %xmm8, %xmm8
; SSE2-NEXT: movdqa %xmm0, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm8[0],xmm1[1],xmm8[1],xmm1[2],xmm8[2],xmm1[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm8[4],xmm0[5],xmm8[5],xmm0[6],xmm8[6],xmm0[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm4, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm8[0],xmm2[1],xmm8[1],xmm2[2],xmm8[2],xmm2[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm8[4],xmm4[5],xmm8[5],xmm4[6],xmm8[6],xmm4[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm6, %xmm3
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm8[0],xmm3[1],xmm8[1],xmm3[2],xmm8[2],xmm3[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm8[4],xmm6[5],xmm8[5],xmm6[6],xmm8[6],xmm6[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm7, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm8[0],xmm5[1],xmm8[1],xmm5[2],xmm8[2],xmm5[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm8 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm8, %xmm7
; SSE2-NEXT: movdqa {{.*#+}} xmm9 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm9, %xmm5
; SSE2-NEXT: paddd %xmm8, %xmm6
; SSE2-NEXT: paddd %xmm9, %xmm3
; SSE2-NEXT: paddd %xmm8, %xmm4
; SSE2-NEXT: paddd %xmm9, %xmm2
; SSE2-NEXT: paddd %xmm8, %xmm0
; SSE2-NEXT: paddd %xmm9, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: pslld $16, %xmm7
; SSE2-NEXT: psrad $16, %xmm7
; SSE2-NEXT: pslld $16, %xmm5
; SSE2-NEXT: psrad $16, %xmm5
; SSE2-NEXT: packssdw %xmm7, %xmm5
; SSE2-NEXT: pslld $16, %xmm6
; SSE2-NEXT: psrad $16, %xmm6
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: packssdw %xmm6, %xmm3
; SSE2-NEXT: pslld $16, %xmm4
; SSE2-NEXT: psrad $16, %xmm4
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm4, %xmm2
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm0, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm5, (%rax)
; SSE2-NEXT: retq
;
; AVX2-LABEL: avg_v32i16_const:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [1,2,3,4,5,6,7,8]
; AVX2-NEXT: vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128,0,1,4,5,8,9,12,13,128,128,128,128,128,128,128,128]
; AVX2-NEXT: vpshufb %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpermq {{.*#+}} ymm3 = ymm3[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm3, %ymm2
; AVX2-NEXT: vpshufb %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vmovdqu %ymm2, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i16_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vmovdqa32 {{.*#+}} zmm2 = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
; AVX512F-NEXT: vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpmovdw %zmm1, (%rax)
; AVX512F-NEXT: vpmovdw %zmm0, (%rax)
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i16_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqu16 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgw {{.*}}(%rip), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu16 %zmm0, (%rax)
; AVX512BW-NEXT: retq
  %1 = load <32 x i16>, <32 x i16>* %a
  %2 = zext <32 x i16> %1 to <32 x i32>
  %3 = add nuw nsw <32 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <32 x i32> %4 to <32 x i16>
  store <32 x i16> %5, <32 x i16>* undef, align 4
  ret void
}