; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=AVX --check-prefix=AVX1
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX --check-prefix=AVX2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=AVX --check-prefix=AVX512 --check-prefix=AVX512F
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX --check-prefix=AVX512 --check-prefix=AVX512BW
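
; These tests exercise the unsigned average pattern c = (a + b + 1) / 2:
; i8 vectors are zero-extended to i32, incremented, added, shifted right by
; one, and truncated back, and the backend is expected to fold the whole
; sequence into the byte average instruction (pavgb via X86ISD::AVG) when it
; recognizes the pattern before type legalization.
; Example: a = 7, b = 4 gives (7 + 4 + 1) >> 1 = 6, the round-half-up result
; that pavgb produces.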
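
; 4-byte case: both inputs are loaded with movd/vmovd and a single
; pavgb/vpavgb produces the rounded average of the low four bytes.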
define void @avg_v4i8(<4 x i8>* %a, <4 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v4i8:
; SSE2:       # BB#0:
; SSE2-NEXT:    movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT:    movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE2-NEXT:    pavgb %xmm0, %xmm1
; SSE2-NEXT:    movd %xmm1, (%rax)
; SSE2-NEXT:    retq
;
; AVX-LABEL: avg_v4i8:
; AVX:       # BB#0:
; AVX-NEXT:    vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT:    vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX-NEXT:    vpavgb %xmm0, %xmm1, %xmm0
; AVX-NEXT:    vmovd %xmm0, (%rax)
; AVX-NEXT:    retq
  %1 = load <4 x i8>, <4 x i8>* %a
  %2 = load <4 x i8>, <4 x i8>* %b
  %3 = zext <4 x i8> %1 to <4 x i32>
  %4 = zext <4 x i8> %2 to <4 x i32>
  %5 = add nuw nsw <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <4 x i32> %5, %4
  %7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <4 x i32> %7 to <4 x i8>
  store <4 x i8> %8, <4 x i8>* undef, align 4
  ret void
}
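
; 8-byte case: identical to the 4-byte case except that the inputs are
; loaded with movq/vmovq.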
define void @avg_v8i8(<8 x i8>* %a, <8 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v8i8:
; SSE2:       # BB#0:
; SSE2-NEXT:    movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT:    movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT:    pavgb %xmm0, %xmm1
; SSE2-NEXT:    movq %xmm1, (%rax)
; SSE2-NEXT:    retq
;
; AVX-LABEL: avg_v8i8:
; AVX:       # BB#0:
; AVX-NEXT:    vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT:    vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX-NEXT:    vpavgb %xmm0, %xmm1, %xmm0
; AVX-NEXT:    vmovq %xmm0, (%rax)
; AVX-NEXT:    retq
  %1 = load <8 x i8>, <8 x i8>* %a
  %2 = load <8 x i8>, <8 x i8>* %b
  %3 = zext <8 x i8> %1 to <8 x i32>
  %4 = zext <8 x i8> %2 to <8 x i32>
  %5 = add nuw nsw <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <8 x i32> %5, %4
  %7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <8 x i32> %7 to <8 x i8>
  store <8 x i8> %8, <8 x i8>* undef, align 4
  ret void
}
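
; 16-byte case: a full XMM register, so the average folds to a single
; pavgb/vpavgb with a memory operand followed by an unaligned store.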
define void @avg_v16i8(<16 x i8>* %a, <16 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v16i8:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rsi), %xmm0
; SSE2-NEXT:    pavgb (%rdi), %xmm0
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
; AVX-LABEL: avg_v16i8:
; AVX:       # BB#0:
; AVX-NEXT:    vmovdqa (%rsi), %xmm0
; AVX-NEXT:    vpavgb (%rdi), %xmm0, %xmm0
; AVX-NEXT:    vmovdqu %xmm0, (%rax)
; AVX-NEXT:    retq
  %1 = load <16 x i8>, <16 x i8>* %a
  %2 = load <16 x i8>, <16 x i8>* %b
  %3 = zext <16 x i8> %1 to <16 x i32>
  %4 = zext <16 x i8> %2 to <16 x i32>
  %5 = add nuw nsw <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <16 x i32> %5, %4
  %7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <16 x i32> %7 to <16 x i8>
  store <16 x i8> %8, <16 x i8>* undef, align 4
  ret void
}
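
; 32-byte case: with SSE2 the average is not matched to PAVGB in these
; checks; it is expanded with widening unpacks, paddd, a psubd of the
; pcmpeqd all-ones idiom (adding 1 by subtracting -1), psrld, and packuswb.
; AVX1 does the same on 128-bit halves and recombines with vinsertf128,
; while AVX2 and AVX512 use a single vpavgb on a YMM register.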
define void @avg_v32i8(<32 x i8>* %a, <32 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v32i8:
; SSE2:       # BB#0:
; SSE2-NEXT:    movdqa (%rdi), %xmm3
; SSE2-NEXT:    movdqa 16(%rdi), %xmm8
; SSE2-NEXT:    movdqa (%rsi), %xmm0
; SSE2-NEXT:    movdqa 16(%rsi), %xmm1
; SSE2-NEXT:    pxor %xmm4, %xmm4
; SSE2-NEXT:    movdqa %xmm3, %xmm5
; SSE2-NEXT:    punpckhbw {{.*#+}} xmm5 = xmm5[8],xmm4[8],xmm5[9],xmm4[9],xmm5[10],xmm4[10],xmm5[11],xmm4[11],xmm5[12],xmm4[12],xmm5[13],xmm4[13],xmm5[14],xmm4[14],xmm5[15],xmm4[15]
; SSE2-NEXT:    movdqa %xmm5, %xmm6
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm3, %xmm12
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm4[4],xmm12[5],xmm4[5],xmm12[6],xmm4[6],xmm12[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT:    movdqa %xmm8, %xmm7
; SSE2-NEXT:    punpckhbw {{.*#+}} xmm7 = xmm7[8],xmm4[8],xmm7[9],xmm4[9],xmm7[10],xmm4[10],xmm7[11],xmm4[11],xmm7[12],xmm4[12],xmm7[13],xmm4[13],xmm7[14],xmm4[14],xmm7[15],xmm4[15]
; SSE2-NEXT:    movdqa %xmm7, %xmm11
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm4[4],xmm11[5],xmm4[5],xmm11[6],xmm4[6],xmm11[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm4[0],xmm7[1],xmm4[1],xmm7[2],xmm4[2],xmm7[3],xmm4[3]
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3],xmm8[4],xmm4[4],xmm8[5],xmm4[5],xmm8[6],xmm4[6],xmm8[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm8, %xmm10
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm4[4],xmm10[5],xmm4[5],xmm10[6],xmm4[6],xmm10[7],xmm4[7]
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3]
; SSE2-NEXT:    movdqa %xmm0, %xmm2
; SSE2-NEXT:    punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm4[8],xmm2[9],xmm4[9],xmm2[10],xmm4[10],xmm2[11],xmm4[11],xmm2[12],xmm4[12],xmm2[13],xmm4[13],xmm2[14],xmm4[14],xmm2[15],xmm4[15]
; SSE2-NEXT:    movdqa %xmm2, %xmm9
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm4[4],xmm9[5],xmm4[5],xmm9[6],xmm4[6],xmm9[7],xmm4[7]
; SSE2-NEXT:    paddd %xmm6, %xmm9
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT:    paddd %xmm5, %xmm2
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3],xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm0, %xmm5
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
; SSE2-NEXT:    paddd %xmm12, %xmm5
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
; SSE2-NEXT:    paddd %xmm3, %xmm0
; SSE2-NEXT:    movdqa %xmm1, %xmm3
; SSE2-NEXT:    punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm4[8],xmm3[9],xmm4[9],xmm3[10],xmm4[10],xmm3[11],xmm4[11],xmm3[12],xmm4[12],xmm3[13],xmm4[13],xmm3[14],xmm4[14],xmm3[15],xmm4[15]
; SSE2-NEXT:    movdqa %xmm3, %xmm6
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT:    paddd %xmm11, %xmm6
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT:    paddd %xmm7, %xmm3
; SSE2-NEXT:    punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3],xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
; SSE2-NEXT:    movdqa %xmm1, %xmm7
; SSE2-NEXT:    punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; SSE2-NEXT:    paddd %xmm10, %xmm7
; SSE2-NEXT:    punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT:    paddd %xmm8, %xmm1
; SSE2-NEXT:    pcmpeqd %xmm4, %xmm4
; SSE2-NEXT:    psubd %xmm4, %xmm9
; SSE2-NEXT:    psubd %xmm4, %xmm2
; SSE2-NEXT:    psubd %xmm4, %xmm5
; SSE2-NEXT:    psubd %xmm4, %xmm0
; SSE2-NEXT:    psubd %xmm4, %xmm6
; SSE2-NEXT:    psubd %xmm4, %xmm3
; SSE2-NEXT:    psubd %xmm4, %xmm7
; SSE2-NEXT:    psubd %xmm4, %xmm1
; SSE2-NEXT:    psrld $1, %xmm1
; SSE2-NEXT:    psrld $1, %xmm7
; SSE2-NEXT:    psrld $1, %xmm3
; SSE2-NEXT:    psrld $1, %xmm6
; SSE2-NEXT:    psrld $1, %xmm0
; SSE2-NEXT:    psrld $1, %xmm5
; SSE2-NEXT:    psrld $1, %xmm2
; SSE2-NEXT:    psrld $1, %xmm9
; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT:    pand %xmm4, %xmm9
; SSE2-NEXT:    pand %xmm4, %xmm2
; SSE2-NEXT:    packuswb %xmm9, %xmm2
; SSE2-NEXT:    pand %xmm4, %xmm5
; SSE2-NEXT:    pand %xmm4, %xmm0
; SSE2-NEXT:    packuswb %xmm5, %xmm0
; SSE2-NEXT:    packuswb %xmm2, %xmm0
; SSE2-NEXT:    pand %xmm4, %xmm6
; SSE2-NEXT:    pand %xmm4, %xmm3
; SSE2-NEXT:    packuswb %xmm6, %xmm3
; SSE2-NEXT:    pand %xmm4, %xmm7
; SSE2-NEXT:    pand %xmm4, %xmm1
; SSE2-NEXT:    packuswb %xmm7, %xmm1
; SSE2-NEXT:    packuswb %xmm3, %xmm1
; SSE2-NEXT:    movdqu %xmm1, (%rax)
; SSE2-NEXT:    movdqu %xmm0, (%rax)
; SSE2-NEXT:    retq
;
; AVX1-LABEL: avg_v32i8:
; AVX1:       # BB#0:
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm0, %xmm9
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm1, %xmm1
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm2, %xmm2
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm3, %xmm3
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm4, %xmm4
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm5, %xmm5
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm6, %xmm6
; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT:    vpaddd %xmm7, %xmm8, %xmm7
; AVX1-NEXT:    vpcmpeqd %xmm0, %xmm0, %xmm0
; AVX1-NEXT:    vpsubd %xmm0, %xmm9, %xmm8
; AVX1-NEXT:    vpsubd %xmm0, %xmm1, %xmm1
; AVX1-NEXT:    vpsubd %xmm0, %xmm2, %xmm2
; AVX1-NEXT:    vpsubd %xmm0, %xmm3, %xmm3
; AVX1-NEXT:    vpsubd %xmm0, %xmm4, %xmm4
; AVX1-NEXT:    vpsubd %xmm0, %xmm5, %xmm5
; AVX1-NEXT:    vpsubd %xmm0, %xmm6, %xmm6
; AVX1-NEXT:    vpsubd %xmm0, %xmm7, %xmm0
; AVX1-NEXT:    vpsrld $1, %xmm0, %xmm9
; AVX1-NEXT:    vpsrld $1, %xmm6, %xmm6
; AVX1-NEXT:    vpsrld $1, %xmm5, %xmm5
; AVX1-NEXT:    vpsrld $1, %xmm4, %xmm4
; AVX1-NEXT:    vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT:    vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT:    vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT:    vpsrld $1, %xmm8, %xmm7
; AVX1-NEXT:    vmovdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; AVX1-NEXT:    vpand %xmm0, %xmm7, %xmm7
; AVX1-NEXT:    vpand %xmm0, %xmm1, %xmm1
; AVX1-NEXT:    vpackuswb %xmm7, %xmm1, %xmm1
; AVX1-NEXT:    vpand %xmm0, %xmm2, %xmm2
; AVX1-NEXT:    vpand %xmm0, %xmm3, %xmm3
; AVX1-NEXT:    vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT:    vpackuswb %xmm1, %xmm2, %xmm1
; AVX1-NEXT:    vpand %xmm0, %xmm4, %xmm2
; AVX1-NEXT:    vpand %xmm0, %xmm5, %xmm3
; AVX1-NEXT:    vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT:    vpand %xmm0, %xmm6, %xmm3
; AVX1-NEXT:    vpand %xmm0, %xmm9, %xmm0
; AVX1-NEXT:    vpackuswb %xmm3, %xmm0, %xmm0
; AVX1-NEXT:    vpackuswb %xmm2, %xmm0, %xmm0
; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX1-NEXT:    vmovups %ymm0, (%rax)
; AVX1-NEXT:    vzeroupper
; AVX1-NEXT:    retq
;
; AVX2-LABEL: avg_v32i8:
; AVX2:       # BB#0:
; AVX2-NEXT:    vmovdqa (%rsi), %ymm0
; AVX2-NEXT:    vpavgb (%rdi), %ymm0, %ymm0
; AVX2-NEXT:    vmovdqu %ymm0, (%rax)
; AVX2-NEXT:    vzeroupper
; AVX2-NEXT:    retq
;
; AVX512-LABEL: avg_v32i8:
; AVX512:       # BB#0:
; AVX512-NEXT:    vmovdqa (%rsi), %ymm0
; AVX512-NEXT:    vpavgb (%rdi), %ymm0, %ymm0
; AVX512-NEXT:    vmovdqu %ymm0, (%rax)
; AVX512-NEXT:    vzeroupper
; AVX512-NEXT:    retq
  %1 = load <32 x i8>, <32 x i8>* %a
  %2 = load <32 x i8>, <32 x i8>* %b
  %3 = zext <32 x i8> %1 to <32 x i32>
  %4 = zext <32 x i8> %2 to <32 x i32>
  %5 = add nuw nsw <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <32 x i32> %5, %4
  %7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <32 x i32> %7 to <32 x i8>
  store <32 x i8> %8, <32 x i8>* undef, align 4
  ret void
}
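
; 64-byte case: the same SSE2 expansion at four times the width, which is
; register-hungry enough to need a 16-byte stack spill.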
define void @avg_v64i8(<64 x i8>* %a, <64 x i8>* %b) nounwind {
|
2016-08-26 01:17:46 +08:00
|
|
|
; SSE2-LABEL: avg_v64i8:
|
|
|
|
; SSE2: # BB#0:
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; SSE2-NEXT: movdqa (%rdi), %xmm6
|
|
|
|
; SSE2-NEXT: movdqa 16(%rdi), %xmm2
|
|
|
|
; SSE2-NEXT: movdqa 32(%rdi), %xmm1
|
|
|
|
; SSE2-NEXT: movdqa 48(%rdi), %xmm0
|
|
|
|
; SSE2-NEXT: movdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; SSE2-NEXT: movdqa (%rsi), %xmm5
|
|
|
|
; SSE2-NEXT: movdqa 16(%rsi), %xmm13
|
|
|
|
; SSE2-NEXT: movdqa 32(%rsi), %xmm11
|
2016-08-26 01:17:46 +08:00
|
|
|
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm6, %xmm4
; SSE2-NEXT: punpckhbw {{.*#+}} xmm4 = xmm4[8],xmm0[8],xmm4[9],xmm0[9],xmm4[10],xmm0[10],xmm4[11],xmm0[11],xmm4[12],xmm0[12],xmm4[13],xmm0[13],xmm4[14],xmm0[14],xmm4[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm4, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3],xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm6, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm2, %xmm15
; SSE2-NEXT: punpckhbw {{.*#+}} xmm15 = xmm15[8],xmm0[8],xmm15[9],xmm0[9],xmm15[10],xmm0[10],xmm15[11],xmm0[11],xmm15[12],xmm0[12],xmm15[13],xmm0[13],xmm15[14],xmm0[14],xmm15[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm15, %xmm14
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm15 = xmm15[0],xmm0[0],xmm15[1],xmm0[1],xmm15[2],xmm0[2],xmm15[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm0[4],xmm8[5],xmm0[5],xmm8[6],xmm0[6],xmm8[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm5, %xmm10
; SSE2-NEXT: punpckhbw {{.*#+}} xmm10 = xmm10[8],xmm0[8],xmm10[9],xmm0[9],xmm10[10],xmm0[10],xmm10[11],xmm0[11],xmm10[12],xmm0[12],xmm10[13],xmm0[13],xmm10[14],xmm0[14],xmm10[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm10, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: paddd %xmm7, %xmm3
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa %xmm1, %xmm7
; SSE2-NEXT: punpckhbw {{.*#+}} xmm7 = xmm7[8],xmm0[8],xmm7[9],xmm0[9],xmm7[10],xmm0[10],xmm7[11],xmm0[11],xmm7[12],xmm0[12],xmm7[13],xmm0[13],xmm7[14],xmm0[14],xmm7[15],xmm0[15]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: paddd %xmm4, %xmm10
; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3],xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm5, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: paddd %xmm12, %xmm3
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: paddd %xmm6, %xmm5
; SSE2-NEXT: movdqa %xmm5, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa %xmm13, %xmm4
; SSE2-NEXT: punpckhbw {{.*#+}} xmm4 = xmm4[8],xmm0[8],xmm4[9],xmm0[9],xmm4[10],xmm0[10],xmm4[11],xmm0[11],xmm4[12],xmm0[12],xmm4[13],xmm0[13],xmm4[14],xmm0[14],xmm4[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm4, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: paddd %xmm14, %xmm12
; SSE2-NEXT: movdqa %xmm7, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: paddd %xmm15, %xmm4
; SSE2-NEXT: punpcklbw {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3],xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm13, %xmm15
; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: paddd %xmm8, %xmm15
; SSE2-NEXT: punpcklwd {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3]
; SSE2-NEXT: paddd %xmm2, %xmm13
; SSE2-NEXT: movdqa %xmm11, %xmm6
; SSE2-NEXT: punpckhbw {{.*#+}} xmm6 = xmm6[8],xmm0[8],xmm6[9],xmm0[9],xmm6[10],xmm0[10],xmm6[11],xmm0[11],xmm6[12],xmm0[12],xmm6[13],xmm0[13],xmm6[14],xmm0[14],xmm6[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm6, %xmm9
; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm0[4],xmm9[5],xmm0[5],xmm9[6],xmm0[6],xmm9[7],xmm0[7]
; SSE2-NEXT: paddd %xmm5, %xmm9
; SSE2-NEXT: movdqa %xmm1, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: paddd %xmm7, %xmm6
; SSE2-NEXT: punpcklbw {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3],xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm11, %xmm14
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: paddd %xmm2, %xmm14
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
; SSE2-NEXT: movdqa %xmm5, %xmm2
; SSE2-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm0[8],xmm2[9],xmm0[9],xmm2[10],xmm0[10],xmm2[11],xmm0[11],xmm2[12],xmm0[12],xmm2[13],xmm0[13],xmm2[14],xmm0[14],xmm2[15],xmm0[15]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3]
; SSE2-NEXT: paddd %xmm1, %xmm11
; SSE2-NEXT: movdqa %xmm2, %xmm1
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa 48(%rsi), %xmm7
; SSE2-NEXT: movdqa %xmm7, %xmm3
; SSE2-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm0[8],xmm3[9],xmm0[9],xmm3[10],xmm0[10],xmm3[11],xmm0[11],xmm3[12],xmm0[12],xmm3[13],xmm0[13],xmm3[14],xmm0[14],xmm3[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm3, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm0[4],xmm8[5],xmm0[5],xmm8[6],xmm0[6],xmm8[7],xmm0[7]
; SSE2-NEXT: paddd %xmm1, %xmm8
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: paddd %xmm2, %xmm3
; SSE2-NEXT: movdqa %xmm5, %xmm2
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm1
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3],xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm7, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: paddd %xmm1, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: paddd %xmm2, %xmm7
; SSE2-NEXT: pcmpeqd %xmm0, %xmm0
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
; SSE2-NEXT: psubd %xmm0, %xmm1
; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: psubd %xmm0, %xmm10
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
; SSE2-NEXT: psubd %xmm0, %xmm1
; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
; SSE2-NEXT: psubd %xmm0, %xmm2
; SSE2-NEXT: psubd %xmm0, %xmm12
; SSE2-NEXT: psubd %xmm0, %xmm4
; SSE2-NEXT: psubd %xmm0, %xmm15
; SSE2-NEXT: psubd %xmm0, %xmm13
; SSE2-NEXT: psubd %xmm0, %xmm9
; SSE2-NEXT: psubd %xmm0, %xmm6
; SSE2-NEXT: psubd %xmm0, %xmm14
; SSE2-NEXT: psubd %xmm0, %xmm11
; SSE2-NEXT: psubd %xmm0, %xmm8
; SSE2-NEXT: psubd %xmm0, %xmm3
; SSE2-NEXT: psubd %xmm0, %xmm5
; SSE2-NEXT: psubd %xmm0, %xmm7
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: packuswb %xmm1, %xmm10
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm1 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm1, %xmm2
; SSE2-NEXT: packuswb %xmm10, %xmm2
; SSE2-NEXT: movdqa %xmm2, %xmm1
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: pand %xmm0, %xmm4
; SSE2-NEXT: packuswb %xmm12, %xmm4
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: psrld $1, %xmm15
; SSE2-NEXT: pand %xmm0, %xmm15
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: packuswb %xmm15, %xmm13
; SSE2-NEXT: packuswb %xmm4, %xmm13
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: packuswb %xmm9, %xmm6
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: packuswb %xmm14, %xmm11
; SSE2-NEXT: packuswb %xmm6, %xmm11
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: packuswb %xmm8, %xmm3
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm7
; SSE2-NEXT: packuswb %xmm5, %xmm7
; SSE2-NEXT: packuswb %xmm3, %xmm7
; SSE2-NEXT: movdqu %xmm7, (%rax)
; SSE2-NEXT: movdqu %xmm11, (%rax)
; SSE2-NEXT: movdqu %xmm13, (%rax)
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v64i8:
; AVX1: # BB#0:
; AVX1-NEXT: subq $24, %rsp
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm14 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa %xmm7, (%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm0, %xmm0
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm1, %xmm0
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm0
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm3, %xmm0
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm0
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm4, %xmm5, %xmm13
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm4, %xmm6, %xmm12
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm4, %xmm15, %xmm11
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm0, %xmm8, %xmm10
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm1, %xmm9, %xmm8
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm2, %xmm14, %xmm9
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd -{{[0-9]+}}(%rsp), %xmm3, %xmm4 # 16-byte Folded Reload
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd (%rsp), %xmm7, %xmm7 # 16-byte Folded Reload
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd -{{[0-9]+}}(%rsp), %xmm5, %xmm3 # 16-byte Folded Reload
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd -{{[0-9]+}}(%rsp), %xmm5, %xmm2 # 16-byte Folded Reload
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm6, %xmm5, %xmm1
|
|
|
|
; AVX1-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm14
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm5
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm15
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm15, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm13, %xmm13
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm12, %xmm12
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm11, %xmm11
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm10, %xmm10
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm8, %xmm8
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm9, %xmm9
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm7, %xmm7
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vmovdqa %xmm2, -{{[0-9]+}}(%rsp) # 16-byte Spill
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm1, %xmm0
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm1
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm14, %xmm14
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vmovdqa {{.*#+}} xmm5 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm14, %xmm14
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm14, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm6
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm2, %xmm6, %xmm2
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm1, %xmm2, %xmm1
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm13, %xmm2
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm6 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm6, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm11, %xmm6
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm12, %xmm7
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm7, %xmm7
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm7, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm2, %xmm6, %xmm2
|
|
|
|
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm2
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm10, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm6, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm9, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm6, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm2, %xmm4, %xmm2
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpackuswb %xmm4, %xmm3, %xmm3
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpand %xmm5, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm4, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vpackuswb %xmm3, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vmovups %ymm0, (%rax)
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vmovups %ymm1, (%rax)
|
|
|
|
; AVX1-NEXT: addq $24, %rsp
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vzeroupper
|
|
|
|
; AVX1-NEXT: retq
|
|
|
|
;
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX2-LABEL: avg_v64i8:
|
|
|
|
; AVX2: # BB#0:
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm0
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm1, %ymm1
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm2, %ymm2
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm3, %ymm3
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm4, %ymm4
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm5, %ymm5
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm6, %ymm6
|
|
|
|
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
|
|
|
|
; AVX2-NEXT: vpaddd %ymm8, %ymm7, %ymm7
|
|
|
|
; AVX2-NEXT: vpcmpeqd %ymm8, %ymm8, %ymm8
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm0, %ymm9
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm1, %ymm10
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm2, %ymm2
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm3, %ymm3
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm4, %ymm4
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm5, %ymm5
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm6, %ymm1
|
|
|
|
; AVX2-NEXT: vpsubd %ymm8, %ymm7, %ymm0
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm11
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm12
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
|
2017-06-23 22:16:50 +08:00
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm6
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm7
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm10, %ymm8
|
|
|
|
; AVX2-NEXT: vpsrld $1, %ymm9, %ymm3
|
|
|
|
; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm3, %ymm3
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm9 = ymm3[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm9, %xmm0
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm8, %ymm8
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm8 = ymm8[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm8, %xmm1
|
|
|
|
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm7, %ymm1
|
2017-06-23 22:16:50 +08:00
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm6, %ymm6
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
|
|
|
|
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm6[0],xmm1[0]
|
|
|
|
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm4, %ymm1
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm5, %ymm4
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
|
|
|
|
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm4[0],xmm1[0]
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm12, %ymm4
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
|
|
|
|
; AVX2-NEXT: vpshufb %ymm2, %ymm11, %ymm2
|
|
|
|
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
|
|
|
|
; AVX2-NEXT: vpshufb %xmm3, %xmm2, %xmm2
|
|
|
|
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
|
|
|
|
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
|
|
|
|
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX2-NEXT: vzeroupper
|
|
|
|
; AVX2-NEXT: retq
|
|
|
|
;
|
|
|
|
; AVX512F-LABEL: avg_v64i8:
|
|
|
|
; AVX512F: # BB#0:
|
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
|
|
|
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-NEXT: vpaddd %zmm4, %zmm1, %zmm1
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-NEXT: vpaddd %zmm4, %zmm2, %zmm2
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-NEXT: vpaddd %zmm4, %zmm3, %zmm3
|
2017-06-26 22:19:26 +08:00
|
|
|
; AVX512F-NEXT: vpternlogd $255, %zmm4, %zmm4, %zmm4
|
|
|
|
; AVX512F-NEXT: vpsubd %zmm4, %zmm0, %zmm0
|
|
|
|
; AVX512F-NEXT: vpsubd %zmm4, %zmm1, %zmm1
|
|
|
|
; AVX512F-NEXT: vpsubd %zmm4, %zmm2, %zmm2
|
|
|
|
; AVX512F-NEXT: vpsubd %zmm4, %zmm3, %zmm3
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
|
|
|
|
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
|
|
|
|
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
|
|
|
|
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
|
|
|
|
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
|
|
|
|
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
|
|
|
|
; AVX512F-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
|
|
|
|
; AVX512F-NEXT: vpmovdb %zmm2, %xmm1
|
|
|
|
; AVX512F-NEXT: vpmovdb %zmm3, %xmm2
|
|
|
|
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
|
|
|
|
; AVX512F-NEXT: vmovdqu %ymm1, (%rax)
|
|
|
|
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
|
2017-03-03 17:03:24 +08:00
|
|
|
; AVX512F-NEXT: vzeroupper
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX512F-NEXT: retq
|
|
|
|
;
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-LABEL: avg_v64i8:
|
|
|
|
; AVX512BW: # BB#0:
|
2017-08-01 01:35:44 +08:00
|
|
|
; AVX512BW-NEXT: vmovdqa64 (%rsi), %zmm0
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-NEXT: vpavgb (%rdi), %zmm0, %zmm0
|
2017-08-01 23:31:24 +08:00
|
|
|
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
|
2017-03-03 17:03:24 +08:00
|
|
|
; AVX512BW-NEXT: vzeroupper
|
2015-12-01 05:46:08 +08:00
|
|
|
; AVX512BW-NEXT: retq
|
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <64 x i8>, <64 x i8>* %a
|
|
|
|
%2 = load <64 x i8>, <64 x i8>* %b
|
|
|
|
%3 = zext <64 x i8> %1 to <64 x i32>
|
|
|
|
%4 = zext <64 x i8> %2 to <64 x i32>
|
|
|
|
%5 = add nuw nsw <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%6 = add nuw nsw <64 x i32> %5, %4
|
|
|
|
%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%8 = trunc <64 x i32> %7 to <64 x i8>
|
|
|
|
store <64 x i8> %8, <64 x i8>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
2017-09-12 15:50:35 +08:00
|
|
|
define void @avg_v4i16(<4 x i16>* %a, <4 x i16>* %b) nounwind {
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-LABEL: avg_v4i16:
|
2015-11-24 13:44:19 +08:00
|
|
|
; SSE2: # BB#0:
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
|
|
|
|
; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
|
|
|
|
; SSE2-NEXT: pavgw %xmm0, %xmm1
|
|
|
|
; SSE2-NEXT: movq %xmm1, (%rax)
|
|
|
|
; SSE2-NEXT: retq
|
2015-11-24 13:44:19 +08:00
|
|
|
;
; AVX-LABEL: avg_v4i16:
; AVX: # BB#0:
; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX-NEXT: vpavgw %xmm0, %xmm1, %xmm0
; AVX-NEXT: vmovq %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <4 x i16>, <4 x i16>* %a
%2 = load <4 x i16>, <4 x i16>* %b
%3 = zext <4 x i16> %1 to <4 x i32>
%4 = zext <4 x i16> %2 to <4 x i32>
%5 = add nuw nsw <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <4 x i32> %5, %4
%7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
%8 = trunc <4 x i32> %7 to <4 x i16>
store <4 x i16> %8, <4 x i16>* undef, align 4
ret void
}
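
; Full 128-bit case: <8 x i16> averaging is expected to lower to pavgw/vpavgw
; operating directly on the loaded xmm value.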
define void @avg_v8i16(<8 x i16>* %a, <8 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v8i16:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: pavgw (%rdi), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v8i16:
; AVX: # BB#0:
; AVX-NEXT: vmovdqa (%rsi), %xmm0
; AVX-NEXT: vpavgw (%rdi), %xmm0, %xmm0
; AVX-NEXT: vmovdqu %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <8 x i16>, <8 x i16>* %a
%2 = load <8 x i16>, <8 x i16>* %b
%3 = zext <8 x i16> %1 to <8 x i32>
%4 = zext <8 x i16> %2 to <8 x i32>
%5 = add nuw nsw <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <8 x i32> %5, %4
%7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <8 x i32> %7 to <8 x i16>
store <8 x i16> %8, <8 x i16>* undef, align 4
ret void
}
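
; 256-bit case: AVX2 and AVX512 use a single ymm vpavgw, while SSE2 and AVX1 are
; expected to emulate the average with widened 32-bit adds, an all-ones subtract
; for the +1, and shifts before packing back to i16.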
define void @avg_v16i16(<16 x i16>* %a, <16 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v16i16:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm2
; SSE2-NEXT: movdqa 16(%rdi), %xmm4
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm5, %xmm5
; SSE2-NEXT: movdqa %xmm2, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3]
; SSE2-NEXT: movdqa %xmm4, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm5[4],xmm7[5],xmm5[5],xmm7[6],xmm5[6],xmm7[7],xmm5[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
; SSE2-NEXT: movdqa %xmm0, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
; SSE2-NEXT: paddd %xmm6, %xmm3
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]
; SSE2-NEXT: paddd %xmm2, %xmm0
; SSE2-NEXT: movdqa %xmm1, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm5[4],xmm2[5],xmm5[5],xmm2[6],xmm5[6],xmm2[7],xmm5[7]
; SSE2-NEXT: paddd %xmm7, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
; SSE2-NEXT: paddd %xmm4, %xmm1
; SSE2-NEXT: pcmpeqd %xmm4, %xmm4
; SSE2-NEXT: psubd %xmm4, %xmm3
; SSE2-NEXT: psubd %xmm4, %xmm0
; SSE2-NEXT: psubd %xmm4, %xmm2
; SSE2-NEXT: psubd %xmm4, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: packssdw %xmm3, %xmm0
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm2, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v16i16:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm2, %xmm2
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpcmpeqd %xmm4, %xmm4, %xmm4
; AVX1-NEXT: vpsubd %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vpsubd %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpsubd %xmm4, %xmm2, %xmm2
; AVX1-NEXT: vpsubd %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpxor %xmm4, %xmm4, %xmm4
; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm4[1],xmm0[2],xmm4[3],xmm0[4],xmm4[5],xmm0[6],xmm4[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm4[1],xmm1[2],xmm4[3],xmm1[4],xmm4[5],xmm1[6],xmm4[7]
; AVX1-NEXT: vpackusdw %xmm0, %xmm1, %xmm0
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm2[0],xmm4[1],xmm2[2],xmm4[3],xmm2[4],xmm4[5],xmm2[6],xmm4[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm3[0],xmm4[1],xmm3[2],xmm4[3],xmm3[4],xmm4[5],xmm3[6],xmm4[7]
; AVX1-NEXT: vpackusdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v16i16:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rsi), %ymm0
; AVX2-NEXT: vpavgw (%rdi), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v16i16:
; AVX512: # BB#0:
; AVX512-NEXT: vmovdqa (%rsi), %ymm0
; AVX512-NEXT: vpavgw (%rdi), %ymm0, %ymm0
; AVX512-NEXT: vmovdqu %ymm0, (%rax)
; AVX512-NEXT: vzeroupper
; AVX512-NEXT: retq
%1 = load <16 x i16>, <16 x i16>* %a
%2 = load <16 x i16>, <16 x i16>* %b
%3 = zext <16 x i16> %1 to <16 x i32>
%4 = zext <16 x i16> %2 to <16 x i32>
%5 = add nuw nsw <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%6 = add nuw nsw <16 x i32> %5, %4
%7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <16 x i32> %7 to <16 x i16>
store <16 x i16> %8, <16 x i16>* undef, align 4
ret void
}
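
; 512-bit case: with SSE2 the operation is split across four 128-bit chunks and
; emulated with widened adds, an all-ones subtract, and shifts.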
define void @avg_v32i16(<32 x i16>* %a, <32 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v32i16:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm4
; SSE2-NEXT: movdqa 16(%rdi), %xmm11
; SSE2-NEXT: movdqa 32(%rdi), %xmm10
; SSE2-NEXT: movdqa 48(%rdi), %xmm8
; SSE2-NEXT: movdqa (%rsi), %xmm9
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: movdqa 48(%rsi), %xmm3
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm4, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm11, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm10, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm8, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm9, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: paddd %xmm6, %xmm7
; SSE2-NEXT: punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3]
; SSE2-NEXT: paddd %xmm4, %xmm9
; SSE2-NEXT: movdqa %xmm1, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: paddd %xmm5, %xmm6
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: paddd %xmm11, %xmm1
; SSE2-NEXT: movdqa %xmm2, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: paddd %xmm12, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: paddd %xmm10, %xmm2
; SSE2-NEXT: movdqa %xmm3, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: paddd %xmm13, %xmm4
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: paddd %xmm8, %xmm3
; SSE2-NEXT: pcmpeqd %xmm0, %xmm0
; SSE2-NEXT: psubd %xmm0, %xmm7
; SSE2-NEXT: psubd %xmm0, %xmm9
; SSE2-NEXT: psubd %xmm0, %xmm6
; SSE2-NEXT: psubd %xmm0, %xmm1
; SSE2-NEXT: psubd %xmm0, %xmm5
; SSE2-NEXT: psubd %xmm0, %xmm2
; SSE2-NEXT: psubd %xmm0, %xmm4
; SSE2-NEXT: psubd %xmm0, %xmm3
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: pslld $16, %xmm7
; SSE2-NEXT: psrad $16, %xmm7
; SSE2-NEXT: pslld $16, %xmm9
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm9
|
|
|
|
; SSE2-NEXT: packssdw %xmm7, %xmm9
|
2016-08-26 01:17:46 +08:00
|
|
|
; SSE2-NEXT: pslld $16, %xmm6
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm6
|
|
|
|
; SSE2-NEXT: pslld $16, %xmm1
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm1
|
|
|
|
; SSE2-NEXT: packssdw %xmm6, %xmm1
|
|
|
|
; SSE2-NEXT: pslld $16, %xmm5
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm5
|
|
|
|
; SSE2-NEXT: pslld $16, %xmm2
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm2
|
|
|
|
; SSE2-NEXT: packssdw %xmm5, %xmm2
|
|
|
|
; SSE2-NEXT: pslld $16, %xmm4
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm4
|
|
|
|
; SSE2-NEXT: pslld $16, %xmm3
|
|
|
|
; SSE2-NEXT: psrad $16, %xmm3
|
|
|
|
; SSE2-NEXT: packssdw %xmm4, %xmm3
|
|
|
|
; SSE2-NEXT: movdqu %xmm3, (%rax)
|
|
|
|
; SSE2-NEXT: movdqu %xmm2, (%rax)
|
|
|
|
; SSE2-NEXT: movdqu %xmm1, (%rax)
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; SSE2-NEXT: movdqu %xmm9, (%rax)
|
2016-08-26 01:17:46 +08:00
|
|
|
; SSE2-NEXT: retq
|
|
|
|
;
; AVX1-LABEL: avg_v32i16:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm8 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm0, %xmm9
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm5, %xmm5
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm7, %xmm8, %xmm7
; AVX1-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
; AVX1-NEXT: vpsubd %xmm0, %xmm9, %xmm8
; AVX1-NEXT: vpsubd %xmm0, %xmm1, %xmm1
; AVX1-NEXT: vpsubd %xmm0, %xmm2, %xmm2
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vpsubd %xmm0, %xmm4, %xmm4
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm5
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm6
; AVX1-NEXT: vpsubd %xmm0, %xmm7, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm9
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm6
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm5
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm7
; AVX1-NEXT: vpxor %xmm0, %xmm0, %xmm0
; AVX1-NEXT: vpblendw {{.*#+}} xmm7 = xmm7[0],xmm0[1],xmm7[2],xmm0[3],xmm7[4],xmm0[5],xmm7[6],xmm0[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm0[1],xmm1[2],xmm0[3],xmm1[4],xmm0[5],xmm1[6],xmm0[7]
; AVX1-NEXT: vpackusdw %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm2[0],xmm0[1],xmm2[2],xmm0[3],xmm2[4],xmm0[5],xmm2[6],xmm0[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm3[0],xmm0[1],xmm3[2],xmm0[3],xmm3[4],xmm0[5],xmm3[6],xmm0[7]
; AVX1-NEXT: vpackusdw %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm4[0],xmm0[1],xmm4[2],xmm0[3],xmm4[4],xmm0[5],xmm4[6],xmm0[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm5[0],xmm0[1],xmm5[2],xmm0[3],xmm5[4],xmm0[5],xmm5[6],xmm0[7]
; AVX1-NEXT: vpackusdw %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm6[0],xmm0[1],xmm6[2],xmm0[3],xmm6[4],xmm0[5],xmm6[6],xmm0[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm9[0],xmm0[1],xmm9[2],xmm0[3],xmm9[4],xmm0[5],xmm9[6],xmm0[7]
; AVX1-NEXT: vpackusdw %xmm3, %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vmovups %ymm1, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v32i16:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpcmpeqd %ymm4, %ymm4, %ymm4
; AVX2-NEXT: vpsubd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpsubd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpsubd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpsubd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
; AVX2-NEXT: vpshufb %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: vpshufb %ymm4, %ymm2, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm3, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i16:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpternlogd $255, %zmm2, %zmm2, %zmm2
; AVX512F-NEXT: vpsubd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpsubd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdw %zmm0, (%rax)
; AVX512F-NEXT: vpmovdw %zmm1, (%rax)
; AVX512F-NEXT: vzeroupper
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i16:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa64 (%rsi), %zmm0
; AVX512BW-NEXT: vpavgw (%rdi), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
  %1 = load <32 x i16>, <32 x i16>* %a
  %2 = load <32 x i16>, <32 x i16>* %b
  %3 = zext <32 x i16> %1 to <32 x i32>
  %4 = zext <32 x i16> %2 to <32 x i32>
  %5 = add nuw nsw <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = add nuw nsw <32 x i32> %5, %4
  %7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <32 x i32> %7 to <32 x i16>
  store <32 x i16> %8, <32 x i16>* undef, align 4
  ret void
}

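; Variant of avg_v4i8 where the +1 is added after the a+b sum; the (a + b + 1) >> 1 idiom should still select pavgb / vpavgb.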
define void @avg_v4i8_2(<4 x i8>* %a, <4 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v4i8_2:
; SSE2: # BB#0:
; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE2-NEXT: pavgb %xmm0, %xmm1
; SSE2-NEXT: movd %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v4i8_2:
; AVX: # BB#0:
; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX-NEXT: vpavgb %xmm1, %xmm0, %xmm0
; AVX-NEXT: vmovd %xmm0, (%rax)
; AVX-NEXT: retq
  %1 = load <4 x i8>, <4 x i8>* %a
  %2 = load <4 x i8>, <4 x i8>* %b
  %3 = zext <4 x i8> %1 to <4 x i32>
  %4 = zext <4 x i8> %2 to <4 x i32>
  %5 = add nuw nsw <4 x i32> %3, %4
  %6 = add nuw nsw <4 x i32> %5, <i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <4 x i32> %7 to <4 x i8>
  store <4 x i8> %8, <4 x i8>* undef, align 4
  ret void
}

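; <8 x i8> version of the same post-add +1 pattern; the operands are loaded with movq / vmovq and averaged with pavgb / vpavgb.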
define void @avg_v8i8_2(<8 x i8>* %a, <8 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v8i8_2:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT: pavgb %xmm0, %xmm1
; SSE2-NEXT: movq %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v8i8_2:
; AVX: # BB#0:
; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX-NEXT: vpavgb %xmm1, %xmm0, %xmm0
; AVX-NEXT: vmovq %xmm0, (%rax)
; AVX-NEXT: retq
  %1 = load <8 x i8>, <8 x i8>* %a
  %2 = load <8 x i8>, <8 x i8>* %b
  %3 = zext <8 x i8> %1 to <8 x i32>
  %4 = zext <8 x i8> %2 to <8 x i32>
  %5 = add nuw nsw <8 x i32> %3, %4
  %6 = add nuw nsw <8 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <8 x i32> %7 to <8 x i8>
  store <8 x i8> %8, <8 x i8>* undef, align 4
  ret void
}

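; <16 x i8> version: a full XMM vector, so pavgb / vpavgb can fold the (%rsi) load directly.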
define void @avg_v16i8_2(<16 x i8>* %a, <16 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v16i8_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgb (%rsi), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v16i8_2:
; AVX: # BB#0:
; AVX-NEXT: vmovdqa (%rdi), %xmm0
; AVX-NEXT: vpavgb (%rsi), %xmm0, %xmm0
; AVX-NEXT: vmovdqu %xmm0, (%rax)
; AVX-NEXT: retq
  %1 = load <16 x i8>, <16 x i8>* %a
  %2 = load <16 x i8>, <16 x i8>* %b
  %3 = zext <16 x i8> %1 to <16 x i32>
  %4 = zext <16 x i8> %2 to <16 x i32>
  %5 = add nuw nsw <16 x i32> %3, %4
  %6 = add nuw nsw <16 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <16 x i32> %7 to <16 x i8>
  store <16 x i8> %8, <16 x i8>* undef, align 4
  ret void
}

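; <32 x i8> version: with SSE2 the inputs are split into 16-byte halves and widened to i32 via punpck unpacks before the add/shift sequence.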
define void @avg_v32i8_2(<32 x i8>* %a, <32 x i8>* %b) nounwind {
|
2016-08-26 01:17:46 +08:00
|
|
|
; SSE2-LABEL: avg_v32i8_2:
|
|
|
|
; SSE2: # BB#0:
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; SSE2-NEXT: movdqa (%rdi), %xmm3
; SSE2-NEXT: movdqa 16(%rdi), %xmm8
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm4, %xmm4
; SSE2-NEXT: movdqa %xmm3, %xmm5
; SSE2-NEXT: punpckhbw {{.*#+}} xmm5 = xmm5[8],xmm4[8],xmm5[9],xmm4[9],xmm5[10],xmm4[10],xmm5[11],xmm4[11],xmm5[12],xmm4[12],xmm5[13],xmm4[13],xmm5[14],xmm4[14],xmm5[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm5, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3],xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm3, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm4[4],xmm12[5],xmm4[5],xmm12[6],xmm4[6],xmm12[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT: movdqa %xmm8, %xmm7
; SSE2-NEXT: punpckhbw {{.*#+}} xmm7 = xmm7[8],xmm4[8],xmm7[9],xmm4[9],xmm7[10],xmm4[10],xmm7[11],xmm4[11],xmm7[12],xmm4[12],xmm7[13],xmm4[13],xmm7[14],xmm4[14],xmm7[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm7, %xmm11
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm4[4],xmm11[5],xmm4[5],xmm11[6],xmm4[6],xmm11[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm4[0],xmm7[1],xmm4[1],xmm7[2],xmm4[2],xmm7[3],xmm4[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3],xmm8[4],xmm4[4],xmm8[5],xmm4[5],xmm8[6],xmm4[6],xmm8[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm8, %xmm10
; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm4[4],xmm10[5],xmm4[5],xmm10[6],xmm4[6],xmm10[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm4[0],xmm8[1],xmm4[1],xmm8[2],xmm4[2],xmm8[3],xmm4[3]
; SSE2-NEXT: movdqa %xmm0, %xmm2
; SSE2-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm4[8],xmm2[9],xmm4[9],xmm2[10],xmm4[10],xmm2[11],xmm4[11],xmm2[12],xmm4[12],xmm2[13],xmm4[13],xmm2[14],xmm4[14],xmm2[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm2, %xmm9
; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm4[4],xmm9[5],xmm4[5],xmm9[6],xmm4[6],xmm9[7],xmm4[7]
; SSE2-NEXT: paddd %xmm6, %xmm9
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT: paddd %xmm5, %xmm2
; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3],xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm0, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
; SSE2-NEXT: paddd %xmm12, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
; SSE2-NEXT: paddd %xmm3, %xmm0
; SSE2-NEXT: movdqa %xmm1, %xmm3
; SSE2-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm4[8],xmm3[9],xmm4[9],xmm3[10],xmm4[10],xmm3[11],xmm4[11],xmm3[12],xmm4[12],xmm3[13],xmm4[13],xmm3[14],xmm4[14],xmm3[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm3, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT: paddd %xmm11, %xmm6
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT: paddd %xmm7, %xmm3
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3],xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm1, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; SSE2-NEXT: paddd %xmm10, %xmm7
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT: paddd %xmm8, %xmm1
; SSE2-NEXT: pcmpeqd %xmm4, %xmm4
; SSE2-NEXT: psubd %xmm4, %xmm9
; SSE2-NEXT: psubd %xmm4, %xmm2
; SSE2-NEXT: psubd %xmm4, %xmm5
; SSE2-NEXT: psubd %xmm4, %xmm0
; SSE2-NEXT: psubd %xmm4, %xmm6
; SSE2-NEXT: psubd %xmm4, %xmm3
; SSE2-NEXT: psubd %xmm4, %xmm7
; SSE2-NEXT: psubd %xmm4, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm4, %xmm9
; SSE2-NEXT: pand %xmm4, %xmm2
; SSE2-NEXT: packuswb %xmm9, %xmm2
; SSE2-NEXT: pand %xmm4, %xmm5
; SSE2-NEXT: pand %xmm4, %xmm0
; SSE2-NEXT: packuswb %xmm5, %xmm0
; SSE2-NEXT: packuswb %xmm2, %xmm0
; SSE2-NEXT: pand %xmm4, %xmm6
; SSE2-NEXT: pand %xmm4, %xmm3
; SSE2-NEXT: packuswb %xmm6, %xmm3
; SSE2-NEXT: pand %xmm4, %xmm7
; SSE2-NEXT: pand %xmm4, %xmm1
; SSE2-NEXT: packuswb %xmm7, %xmm1
; SSE2-NEXT: packuswb %xmm3, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
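; (AVX1 below splits the 256-bit operation into 128-bit halves and rebuilds the
; result with vinsertf128.)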
; AVX1-LABEL: avg_v32i8_2:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm0, %xmm9
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm5, %xmm5
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm8, %xmm7
; AVX1-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
; AVX1-NEXT: vpsubd %xmm0, %xmm9, %xmm8
; AVX1-NEXT: vpsubd %xmm0, %xmm1, %xmm1
; AVX1-NEXT: vpsubd %xmm0, %xmm2, %xmm2
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vpsubd %xmm0, %xmm4, %xmm4
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm5
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm6
; AVX1-NEXT: vpsubd %xmm0, %xmm7, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm9
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm6
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm5
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm7
; AVX1-NEXT: vmovdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; AVX1-NEXT: vpand %xmm0, %xmm7, %xmm7
; AVX1-NEXT: vpand %xmm0, %xmm1, %xmm1
; AVX1-NEXT: vpackuswb %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vpand %xmm0, %xmm2, %xmm2
; AVX1-NEXT: vpand %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpackuswb %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpand %xmm0, %xmm4, %xmm2
; AVX1-NEXT: vpand %xmm0, %xmm5, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpand %xmm0, %xmm6, %xmm3
; AVX1-NEXT: vpand %xmm0, %xmm9, %xmm0
; AVX1-NEXT: vpackuswb %xmm3, %xmm0, %xmm0
; AVX1-NEXT: vpackuswb %xmm2, %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
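; (With AVX2 and AVX512 the whole sequence is recognized as a single 256-bit vpavgb.)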
; AVX2-LABEL: avg_v32i8_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgb (%rsi), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v32i8_2:
; AVX512: # BB#0:
; AVX512-NEXT: vmovdqa (%rdi), %ymm0
; AVX512-NEXT: vpavgb (%rsi), %ymm0, %ymm0
; AVX512-NEXT: vmovdqu %ymm0, (%rax)
; AVX512-NEXT: vzeroupper
; AVX512-NEXT: retq
  %1 = load <32 x i8>, <32 x i8>* %a
  %2 = load <32 x i8>, <32 x i8>* %b
  %3 = zext <32 x i8> %1 to <32 x i32>
  %4 = zext <32 x i8> %2 to <32 x i32>
  %5 = add nuw nsw <32 x i32> %3, %4
  %6 = add nuw nsw <32 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %8 = trunc <32 x i32> %7 to <32 x i8>
  store <32 x i8> %8, <32 x i8>* undef, align 4
  ret void
}
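
; NOTE: For the 64-byte test below, the SSE2 expansion needs more 128-bit temporaries
; than there are XMM registers, so a few values are spilled to the stack and reloaded
; (see the "16-byte Spill"/"16-byte Reload" annotations in the checks).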
define void @avg_v64i8_2(<64 x i8>* %a, <64 x i8>* %b) nounwind {
; SSE2-LABEL: avg_v64i8_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rsi), %xmm14
; SSE2-NEXT: movdqa 16(%rsi), %xmm12
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: movdqa 48(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm14, %xmm7
; SSE2-NEXT: punpckhbw {{.*#+}} xmm7 = xmm7[8],xmm0[8],xmm7[9],xmm0[9],xmm7[10],xmm0[10],xmm7[11],xmm0[11],xmm7[12],xmm0[12],xmm7[13],xmm0[13],xmm7[14],xmm0[14],xmm7[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm7, %xmm15
; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3],xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm14, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm0[4],xmm8[5],xmm0[5],xmm8[6],xmm0[6],xmm8[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm12, %xmm6
; SSE2-NEXT: punpckhbw {{.*#+}} xmm6 = xmm6[8],xmm0[8],xmm6[9],xmm0[9],xmm6[10],xmm0[10],xmm6[11],xmm0[11],xmm6[12],xmm0[12],xmm6[13],xmm0[13],xmm6[14],xmm0[14],xmm6[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm6, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm0[0],xmm6[1],xmm0[1],xmm6[2],xmm0[2],xmm6[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm12 = xmm12[0],xmm0[0],xmm12[1],xmm0[1],xmm12[2],xmm0[2],xmm12[3],xmm0[3],xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm12, %xmm9
; SSE2-NEXT: punpckhwd {{.*#+}} xmm9 = xmm9[4],xmm0[4],xmm9[5],xmm0[5],xmm9[6],xmm0[6],xmm9[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm12 = xmm12[0],xmm0[0],xmm12[1],xmm0[1],xmm12[2],xmm0[2],xmm12[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm2, %xmm5
; SSE2-NEXT: punpckhbw {{.*#+}} xmm5 = xmm5[8],xmm0[8],xmm5[9],xmm0[9],xmm5[10],xmm0[10],xmm5[11],xmm0[11],xmm5[12],xmm0[12],xmm5[13],xmm0[13],xmm5[14],xmm0[14],xmm5[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm5, %xmm11
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm0[0],xmm5[1],xmm0[1],xmm5[2],xmm0[2],xmm5[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm10
; SSE2-NEXT: punpckhwd {{.*#+}} xmm10 = xmm10[4],xmm0[4],xmm10[5],xmm0[5],xmm10[6],xmm0[6],xmm10[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm1, %xmm4
; SSE2-NEXT: punpckhbw {{.*#+}} xmm4 = xmm4[8],xmm0[8],xmm4[9],xmm0[9],xmm4[10],xmm0[10],xmm4[11],xmm0[11],xmm4[12],xmm0[12],xmm4[13],xmm0[13],xmm4[14],xmm0[14],xmm4[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm4, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: paddd %xmm1, %xmm1
; SSE2-NEXT: paddd %xmm3, %xmm3
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd %xmm4, %xmm4
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; SSE2-NEXT: paddd %xmm3, %xmm3
; SSE2-NEXT: paddd %xmm2, %xmm2
; SSE2-NEXT: paddd %xmm10, %xmm10
; SSE2-NEXT: paddd %xmm5, %xmm5
; SSE2-NEXT: paddd %xmm11, %xmm11
; SSE2-NEXT: paddd %xmm12, %xmm12
; SSE2-NEXT: paddd %xmm9, %xmm9
; SSE2-NEXT: paddd %xmm6, %xmm6
; SSE2-NEXT: paddd %xmm13, %xmm13
; SSE2-NEXT: paddd %xmm14, %xmm14
; SSE2-NEXT: paddd %xmm8, %xmm8
; SSE2-NEXT: paddd %xmm7, %xmm7
; SSE2-NEXT: paddd %xmm15, %xmm15
; SSE2-NEXT: pcmpeqd %xmm0, %xmm0
; SSE2-NEXT: psubd %xmm0, %xmm15
; SSE2-NEXT: psubd %xmm0, %xmm7
; SSE2-NEXT: psubd %xmm0, %xmm8
; SSE2-NEXT: psubd %xmm0, %xmm14
; SSE2-NEXT: psubd %xmm0, %xmm13
; SSE2-NEXT: psubd %xmm0, %xmm6
; SSE2-NEXT: psubd %xmm0, %xmm9
; SSE2-NEXT: psubd %xmm0, %xmm12
; SSE2-NEXT: psubd %xmm0, %xmm11
; SSE2-NEXT: psubd %xmm0, %xmm5
; SSE2-NEXT: psubd %xmm0, %xmm10
; SSE2-NEXT: psubd %xmm0, %xmm2
; SSE2-NEXT: psubd %xmm0, %xmm3
; SSE2-NEXT: movdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: psubd %xmm0, %xmm4
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; SSE2-NEXT: psubd %xmm0, %xmm3
; SSE2-NEXT: psubd %xmm0, %xmm1
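; (From here the checks show the common tail of the expansion: psrld $1 on each
; 32-bit lane, then pand with the 255 byte mask and packuswb to narrow back to bytes.)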
; SSE2-NEXT: psrld $1, %xmm7
|
|
|
|
; SSE2-NEXT: psrld $1, %xmm15
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
|
|
|
|
; SSE2-NEXT: pand %xmm0, %xmm15
|
|
|
|
; SSE2-NEXT: pand %xmm0, %xmm7
|
2017-06-23 22:16:50 +08:00
|
|
|
; SSE2-NEXT: packuswb %xmm15, %xmm7
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: packuswb %xmm8, %xmm14
; SSE2-NEXT: packuswb %xmm7, %xmm14
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: packuswb %xmm13, %xmm6
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm9
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: packuswb %xmm9, %xmm12
; SSE2-NEXT: packuswb %xmm6, %xmm12
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: packuswb %xmm11, %xmm5
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm10, %xmm2
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm5 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm4
; SSE2-NEXT: packuswb %xmm5, %xmm4
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: movdqa %xmm3, %xmm5
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: packuswb %xmm5, %xmm1
; SSE2-NEXT: packuswb %xmm4, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm12, (%rax)
; SSE2-NEXT: movdqu %xmm14, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v64i8_2:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm10 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm11 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm12 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm13 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm14 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm15 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm7, %xmm7, %xmm7
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpaddd %xmm6, %xmm6, %xmm6
; AVX1-NEXT: vmovdqa %xmm6, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpaddd %xmm5, %xmm5, %xmm6
; AVX1-NEXT: vpaddd %xmm4, %xmm4, %xmm5
; AVX1-NEXT: vpaddd %xmm3, %xmm3, %xmm4
; AVX1-NEXT: vpaddd %xmm2, %xmm2, %xmm3
; AVX1-NEXT: vpaddd %xmm1, %xmm1, %xmm2
; AVX1-NEXT: vpaddd %xmm0, %xmm0, %xmm1
; AVX1-NEXT: vpaddd %xmm15, %xmm15, %xmm15
; AVX1-NEXT: vpaddd %xmm14, %xmm14, %xmm14
; AVX1-NEXT: vpaddd %xmm13, %xmm13, %xmm13
; AVX1-NEXT: vpaddd %xmm12, %xmm12, %xmm12
; AVX1-NEXT: vpaddd %xmm11, %xmm11, %xmm11
; AVX1-NEXT: vpaddd %xmm10, %xmm10, %xmm10
; AVX1-NEXT: vpaddd %xmm9, %xmm9, %xmm9
; AVX1-NEXT: vpaddd %xmm8, %xmm8, %xmm8
; AVX1-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
; AVX1-NEXT: vpsubd %xmm0, %xmm8, %xmm7
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpsubd %xmm0, %xmm9, %xmm8
; AVX1-NEXT: vpsubd %xmm0, %xmm10, %xmm10
; AVX1-NEXT: vpsubd %xmm0, %xmm11, %xmm9
; AVX1-NEXT: vpsubd %xmm0, %xmm12, %xmm7
; AVX1-NEXT: vmovdqa %xmm7, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpsubd %xmm0, %xmm13, %xmm11
; AVX1-NEXT: vpsubd %xmm0, %xmm14, %xmm13
; AVX1-NEXT: vpsubd %xmm0, %xmm15, %xmm12
; AVX1-NEXT: vpsubd %xmm0, %xmm1, %xmm1
; AVX1-NEXT: vpsubd %xmm0, %xmm2, %xmm15
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm2
; AVX1-NEXT: vpsubd %xmm0, %xmm4, %xmm14
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm3
; AVX1-NEXT: vmovdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm5
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vmovdqa %xmm3, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm0
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm6
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm8
; AVX1-NEXT: vmovdqa {{.*#+}} xmm7 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; AVX1-NEXT: vpand %xmm7, %xmm8, %xmm8
; AVX1-NEXT: vpand %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpackuswb %xmm8, %xmm6, %xmm8
; AVX1-NEXT: vpsrld $1, %xmm9, %xmm6
; AVX1-NEXT: vpsrld $1, %xmm10, %xmm4
; AVX1-NEXT: vpand %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpackuswb %xmm4, %xmm6, %xmm4
; AVX1-NEXT: vpackuswb %xmm8, %xmm4, %xmm4
; AVX1-NEXT: vpsrld $1, %xmm11, %xmm6
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm3
; AVX1-NEXT: vpand %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpackuswb %xmm3, %xmm6, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm12, %xmm6
; AVX1-NEXT: vpsrld $1, %xmm13, %xmm0
; AVX1-NEXT: vpand %xmm7, %xmm0, %xmm0
; AVX1-NEXT: vpand %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpackuswb %xmm0, %xmm6, %xmm0
; AVX1-NEXT: vpackuswb %xmm3, %xmm0, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm0, %ymm0
; AVX1-NEXT: vpsrld $1, %xmm15, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpand %xmm7, %xmm1, %xmm1
; AVX1-NEXT: vpand %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm1, %xmm3, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm14, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT: vpand %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpand %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpackuswb %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm2
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpackuswb %xmm3, %xmm2, %xmm2
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm7, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; AVX1-NEXT: vmovups %ymm1, (%rax)
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v64i8_2:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpaddd %ymm7, %ymm7, %ymm7
; AVX2-NEXT: vpaddd %ymm6, %ymm6, %ymm6
; AVX2-NEXT: vpaddd %ymm5, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm4, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm3, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm2, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm1, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpcmpeqd %ymm8, %ymm8, %ymm8
; AVX2-NEXT: vpsubd %ymm8, %ymm0, %ymm9
; AVX2-NEXT: vpsubd %ymm8, %ymm1, %ymm10
; AVX2-NEXT: vpsubd %ymm8, %ymm2, %ymm2
; AVX2-NEXT: vpsubd %ymm8, %ymm3, %ymm3
; AVX2-NEXT: vpsubd %ymm8, %ymm4, %ymm4
; AVX2-NEXT: vpsubd %ymm8, %ymm5, %ymm5
; AVX2-NEXT: vpsubd %ymm8, %ymm6, %ymm1
; AVX2-NEXT: vpsubd %ymm8, %ymm7, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm11
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm12
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm6
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm7
; AVX2-NEXT: vpsrld $1, %ymm10, %ymm8
; AVX2-NEXT: vpsrld $1, %ymm9, %ymm3
; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
; AVX2-NEXT: vpshufb %ymm2, %ymm3, %ymm3
; AVX2-NEXT: vpermq {{.*#+}} ymm9 = ymm3[0,2,2,3]
; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX2-NEXT: vpshufb %xmm3, %xmm9, %xmm0
; AVX2-NEXT: vpshufb %ymm2, %ymm8, %ymm8
; AVX2-NEXT: vpermq {{.*#+}} ymm8 = ymm8[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm8, %xmm1
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm7, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm6, %ymm6
; AVX2-NEXT: vpermq {{.*#+}} ymm6 = ymm6[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm6, %xmm6
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm6[0],xmm1[0]
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: vpshufb %ymm2, %ymm4, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vpshufb %ymm2, %ymm5, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm4[0],xmm1[0]
; AVX2-NEXT: vpshufb %ymm2, %ymm12, %ymm4
; AVX2-NEXT: vpermq {{.*#+}} ymm4 = ymm4[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX2-NEXT: vpshufb %ymm2, %ymm11, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vpshufb %xmm3, %xmm2, %xmm2
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm4[0]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpaddd %zmm3, %zmm3, %zmm3
; AVX512F-NEXT: vpaddd %zmm2, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm1, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm0, %zmm0, %zmm0
; AVX512F-NEXT: vpternlogd $255, %zmm4, %zmm4, %zmm4
; AVX512F-NEXT: vpsubd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpsubd %zmm4, %zmm1, %zmm1
; AVX512F-NEXT: vpsubd %zmm4, %zmm2, %zmm2
; AVX512F-NEXT: vpsubd %zmm4, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX512F-NEXT: vpmovdb %zmm2, %xmm1
; AVX512F-NEXT: vpmovdb %zmm3, %xmm2
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX512F-NEXT: vmovdqu %ymm1, (%rax)
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: vzeroupper
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa64 (%rsi), %zmm0
; AVX512BW-NEXT: vpavgb %zmm0, %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
%1 = load <64 x i8>, <64 x i8>* %a
%2 = load <64 x i8>, <64 x i8>* %b
%3 = zext <64 x i8> %1 to <64 x i32>
%4 = zext <64 x i8> %2 to <64 x i32>
%5 = add nuw nsw <64 x i32> %4, %4
%6 = add nuw nsw <64 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <64 x i32> %7 to <64 x i8>
store <64 x i8> %8, <64 x i8>* undef, align 4
ret void
}

define void @avg_v4i16_2(<4 x i16>* %a, <4 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v4i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; SSE2-NEXT: pavgw %xmm0, %xmm1
; SSE2-NEXT: movq %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v4i16_2:
; AVX: # BB#0:
; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
; AVX-NEXT: vpavgw %xmm1, %xmm0, %xmm0
; AVX-NEXT: vmovq %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <4 x i16>, <4 x i16>* %a
%2 = load <4 x i16>, <4 x i16>* %b
%3 = zext <4 x i16> %1 to <4 x i32>
%4 = zext <4 x i16> %2 to <4 x i32>
%5 = add nuw nsw <4 x i32> %3, %4
%6 = add nuw nsw <4 x i32> %5, <i32 1, i32 1, i32 1, i32 1>
%7 = lshr <4 x i32> %6, <i32 1, i32 1, i32 1, i32 1>
%8 = trunc <4 x i32> %7 to <4 x i16>
store <4 x i16> %8, <4 x i16>* undef, align 4
ret void
}

define void @avg_v8i16_2(<8 x i16>* %a, <8 x i16>* %b) nounwind {
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-LABEL: avg_v8i16_2:
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
; SSE2: # BB#0:
|
2015-12-01 05:46:08 +08:00
|
|
|
; SSE2-NEXT: movdqa (%rdi), %xmm0
|
|
|
|
; SSE2-NEXT: pavgw (%rsi), %xmm0
|
|
|
|
; SSE2-NEXT: movdqu %xmm0, (%rax)
|
|
|
|
; SSE2-NEXT: retq
|
|
|
|
;
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX-LABEL: avg_v8i16_2:
|
|
|
|
; AVX: # BB#0:
|
|
|
|
; AVX-NEXT: vmovdqa (%rdi), %xmm0
|
|
|
|
; AVX-NEXT: vpavgw (%rsi), %xmm0, %xmm0
|
|
|
|
; AVX-NEXT: vmovdqu %xmm0, (%rax)
|
|
|
|
; AVX-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <8 x i16>, <8 x i16>* %a
|
|
|
|
%2 = load <8 x i16>, <8 x i16>* %b
|
|
|
|
%3 = zext <8 x i16> %1 to <8 x i32>
|
|
|
|
%4 = zext <8 x i16> %2 to <8 x i32>
|
|
|
|
%5 = add nuw nsw <8 x i32> %3, %4
|
|
|
|
%6 = add nuw nsw <8 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
|
|
|
%7 = lshr <8 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <8 x i32> %7 to <8 x i16>
store <8 x i16> %8, <8 x i16>* undef, align 4
ret void
}
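; avg_v16i16_2 exercises the same (a + b + 1) >> 1 averaging idiom on <16 x i16>:
; both operands are zero-extended to <16 x i32>, summed, incremented, shifted
; right by one, and truncated back to <16 x i16>. The checks that follow show
; SSE2 and AVX1 still lowering this through widened adds (an all-ones vector
; plus psubd/vpsubd stands in for the +1) and shifts, while AVX2 and AVX512
; collapse the whole sequence into a single vpavgw on a 256-bit register.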
define void @avg_v16i16_2(<16 x i16>* %a, <16 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v16i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm2
; SSE2-NEXT: movdqa 16(%rdi), %xmm4
; SSE2-NEXT: movdqa (%rsi), %xmm0
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: pxor %xmm5, %xmm5
; SSE2-NEXT: movdqa %xmm2, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3]
; SSE2-NEXT: movdqa %xmm4, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm5[4],xmm7[5],xmm5[5],xmm7[6],xmm5[6],xmm7[7],xmm5[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
; SSE2-NEXT: movdqa %xmm0, %xmm3
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
; SSE2-NEXT: paddd %xmm6, %xmm3
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]
; SSE2-NEXT: paddd %xmm2, %xmm0
; SSE2-NEXT: movdqa %xmm1, %xmm2
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm5[4],xmm2[5],xmm5[5],xmm2[6],xmm5[6],xmm2[7],xmm5[7]
; SSE2-NEXT: paddd %xmm7, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3]
; SSE2-NEXT: paddd %xmm4, %xmm1
; SSE2-NEXT: pcmpeqd %xmm4, %xmm4
; SSE2-NEXT: psubd %xmm4, %xmm3
; SSE2-NEXT: psubd %xmm4, %xmm0
; SSE2-NEXT: psubd %xmm4, %xmm2
; SSE2-NEXT: psubd %xmm4, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: packssdw %xmm3, %xmm0
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm2, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v16i16_2:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm2, %xmm2
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpaddd %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpcmpeqd %xmm4, %xmm4, %xmm4
; AVX1-NEXT: vpsubd %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vpsubd %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpsubd %xmm4, %xmm2, %xmm2
; AVX1-NEXT: vpsubd %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpxor %xmm4, %xmm4, %xmm4
; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm4[1],xmm0[2],xmm4[3],xmm0[4],xmm4[5],xmm0[6],xmm4[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm4[1],xmm1[2],xmm4[3],xmm1[4],xmm4[5],xmm1[6],xmm4[7]
; AVX1-NEXT: vpackusdw %xmm0, %xmm1, %xmm0
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm2[0],xmm4[1],xmm2[2],xmm4[3],xmm2[4],xmm4[5],xmm2[6],xmm4[7]
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm3[0],xmm4[1],xmm3[2],xmm4[3],xmm3[4],xmm4[5],xmm3[6],xmm4[7]
; AVX1-NEXT: vpackusdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v16i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgw (%rsi), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v16i16_2:
; AVX512: # BB#0:
; AVX512-NEXT: vmovdqa (%rdi), %ymm0
; AVX512-NEXT: vpavgw (%rsi), %ymm0, %ymm0
; AVX512-NEXT: vmovdqu %ymm0, (%rax)
; AVX512-NEXT: vzeroupper
; AVX512-NEXT: retq
%1 = load <16 x i16>, <16 x i16>* %a
%2 = load <16 x i16>, <16 x i16>* %b
%3 = zext <16 x i16> %1 to <16 x i32>
%4 = zext <16 x i16> %2 to <16 x i32>
%5 = add nuw nsw <16 x i32> %3, %4
%6 = add nuw nsw <16 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%7 = lshr <16 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <16 x i32> %7 to <16 x i16>
store <16 x i16> %8, <16 x i16>* undef, align 4
ret void
}
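; avg_v32i16_2 applies the same averaging idiom to <32 x i16>. The SSE2 output
; below splits each operand into four 128-bit halves and runs the widened
; paddd/psubd/psrld sequence on every half before repacking with
; pslld/psrad/packssdw; the AVX1 output likewise widens the inputs in
; four-element chunks with vpmovzxwd before adding.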
define void @avg_v32i16_2(<32 x i16>* %a, <32 x i16>* %b) nounwind {
; SSE2-LABEL: avg_v32i16_2:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm4
; SSE2-NEXT: movdqa 16(%rdi), %xmm11
; SSE2-NEXT: movdqa 32(%rdi), %xmm10
; SSE2-NEXT: movdqa 48(%rdi), %xmm8
; SSE2-NEXT: movdqa (%rsi), %xmm9
; SSE2-NEXT: movdqa 16(%rsi), %xmm1
; SSE2-NEXT: movdqa 32(%rsi), %xmm2
; SSE2-NEXT: movdqa 48(%rsi), %xmm3
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm4, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm11, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm11 = xmm11[0],xmm0[0],xmm11[1],xmm0[1],xmm11[2],xmm0[2],xmm11[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm10, %xmm12
; SSE2-NEXT: punpckhwd {{.*#+}} xmm12 = xmm12[4],xmm0[4],xmm12[5],xmm0[5],xmm12[6],xmm0[6],xmm12[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm8, %xmm13
; SSE2-NEXT: punpckhwd {{.*#+}} xmm13 = xmm13[4],xmm0[4],xmm13[5],xmm0[5],xmm13[6],xmm0[6],xmm13[7],xmm0[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm9, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; SSE2-NEXT: paddd %xmm6, %xmm7
; SSE2-NEXT: punpcklwd {{.*#+}} xmm9 = xmm9[0],xmm0[0],xmm9[1],xmm0[1],xmm9[2],xmm0[2],xmm9[3],xmm0[3]
; SSE2-NEXT: paddd %xmm4, %xmm9
; SSE2-NEXT: movdqa %xmm1, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: paddd %xmm5, %xmm6
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: paddd %xmm11, %xmm1
; SSE2-NEXT: movdqa %xmm2, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: paddd %xmm12, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
; SSE2-NEXT: paddd %xmm10, %xmm2
; SSE2-NEXT: movdqa %xmm3, %xmm4
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
; SSE2-NEXT: paddd %xmm13, %xmm4
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3]
; SSE2-NEXT: paddd %xmm8, %xmm3
; SSE2-NEXT: pcmpeqd %xmm0, %xmm0
; SSE2-NEXT: psubd %xmm0, %xmm7
; SSE2-NEXT: psubd %xmm0, %xmm9
; SSE2-NEXT: psubd %xmm0, %xmm6
; SSE2-NEXT: psubd %xmm0, %xmm1
; SSE2-NEXT: psubd %xmm0, %xmm5
; SSE2-NEXT: psubd %xmm0, %xmm2
; SSE2-NEXT: psubd %xmm0, %xmm4
; SSE2-NEXT: psubd %xmm0, %xmm3
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm9
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: pslld $16, %xmm7
; SSE2-NEXT: psrad $16, %xmm7
; SSE2-NEXT: pslld $16, %xmm9
; SSE2-NEXT: psrad $16, %xmm9
; SSE2-NEXT: packssdw %xmm7, %xmm9
; SSE2-NEXT: pslld $16, %xmm6
; SSE2-NEXT: psrad $16, %xmm6
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm6, %xmm1
; SSE2-NEXT: pslld $16, %xmm5
; SSE2-NEXT: psrad $16, %xmm5
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm5, %xmm2
; SSE2-NEXT: pslld $16, %xmm4
; SSE2-NEXT: psrad $16, %xmm4
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: packssdw %xmm4, %xmm3
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm9, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v32i16_2:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm8 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm0, %xmm9
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm3, %xmm3
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm4
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm5, %xmm5
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpaddd %xmm7, %xmm8, %xmm7
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm9, %xmm8
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm2, %xmm2
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm3, %xmm3
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm5, %xmm5
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpsubd %xmm0, %xmm7, %xmm0
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm9
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm6
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm5
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm2
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm7
|
|
|
|
; AVX1-NEXT: vpxor %xmm0, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm7 = xmm7[0],xmm0[1],xmm7[2],xmm0[3],xmm7[4],xmm0[5],xmm7[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm0[1],xmm1[2],xmm0[3],xmm1[4],xmm0[5],xmm1[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpackusdw %xmm7, %xmm1, %xmm1
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm2[0],xmm0[1],xmm2[2],xmm0[3],xmm2[4],xmm0[5],xmm2[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm3[0],xmm0[1],xmm3[2],xmm0[3],xmm3[4],xmm0[5],xmm3[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpackusdw %xmm2, %xmm3, %xmm2
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm2 = xmm4[0],xmm0[1],xmm4[2],xmm0[3],xmm4[4],xmm0[5],xmm4[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm5[0],xmm0[1],xmm5[2],xmm0[3],xmm5[4],xmm0[5],xmm5[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpackusdw %xmm2, %xmm3, %xmm2
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm3 = xmm6[0],xmm0[1],xmm6[2],xmm0[3],xmm6[4],xmm0[5],xmm6[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm9[0],xmm0[1],xmm9[2],xmm0[3],xmm9[4],xmm0[5],xmm9[6],xmm0[7]
|
|
|
|
; AVX1-NEXT: vpackusdw %xmm3, %xmm0, %xmm0
|
|
|
|
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vmovups %ymm0, (%rax)
|
[x86] transform vector inc/dec to use -1 constant (PR33483)
Convert vector increment or decrement to sub/add with an all-ones constant:
add X, <1, 1...> --> sub X, <-1, -1...>
sub X, <1, 1...> --> add X, <-1, -1...>
The all-ones vector constant can be materialized using a pcmpeq instruction that is
commonly recognized as an idiom (has no register dependency), so that's better than
loading a splat 1 constant.
AVX512 uses 'vpternlogd' for 512-bit vectors because there is apparently no better
way to produce 512 one-bits.
The general advantages of this lowering are:
1. pcmpeq has lower latency than a memop on every uarch I looked at in Agner's tables,
so in theory, this could be better for perf, but...
2. That seems unlikely to affect any OOO implementation, and I can't measure any real
perf difference from this transform on Haswell or Jaguar, but...
3. It doesn't look like it from the diffs, but this is an overall size win because we
eliminate 16 - 64 constant bytes in the case of a vector load. If we're broadcasting
a scalar load (which might itself be a bug), then we're replacing a scalar constant
load + broadcast with a single cheap op, so that should always be smaller/better too.
4. This makes the DAG/isel output more consistent - we use pcmpeq already for padd x, -1
and psub x, -1, so we should use that form for +1 too because we can. If there's some
reason to favor a constant load on some CPU, let's make the reverse transform for all
of these cases (either here in the DAG or in a later machine pass).
This should fix:
https://bugs.llvm.org/show_bug.cgi?id=33483
Differential Revision: https://reviews.llvm.org/D34336
llvm-svn: 306289
2017-06-26 22:19:26 +08:00
|
|
|
; AVX1-NEXT: vmovups %ymm1, (%rax)
|
2017-06-23 22:38:00 +08:00
|
|
|
; AVX1-NEXT: vzeroupper
|
|
|
|
; AVX1-NEXT: retq
|
|
|
|
;
|
2016-08-26 01:17:46 +08:00
|
|
|
; AVX2-LABEL: avg_v32i16_2:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpcmpeqd %ymm4, %ymm4, %ymm4
; AVX2-NEXT: vpsubd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpsubd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpsubd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpsubd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
; AVX2-NEXT: vpshufb %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: vpshufb %ymm4, %ymm2, %ymm1
; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,2,3]
; AVX2-NEXT: vpshufb %ymm4, %ymm3, %ymm2
; AVX2-NEXT: vpermq {{.*#+}} ymm2 = ymm2[0,2,2,3]
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i16_2:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpternlogd $255, %zmm2, %zmm2, %zmm2
; AVX512F-NEXT: vpsubd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpsubd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpmovdw %zmm0, (%rax)
; AVX512F-NEXT: vpmovdw %zmm1, (%rax)
; AVX512F-NEXT: vzeroupper
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i16_2:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa64 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgw (%rsi), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
%1 = load <32 x i16>, <32 x i16>* %a
%2 = load <32 x i16>, <32 x i16>* %b
%3 = zext <32 x i16> %1 to <32 x i32>
%4 = zext <32 x i16> %2 to <32 x i32>
%5 = add nuw nsw <32 x i32> %3, %4
%6 = add nuw nsw <32 x i32> %5, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%7 = lshr <32 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%8 = trunc <32 x i32> %7 to <32 x i16>
store <32 x i16> %8, <32 x i16>* undef, align 4
ret void
}

define void @avg_v4i8_const(<4 x i8>* %a) nounwind {
; SSE2-LABEL: avg_v4i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
; SSE2-NEXT: movd %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v4i8_const:
; AVX: # BB#0:
; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX-NEXT: vmovd %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <4 x i8>, <4 x i8>* %a
%2 = zext <4 x i8> %1 to <4 x i32>
%3 = add nuw nsw <4 x i32> %2, <i32 1, i32 2, i32 3, i32 4>
%4 = lshr <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
%5 = trunc <4 x i32> %4 to <4 x i8>
store <4 x i8> %5, <4 x i8>* undef, align 4
ret void
}

define void @avg_v8i8_const(<8 x i8>* %a) nounwind {
; SSE2-LABEL: avg_v8i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
; SSE2-NEXT: movq %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v8i8_const:
; AVX: # BB#0:
; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX-NEXT: vmovq %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <8 x i8>, <8 x i8>* %a
%2 = zext <8 x i8> %1 to <8 x i32>
%3 = add nuw nsw <8 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
%4 = lshr <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%5 = trunc <8 x i32> %4 to <8 x i8>
store <8 x i8> %5, <8 x i8>* undef, align 4
ret void
}

define void @avg_v16i8_const(<16 x i8>* %a) nounwind {
; SSE2-LABEL: avg_v16i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgb {{.*}}(%rip), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v16i8_const:
; AVX: # BB#0:
; AVX-NEXT: vmovdqa (%rdi), %xmm0
; AVX-NEXT: vpavgb {{.*}}(%rip), %xmm0, %xmm0
; AVX-NEXT: vmovdqu %xmm0, (%rax)
; AVX-NEXT: retq
%1 = load <16 x i8>, <16 x i8>* %a
%2 = zext <16 x i8> %1 to <16 x i32>
%3 = add nuw nsw <16 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
%4 = lshr <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%5 = trunc <16 x i32> %4 to <16 x i8>
store <16 x i8> %5, <16 x i8>* undef, align 4
ret void
}

define void @avg_v32i8_const(<32 x i8>* %a) nounwind {
; SSE2-LABEL: avg_v32i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: movdqa 16(%rdi), %xmm3
; SSE2-NEXT: pxor %xmm4, %xmm4
; SSE2-NEXT: movdqa %xmm3, %xmm1
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3],xmm1[4],xmm4[4],xmm1[5],xmm4[5],xmm1[6],xmm4[6],xmm1[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm1, %xmm7
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT: punpckhbw {{.*#+}} xmm3 = xmm3[8],xmm4[8],xmm3[9],xmm4[9],xmm3[10],xmm4[10],xmm3[11],xmm4[11],xmm3[12],xmm4[12],xmm3[13],xmm4[13],xmm3[14],xmm4[14],xmm3[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm3, %xmm6
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm4[4],xmm6[5],xmm4[5],xmm6[6],xmm4[6],xmm6[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1],xmm3[2],xmm4[2],xmm3[3],xmm4[3]
; SSE2-NEXT: movdqa %xmm0, %xmm2
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3],xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm2, %xmm5
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm4[8],xmm0[9],xmm4[9],xmm0[10],xmm4[10],xmm0[11],xmm4[11],xmm0[12],xmm4[12],xmm0[13],xmm4[13],xmm0[14],xmm4[14],xmm0[15],xmm4[15]
; SSE2-NEXT: movdqa %xmm0, %xmm8
; SSE2-NEXT: punpckhwd {{.*#+}} xmm8 = xmm8[4],xmm4[4],xmm8[5],xmm4[5],xmm8[6],xmm4[6],xmm8[7],xmm4[7]
; SSE2-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
; SSE2-NEXT: movdqa {{.*#+}} xmm9 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm9, %xmm0
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm4, %xmm8
; SSE2-NEXT: paddd %xmm9, %xmm2
; SSE2-NEXT: paddd %xmm4, %xmm5
; SSE2-NEXT: paddd %xmm9, %xmm3
; SSE2-NEXT: paddd %xmm4, %xmm6
; SSE2-NEXT: paddd %xmm9, %xmm1
; SSE2-NEXT: paddd %xmm4, %xmm7
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: packuswb %xmm7, %xmm1
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: packuswb %xmm6, %xmm3
; SSE2-NEXT: packuswb %xmm3, %xmm1
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: packuswb %xmm8, %xmm0
; SSE2-NEXT: packuswb %xmm0, %xmm2
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v32i8_const:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa {{.*#+}} xmm0 = [1,2,3,4]
; AVX1-NEXT: vpaddd %xmm0, %xmm7, %xmm9
; AVX1-NEXT: vmovdqa {{.*#+}} xmm7 = [5,6,7,8]
; AVX1-NEXT: vpaddd %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpaddd %xmm0, %xmm5, %xmm5
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpaddd %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX1-NEXT: vpaddd %xmm7, %xmm8, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpackssdw %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm2
; AVX1-NEXT: vpackssdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpackuswb %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm2
; AVX1-NEXT: vpackssdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm9, %xmm3
; AVX1-NEXT: vpackssdw %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpackuswb %xmm2, %xmm1, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v32i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgb {{.*}}(%rip), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v32i8_const:
|
|
|
|
; AVX512: # BB#0:
|
|
|
|
; AVX512-NEXT: vmovdqa (%rdi), %ymm0
|
|
|
|
; AVX512-NEXT: vpavgb {{.*}}(%rip), %ymm0, %ymm0
|
|
|
|
; AVX512-NEXT: vmovdqu %ymm0, (%rax)
|
|
|
|
; AVX512-NEXT: vzeroupper
|
|
|
|
; AVX512-NEXT: retq
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%1 = load <32 x i8>, <32 x i8>* %a
|
|
|
|
%2 = zext <32 x i8> %1 to <32 x i32>
|
|
|
|
%3 = add nuw nsw <32 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
|
2015-12-01 05:46:08 +08:00
|
|
|
%4 = lshr <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
|
[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
This patch detects the AVG pattern in vectorized code, which is simply
c = (a + b + 1) / 2, where a, b, and c have the same type which are vectors of
either unsigned i8 or unsigned i16. In the IR, i8/i16 will be promoted to
i32 before any arithmetic operations. The following IR shows such an example:
%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %N, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>
and with this patch it will be converted to a X86ISD::AVG instruction.
The pattern recognition is done when combining instructions just before type
legalization during instruction selection. We do it here because after type
legalization, it is much more difficult to do pattern recognition based
on many instructions that are doing type conversions. Therefore, for
target-specific instructions (like X86ISD::AVG), we need to take care of type
legalization by ourselves. However, as X86ISD::AVG behaves similarly to
ISD::ADD, I am wondering if there is a way to legalize operands and result
types of X86ISD::AVG together with ISD::ADD. It seems that the current design
doesn't support this idea.
Tests are added for SSE2, AVX2, and AVX512BW and both i8 and i16 types of
variant vector sizes.
Differential revision: http://reviews.llvm.org/D14761
llvm-svn: 253952
2015-11-24 13:44:19 +08:00
|
|
|
%5 = trunc <32 x i32> %4 to <32 x i8>
|
|
|
|
store <32 x i8> %5, <32 x i8>* undef, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|
2017-09-12 15:50:35 +08:00
|
|
|
define void @avg_v64i8_const(<64 x i8>* %a) nounwind {
; SSE2-LABEL: avg_v64i8_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm5
; SSE2-NEXT: movdqa 16(%rdi), %xmm6
; SSE2-NEXT: movdqa 32(%rdi), %xmm15
; SSE2-NEXT: movdqa 48(%rdi), %xmm11
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: movdqa %xmm11, %xmm1
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm10
; SSE2-NEXT: punpcklwd {{.*#+}} xmm10 = xmm10[0],xmm0[0],xmm10[1],xmm0[1],xmm10[2],xmm0[2],xmm10[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm1, %xmm9
; SSE2-NEXT: punpckhbw {{.*#+}} xmm11 = xmm11[8],xmm0[8],xmm11[9],xmm0[9],xmm11[10],xmm0[10],xmm11[11],xmm0[11],xmm11[12],xmm0[12],xmm11[13],xmm0[13],xmm11[14],xmm0[14],xmm11[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm11, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: movdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: punpckhwd {{.*#+}} xmm11 = xmm11[4],xmm0[4],xmm11[5],xmm0[5],xmm11[6],xmm0[6],xmm11[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm15, %xmm14
; SSE2-NEXT: punpcklbw {{.*#+}} xmm14 = xmm14[0],xmm0[0],xmm14[1],xmm0[1],xmm14[2],xmm0[2],xmm14[3],xmm0[3],xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm14, %xmm13
; SSE2-NEXT: punpcklwd {{.*#+}} xmm13 = xmm13[0],xmm0[0],xmm13[1],xmm0[1],xmm13[2],xmm0[2],xmm13[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm14 = xmm14[4],xmm0[4],xmm14[5],xmm0[5],xmm14[6],xmm0[6],xmm14[7],xmm0[7]
; SSE2-NEXT: punpckhbw {{.*#+}} xmm15 = xmm15[8],xmm0[8],xmm15[9],xmm0[9],xmm15[10],xmm0[10],xmm15[11],xmm0[11],xmm15[12],xmm0[12],xmm15[13],xmm0[13],xmm15[14],xmm0[14],xmm15[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm15, %xmm12
; SSE2-NEXT: punpcklwd {{.*#+}} xmm12 = xmm12[0],xmm0[0],xmm12[1],xmm0[1],xmm12[2],xmm0[2],xmm12[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm15 = xmm15[4],xmm0[4],xmm15[5],xmm0[5],xmm15[6],xmm0[6],xmm15[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm6, %xmm3
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm0[0],xmm3[1],xmm0[1],xmm3[2],xmm0[2],xmm3[3],xmm0[3],xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm3, %xmm8
; SSE2-NEXT: punpcklwd {{.*#+}} xmm8 = xmm8[0],xmm0[0],xmm8[1],xmm0[1],xmm8[2],xmm0[2],xmm8[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm0[4],xmm3[5],xmm0[5],xmm3[6],xmm0[6],xmm3[7],xmm0[7]
; SSE2-NEXT: punpckhbw {{.*#+}} xmm6 = xmm6[8],xmm0[8],xmm6[9],xmm0[9],xmm6[10],xmm0[10],xmm6[11],xmm0[11],xmm6[12],xmm0[12],xmm6[13],xmm0[13],xmm6[14],xmm0[14],xmm6[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm6, %xmm4
; SSE2-NEXT: punpcklwd {{.*#+}} xmm4 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm5, %xmm2
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: movdqa %xmm2, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
; SSE2-NEXT: punpckhbw {{.*#+}} xmm5 = xmm5[8],xmm0[8],xmm5[9],xmm0[9],xmm5[10],xmm0[10],xmm5[11],xmm0[11],xmm5[12],xmm0[12],xmm5[13],xmm0[13],xmm5[14],xmm0[14],xmm5[15],xmm0[15]
; SSE2-NEXT: movdqa %xmm5, %xmm7
; SSE2-NEXT: punpcklwd {{.*#+}} xmm7 = xmm7[0],xmm0[0],xmm7[1],xmm0[1],xmm7[2],xmm0[2],xmm7[3],xmm0[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm5 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm0, %xmm5
; SSE2-NEXT: paddd %xmm0, %xmm2
; SSE2-NEXT: paddd %xmm0, %xmm6
; SSE2-NEXT: paddd %xmm0, %xmm3
; SSE2-NEXT: paddd %xmm0, %xmm15
; SSE2-NEXT: paddd %xmm0, %xmm14
; SSE2-NEXT: paddd %xmm0, %xmm11
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: movdqa %xmm9, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm0, %xmm7
; SSE2-NEXT: paddd %xmm0, %xmm1
; SSE2-NEXT: paddd %xmm0, %xmm4
; SSE2-NEXT: paddd %xmm0, %xmm8
; SSE2-NEXT: paddd %xmm0, %xmm12
; SSE2-NEXT: paddd %xmm0, %xmm13
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm9 # 16-byte Reload
; SSE2-NEXT: paddd %xmm0, %xmm9
; SSE2-NEXT: movdqa %xmm9, -{{[0-9]+}}(%rsp) # 16-byte Spill
; SSE2-NEXT: paddd %xmm0, %xmm10
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: movdqa {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; SSE2-NEXT: pand %xmm0, %xmm5
; SSE2-NEXT: pand %xmm0, %xmm7
; SSE2-NEXT: packuswb %xmm5, %xmm7
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: pand %xmm0, %xmm1
; SSE2-NEXT: packuswb %xmm2, %xmm1
; SSE2-NEXT: packuswb %xmm7, %xmm1
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: pand %xmm0, %xmm6
; SSE2-NEXT: pand %xmm0, %xmm4
; SSE2-NEXT: packuswb %xmm6, %xmm4
; SSE2-NEXT: psrld $1, %xmm8
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm8
; SSE2-NEXT: packuswb %xmm3, %xmm8
; SSE2-NEXT: packuswb %xmm4, %xmm8
; SSE2-NEXT: psrld $1, %xmm12
; SSE2-NEXT: psrld $1, %xmm15
; SSE2-NEXT: pand %xmm0, %xmm15
; SSE2-NEXT: pand %xmm0, %xmm12
; SSE2-NEXT: packuswb %xmm15, %xmm12
; SSE2-NEXT: psrld $1, %xmm13
; SSE2-NEXT: psrld $1, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm14
; SSE2-NEXT: pand %xmm0, %xmm13
; SSE2-NEXT: packuswb %xmm14, %xmm13
; SSE2-NEXT: packuswb %xmm12, %xmm13
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm2 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm11
; SSE2-NEXT: pand %xmm0, %xmm2
; SSE2-NEXT: packuswb %xmm11, %xmm2
; SSE2-NEXT: psrld $1, %xmm10
; SSE2-NEXT: movdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm3
; SSE2-NEXT: pand %xmm0, %xmm10
; SSE2-NEXT: packuswb %xmm3, %xmm10
; SSE2-NEXT: packuswb %xmm2, %xmm10
; SSE2-NEXT: movdqu %xmm10, (%rax)
; SSE2-NEXT: movdqu %xmm13, (%rax)
; SSE2-NEXT: movdqu %xmm8, (%rax)
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v64i8_const:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm9 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm14 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm11 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm8 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm13 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm10 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm12 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vmovdqa {{.*#+}} xmm0 = [5,6,7,8]
; AVX1-NEXT: vpaddd %xmm0, %xmm5, %xmm5
; AVX1-NEXT: vpaddd %xmm0, %xmm7, %xmm15
; AVX1-NEXT: vpaddd %xmm0, %xmm13, %xmm13
; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm7
; AVX1-NEXT: vpaddd %xmm0, %xmm11, %xmm11
; AVX1-NEXT: vpaddd %xmm0, %xmm3, %xmm1
; AVX1-NEXT: vmovdqa %xmm1, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vpaddd %xmm0, %xmm9, %xmm9
; AVX1-NEXT: vpaddd %xmm0, %xmm2, %xmm0
; AVX1-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp) # 16-byte Spill
; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [1,2,3,4]
; AVX1-NEXT: vpaddd %xmm2, %xmm12, %xmm0
; AVX1-NEXT: vpaddd %xmm2, %xmm10, %xmm10
; AVX1-NEXT: vpaddd %xmm2, %xmm8, %xmm8
; AVX1-NEXT: vpaddd %xmm2, %xmm4, %xmm4
; AVX1-NEXT: vpaddd %xmm2, %xmm6, %xmm1
; AVX1-NEXT: vpaddd %xmm2, %xmm14, %xmm6
; AVX1-NEXT: vpaddd -{{[0-9]+}}(%rsp), %xmm2, %xmm12 # 16-byte Folded Reload
; AVX1-NEXT: vpmovzxbd {{.*#+}} xmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; AVX1-NEXT: vpaddd %xmm2, %xmm3, %xmm14
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm3
; AVX1-NEXT: vmovdqa {{.*#+}} xmm5 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm0, %xmm0
; AVX1-NEXT: vpackuswb %xmm3, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm10, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm15, %xmm2
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpackuswb %xmm0, %xmm2, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm8, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm13, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpackuswb %xmm3, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm7, %xmm4
; AVX1-NEXT: vpand %xmm5, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm2, %ymm0
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm11, %xmm2
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpand %xmm5, %xmm1, %xmm1
; AVX1-NEXT: vpackuswb %xmm2, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm2
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm3 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpackuswb %xmm3, %xmm2, %xmm2
; AVX1-NEXT: vpackuswb %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm12, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm9, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpand %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpackuswb %xmm3, %xmm2, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm14, %xmm3
; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm4 # 16-byte Reload
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm5, %xmm4, %xmm4
; AVX1-NEXT: vpand %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpackuswb %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; AVX1-NEXT: vmovups %ymm1, (%rax)
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v64i8_const:
; AVX2: # BB#0:
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm4 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm5 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm6 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vpmovzxbd {{.*#+}} ymm7 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero
; AVX2-NEXT: vmovdqa {{.*#+}} ymm8 = [1,2,3,4,5,6,7,8]
; AVX2-NEXT: vpaddd %ymm8, %ymm7, %ymm7
; AVX2-NEXT: vpaddd %ymm8, %ymm6, %ymm6
; AVX2-NEXT: vpaddd %ymm8, %ymm5, %ymm5
; AVX2-NEXT: vpaddd %ymm8, %ymm4, %ymm4
; AVX2-NEXT: vpaddd %ymm8, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm8, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm8, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm8, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm8
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vpsrld $1, %ymm4, %ymm4
; AVX2-NEXT: vpsrld $1, %ymm5, %ymm5
; AVX2-NEXT: vpsrld $1, %ymm6, %ymm6
; AVX2-NEXT: vpsrld $1, %ymm7, %ymm7
; AVX2-NEXT: vextracti128 $1, %ymm7, %xmm0
; AVX2-NEXT: vpackssdw %xmm0, %xmm7, %xmm0
; AVX2-NEXT: vextracti128 $1, %ymm6, %xmm7
; AVX2-NEXT: vpackssdw %xmm7, %xmm6, %xmm6
; AVX2-NEXT: vpackuswb %xmm0, %xmm6, %xmm0
; AVX2-NEXT: vextracti128 $1, %ymm5, %xmm6
; AVX2-NEXT: vpackssdw %xmm6, %xmm5, %xmm5
; AVX2-NEXT: vextracti128 $1, %ymm4, %xmm6
; AVX2-NEXT: vpackssdw %xmm6, %xmm4, %xmm4
; AVX2-NEXT: vpackuswb %xmm5, %xmm4, %xmm4
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm4, %ymm0
; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm4
; AVX2-NEXT: vpackssdw %xmm4, %xmm3, %xmm3
; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm4
; AVX2-NEXT: vpackssdw %xmm4, %xmm2, %xmm2
; AVX2-NEXT: vpackuswb %xmm3, %xmm2, %xmm2
; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3
; AVX2-NEXT: vpackssdw %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vextracti128 $1, %ymm8, %xmm3
; AVX2-NEXT: vpackssdw %xmm3, %xmm8, %xmm3
; AVX2-NEXT: vpackuswb %xmm1, %xmm3, %xmm1
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT: vmovdqu %ymm1, (%rax)
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm2 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vpmovzxbd {{.*#+}} zmm3 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero,mem[4],zero,zero,zero,mem[5],zero,zero,zero,mem[6],zero,zero,zero,mem[7],zero,zero,zero,mem[8],zero,zero,zero,mem[9],zero,zero,zero,mem[10],zero,zero,zero,mem[11],zero,zero,zero,mem[12],zero,zero,zero,mem[13],zero,zero,zero,mem[14],zero,zero,zero,mem[15],zero,zero,zero
; AVX512F-NEXT: vbroadcasti64x4 {{.*#+}} zmm4 = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
; AVX512F-NEXT: # zmm4 = mem[0,1,2,3,0,1,2,3]
; AVX512F-NEXT: vpaddd %zmm4, %zmm3, %zmm3
; AVX512F-NEXT: vpaddd %zmm4, %zmm2, %zmm2
; AVX512F-NEXT: vpaddd %zmm4, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm4, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpsrld $1, %zmm2, %zmm2
; AVX512F-NEXT: vpsrld $1, %zmm3, %zmm3
; AVX512F-NEXT: vpmovdb %zmm3, %xmm3
; AVX512F-NEXT: vpmovdb %zmm2, %xmm2
; AVX512F-NEXT: vinserti128 $1, %xmm2, %ymm3, %ymm2
; AVX512F-NEXT: vpmovdb %zmm1, %xmm1
; AVX512F-NEXT: vpmovdb %zmm0, %xmm0
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: vmovdqu %ymm0, (%rax)
; AVX512F-NEXT: vmovdqu %ymm2, (%rax)
; AVX512F-NEXT: vzeroupper
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa64 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgb {{.*}}(%rip), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
  %1 = load <64 x i8>, <64 x i8>* %a
  %2 = zext <64 x i8> %1 to <64 x i32>
  %3 = add nuw nsw <64 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <64 x i32> %4 to <64 x i8>
  store <64 x i8> %5, <64 x i8>* undef, align 4
  ret void
}

define void @avg_v4i16_const(<4 x i16>* %a) nounwind {
; SSE2-LABEL: avg_v4i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
; SSE2-NEXT: pavgw {{.*}}(%rip), %xmm0
; SSE2-NEXT: movq %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v4i16_const:
; AVX: # BB#0:
; AVX-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
; AVX-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX-NEXT: vmovq %xmm0, (%rax)
; AVX-NEXT: retq
  %1 = load <4 x i16>, <4 x i16>* %a
  %2 = zext <4 x i16> %1 to <4 x i32>
  %3 = add nuw nsw <4 x i32> %2, <i32 1, i32 2, i32 3, i32 4>
  %4 = lshr <4 x i32> %3, <i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <4 x i32> %4 to <4 x i16>
  store <4 x i16> %5, <4 x i16>* undef, align 4
  ret void
}

define void @avg_v8i16_const(<8 x i16>* %a) nounwind {
; SSE2-LABEL: avg_v8i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm0
; SSE2-NEXT: pavgw {{.*}}(%rip), %xmm0
; SSE2-NEXT: movdqu %xmm0, (%rax)
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v8i16_const:
; AVX: # BB#0:
; AVX-NEXT: vmovdqa (%rdi), %xmm0
; AVX-NEXT: vpavgw {{.*}}(%rip), %xmm0, %xmm0
; AVX-NEXT: vmovdqu %xmm0, (%rax)
; AVX-NEXT: retq
  %1 = load <8 x i16>, <8 x i16>* %a
  %2 = zext <8 x i16> %1 to <8 x i32>
  %3 = add nuw nsw <8 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <8 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <8 x i32> %4 to <8 x i16>
  store <8 x i16> %5, <8 x i16>* undef, align 4
  ret void
}

define void @avg_v16i16_const(<16 x i16>* %a) nounwind {
; SSE2-LABEL: avg_v16i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm3
; SSE2-NEXT: movdqa 16(%rdi), %xmm0
; SSE2-NEXT: pxor %xmm4, %xmm4
; SSE2-NEXT: movdqa %xmm0, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1],xmm1[2],xmm4[2],xmm1[3],xmm4[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm3 = xmm3[4],xmm4[4],xmm3[5],xmm4[5],xmm3[6],xmm4[6],xmm3[7],xmm4[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm4 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm4, %xmm3
; SSE2-NEXT: movdqa {{.*#+}} xmm5 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm5, %xmm2
; SSE2-NEXT: paddd %xmm4, %xmm0
; SSE2-NEXT: paddd %xmm5, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm3, %xmm2
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm0, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v16i16_const:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vmovdqa {{.*#+}} xmm4 = [1,2,3,4]
; AVX1-NEXT: vpaddd %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vmovdqa {{.*#+}} xmm5 = [5,6,7,8]
; AVX1-NEXT: vpaddd %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpaddd %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpaddd %xmm5, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpackusdw %xmm0, %xmm1, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm2
; AVX1-NEXT: vpackusdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v16i16_const:
; AVX2: # BB#0:
; AVX2-NEXT: vmovdqa (%rdi), %ymm0
; AVX2-NEXT: vpavgw {{.*}}(%rip), %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v16i16_const:
; AVX512: # BB#0:
; AVX512-NEXT: vmovdqa (%rdi), %ymm0
; AVX512-NEXT: vpavgw {{.*}}(%rip), %ymm0, %ymm0
; AVX512-NEXT: vmovdqu %ymm0, (%rax)
; AVX512-NEXT: vzeroupper
; AVX512-NEXT: retq
  %1 = load <16 x i16>, <16 x i16>* %a
  %2 = zext <16 x i16> %1 to <16 x i32>
  %3 = add nuw nsw <16 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  %4 = lshr <16 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %5 = trunc <16 x i32> %4 to <16 x i16>
  store <16 x i16> %5, <16 x i16>* undef, align 4
  ret void
}

define void @avg_v32i16_const(<32 x i16>* %a) nounwind {
; SSE2-LABEL: avg_v32i16_const:
; SSE2: # BB#0:
; SSE2-NEXT: movdqa (%rdi), %xmm7
; SSE2-NEXT: movdqa 16(%rdi), %xmm6
; SSE2-NEXT: movdqa 32(%rdi), %xmm4
; SSE2-NEXT: movdqa 48(%rdi), %xmm0
; SSE2-NEXT: pxor %xmm8, %xmm8
; SSE2-NEXT: movdqa %xmm0, %xmm1
; SSE2-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm8[0],xmm1[1],xmm8[1],xmm1[2],xmm8[2],xmm1[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm8[4],xmm0[5],xmm8[5],xmm0[6],xmm8[6],xmm0[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm4, %xmm2
; SSE2-NEXT: punpcklwd {{.*#+}} xmm2 = xmm2[0],xmm8[0],xmm2[1],xmm8[1],xmm2[2],xmm8[2],xmm2[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm4 = xmm4[4],xmm8[4],xmm4[5],xmm8[5],xmm4[6],xmm8[6],xmm4[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm6, %xmm3
; SSE2-NEXT: punpcklwd {{.*#+}} xmm3 = xmm3[0],xmm8[0],xmm3[1],xmm8[1],xmm3[2],xmm8[2],xmm3[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm6 = xmm6[4],xmm8[4],xmm6[5],xmm8[5],xmm6[6],xmm8[6],xmm6[7],xmm8[7]
; SSE2-NEXT: movdqa %xmm7, %xmm5
; SSE2-NEXT: punpcklwd {{.*#+}} xmm5 = xmm5[0],xmm8[0],xmm5[1],xmm8[1],xmm5[2],xmm8[2],xmm5[3],xmm8[3]
; SSE2-NEXT: punpckhwd {{.*#+}} xmm7 = xmm7[4],xmm8[4],xmm7[5],xmm8[5],xmm7[6],xmm8[6],xmm7[7],xmm8[7]
; SSE2-NEXT: movdqa {{.*#+}} xmm8 = [5,6,7,8]
; SSE2-NEXT: paddd %xmm8, %xmm7
; SSE2-NEXT: movdqa {{.*#+}} xmm9 = [1,2,3,4]
; SSE2-NEXT: paddd %xmm9, %xmm5
; SSE2-NEXT: paddd %xmm8, %xmm6
; SSE2-NEXT: paddd %xmm9, %xmm3
; SSE2-NEXT: paddd %xmm8, %xmm4
; SSE2-NEXT: paddd %xmm9, %xmm2
; SSE2-NEXT: paddd %xmm8, %xmm0
; SSE2-NEXT: paddd %xmm9, %xmm1
; SSE2-NEXT: psrld $1, %xmm1
; SSE2-NEXT: psrld $1, %xmm0
; SSE2-NEXT: psrld $1, %xmm2
; SSE2-NEXT: psrld $1, %xmm4
; SSE2-NEXT: psrld $1, %xmm3
; SSE2-NEXT: psrld $1, %xmm6
; SSE2-NEXT: psrld $1, %xmm5
; SSE2-NEXT: psrld $1, %xmm7
; SSE2-NEXT: pslld $16, %xmm7
; SSE2-NEXT: psrad $16, %xmm7
; SSE2-NEXT: pslld $16, %xmm5
; SSE2-NEXT: psrad $16, %xmm5
; SSE2-NEXT: packssdw %xmm7, %xmm5
; SSE2-NEXT: pslld $16, %xmm6
; SSE2-NEXT: psrad $16, %xmm6
; SSE2-NEXT: pslld $16, %xmm3
; SSE2-NEXT: psrad $16, %xmm3
; SSE2-NEXT: packssdw %xmm6, %xmm3
; SSE2-NEXT: pslld $16, %xmm4
; SSE2-NEXT: psrad $16, %xmm4
; SSE2-NEXT: pslld $16, %xmm2
; SSE2-NEXT: psrad $16, %xmm2
; SSE2-NEXT: packssdw %xmm4, %xmm2
; SSE2-NEXT: pslld $16, %xmm0
; SSE2-NEXT: psrad $16, %xmm0
; SSE2-NEXT: pslld $16, %xmm1
; SSE2-NEXT: psrad $16, %xmm1
; SSE2-NEXT: packssdw %xmm0, %xmm1
; SSE2-NEXT: movdqu %xmm1, (%rax)
; SSE2-NEXT: movdqu %xmm2, (%rax)
; SSE2-NEXT: movdqu %xmm3, (%rax)
; SSE2-NEXT: movdqu %xmm5, (%rax)
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v32i16_const:
; AVX1: # BB#0:
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm8 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm5 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; AVX1-NEXT: vmovdqa {{.*#+}} xmm0 = [1,2,3,4]
; AVX1-NEXT: vpaddd %xmm0, %xmm7, %xmm9
; AVX1-NEXT: vmovdqa {{.*#+}} xmm7 = [5,6,7,8]
; AVX1-NEXT: vpaddd %xmm7, %xmm6, %xmm6
; AVX1-NEXT: vpaddd %xmm0, %xmm5, %xmm5
; AVX1-NEXT: vpaddd %xmm7, %xmm4, %xmm4
; AVX1-NEXT: vpaddd %xmm0, %xmm3, %xmm3
; AVX1-NEXT: vpaddd %xmm7, %xmm2, %xmm2
; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
; AVX1-NEXT: vpaddd %xmm7, %xmm8, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm0, %xmm0
; AVX1-NEXT: vpackusdw %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpsrld $1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm3, %xmm2
; AVX1-NEXT: vpackusdw %xmm1, %xmm2, %xmm1
; AVX1-NEXT: vpsrld $1, %xmm4, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm5, %xmm3
; AVX1-NEXT: vpackusdw %xmm2, %xmm3, %xmm2
; AVX1-NEXT: vpsrld $1, %xmm6, %xmm3
; AVX1-NEXT: vpsrld $1, %xmm9, %xmm4
; AVX1-NEXT: vpackusdw %xmm3, %xmm4, %xmm3
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm2, %ymm2
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; AVX1-NEXT: vmovups %ymm0, (%rax)
; AVX1-NEXT: vmovups %ymm2, (%rax)
; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v32i16_const:
|
|
|
|
; AVX2: # BB#0:
|
|
|
|
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm2 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm3 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; AVX2-NEXT: vmovdqa {{.*#+}} ymm4 = [1,2,3,4,5,6,7,8]
; AVX2-NEXT: vpaddd %ymm4, %ymm3, %ymm3
; AVX2-NEXT: vpaddd %ymm4, %ymm2, %ymm2
; AVX2-NEXT: vpaddd %ymm4, %ymm1, %ymm1
; AVX2-NEXT: vpaddd %ymm4, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm0, %ymm0
; AVX2-NEXT: vpsrld $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrld $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrld $1, %ymm3, %ymm3
; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm4
; AVX2-NEXT: vpackusdw %xmm4, %xmm3, %xmm3
; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm4
; AVX2-NEXT: vpackusdw %xmm4, %xmm2, %xmm2
; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm2, %ymm2
; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3
; AVX2-NEXT: vpackusdw %xmm3, %xmm1, %xmm1
; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm3
; AVX2-NEXT: vpackusdw %xmm3, %xmm0, %xmm0
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: vmovdqu %ymm0, (%rax)
; AVX2-NEXT: vmovdqu %ymm2, (%rax)
; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v32i16_const:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vpmovzxwd {{.*#+}} zmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
; AVX512F-NEXT: vbroadcasti64x4 {{.*#+}} zmm2 = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
; AVX512F-NEXT: # zmm2 = mem[0,1,2,3,0,1,2,3]
; AVX512F-NEXT: vpaddd %zmm2, %zmm1, %zmm1
; AVX512F-NEXT: vpaddd %zmm2, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm0, %zmm0
; AVX512F-NEXT: vpsrld $1, %zmm1, %zmm1
; AVX512F-NEXT: vpmovdw %zmm1, (%rax)
; AVX512F-NEXT: vpmovdw %zmm0, (%rax)
; AVX512F-NEXT: vzeroupper
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v32i16_const:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vmovdqa64 (%rdi), %zmm0
; AVX512BW-NEXT: vpavgw {{.*}}(%rip), %zmm0, %zmm0
; AVX512BW-NEXT: vmovdqu32 %zmm0, (%rax)
; AVX512BW-NEXT: vzeroupper
; AVX512BW-NEXT: retq
%1 = load <32 x i16>, <32 x i16>* %a
%2 = zext <32 x i16> %1 to <32 x i32>
%3 = add nuw nsw <32 x i32> %2, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
%4 = lshr <32 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
%5 = trunc <32 x i32> %4 to <32 x i16>
store <32 x i16> %5, <32 x i16>* undef, align 4
ret void
}
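
; NOTE: Explanatory comments only; the CHECK lines in this file are
; autogenerated by update_llc_test_checks.py. In this constant-offset variant
; the AVX512BW run folds the whole zext/add/lshr/trunc chain into a single
; vpavgw with a constant-pool memory operand, while the SSE2, AVX1, AVX2 and
; AVX512F runs keep the widened <32 x i32> arithmetic and narrow the result
; back to i16. A commented-out sketch of the per-lane idiom at an arbitrary
; narrower width (illustrative only, not one of the checked functions):
;
;   define <8 x i16> @avg_const_sketch(<8 x i16> %x) {
;     %z = zext <8 x i16> %x to <8 x i32>
;     %s = add nuw nsw <8 x i32> %z, <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
;     %h = lshr <8 x i32> %s, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
;     %t = trunc <8 x i32> %h to <8 x i16>
;     ret <8 x i16> %t
;   }
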
define <16 x i8> @avg_v16i8_3(<16 x i8> %a, <16 x i8> %b) nounwind {
; SSE2-LABEL: avg_v16i8_3:
; SSE2: # BB#0:
; SSE2-NEXT: pavgb %xmm1, %xmm0
; SSE2-NEXT: retq
;
; AVX-LABEL: avg_v16i8_3:
; AVX: # BB#0:
; AVX-NEXT: vpavgb %xmm1, %xmm0, %xmm0
; AVX-NEXT: retq
%za = zext <16 x i8> %a to <16 x i16>
%zb = zext <16 x i8> %b to <16 x i16>
%add = add nuw nsw <16 x i16> %za, %zb
%add1 = add nuw nsw <16 x i16> %add, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%lshr = lshr <16 x i16> %add1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%res = trunc <16 x i16> %lshr to <16 x i8>
ret <16 x i8> %res
}
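
; The "_3" variants add the two zero-extended operands first and apply the
; rounding +1 as a separate add afterwards; per lane this is still the
; (zext(a) + zext(b) + 1) >> 1 rounded average, and the <16 x i8> case above
; is matched to a single pavgb/vpavgb on every run.
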
define <32 x i8> @avg_v32i8_3(<32 x i8> %a, <32 x i8> %b) nounwind {
; SSE2-LABEL: avg_v32i8_3:
; SSE2: # BB#0:
; SSE2-NEXT: pxor %xmm5, %xmm5
; SSE2-NEXT: movdqa %xmm0, %xmm6
; SSE2-NEXT: punpckhbw {{.*#+}} xmm6 = xmm6[8],xmm5[8],xmm6[9],xmm5[9],xmm6[10],xmm5[10],xmm6[11],xmm5[11],xmm6[12],xmm5[12],xmm6[13],xmm5[13],xmm6[14],xmm5[14],xmm6[15],xmm5[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3],xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]
; SSE2-NEXT: movdqa %xmm1, %xmm7
; SSE2-NEXT: punpckhbw {{.*#+}} xmm7 = xmm7[8],xmm5[8],xmm7[9],xmm5[9],xmm7[10],xmm5[10],xmm7[11],xmm5[11],xmm7[12],xmm5[12],xmm7[13],xmm5[13],xmm7[14],xmm5[14],xmm7[15],xmm5[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1],xmm1[2],xmm5[2],xmm1[3],xmm5[3],xmm1[4],xmm5[4],xmm1[5],xmm5[5],xmm1[6],xmm5[6],xmm1[7],xmm5[7]
; SSE2-NEXT: movdqa %xmm2, %xmm4
; SSE2-NEXT: punpckhbw {{.*#+}} xmm4 = xmm4[8],xmm5[8],xmm4[9],xmm5[9],xmm4[10],xmm5[10],xmm4[11],xmm5[11],xmm4[12],xmm5[12],xmm4[13],xmm5[13],xmm4[14],xmm5[14],xmm4[15],xmm5[15]
; SSE2-NEXT: paddw %xmm6, %xmm4
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm5[0],xmm2[1],xmm5[1],xmm2[2],xmm5[2],xmm2[3],xmm5[3],xmm2[4],xmm5[4],xmm2[5],xmm5[5],xmm2[6],xmm5[6],xmm2[7],xmm5[7]
; SSE2-NEXT: paddw %xmm2, %xmm0
; SSE2-NEXT: movdqa %xmm3, %xmm2
; SSE2-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm5[8],xmm2[9],xmm5[9],xmm2[10],xmm5[10],xmm2[11],xmm5[11],xmm2[12],xmm5[12],xmm2[13],xmm5[13],xmm2[14],xmm5[14],xmm2[15],xmm5[15]
; SSE2-NEXT: paddw %xmm7, %xmm2
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1],xmm3[2],xmm5[2],xmm3[3],xmm5[3],xmm3[4],xmm5[4],xmm3[5],xmm5[5],xmm3[6],xmm5[6],xmm3[7],xmm5[7]
; SSE2-NEXT: paddw %xmm3, %xmm1
; SSE2-NEXT: pcmpeqd %xmm3, %xmm3
; SSE2-NEXT: psubw %xmm3, %xmm4
; SSE2-NEXT: psubw %xmm3, %xmm0
; SSE2-NEXT: psubw %xmm3, %xmm2
; SSE2-NEXT: psubw %xmm3, %xmm1
; SSE2-NEXT: psrlw $1, %xmm1
; SSE2-NEXT: psrlw $1, %xmm2
; SSE2-NEXT: psrlw $1, %xmm0
; SSE2-NEXT: psrlw $1, %xmm4
; SSE2-NEXT: movdqa {{.*#+}} xmm3 = [255,255,255,255,255,255,255,255]
; SSE2-NEXT: pand %xmm3, %xmm4
; SSE2-NEXT: pand %xmm3, %xmm0
; SSE2-NEXT: packuswb %xmm4, %xmm0
; SSE2-NEXT: pand %xmm3, %xmm2
; SSE2-NEXT: pand %xmm3, %xmm1
; SSE2-NEXT: packuswb %xmm2, %xmm1
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v32i8_3:
; AVX1: # BB#0:
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm3 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm4 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm6 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero,xmm5[4],zero,xmm5[5],zero,xmm5[6],zero,xmm5[7],zero
; AVX1-NEXT: vpaddw %xmm6, %xmm3, %xmm3
; AVX1-NEXT: vpshufd {{.*#+}} xmm5 = xmm5[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm5 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero,xmm5[4],zero,xmm5[5],zero,xmm5[6],zero,xmm5[7],zero
; AVX1-NEXT: vpaddw %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm5 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vpaddw %xmm5, %xmm4, %xmm4
; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vpaddw %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpcmpeqd %xmm1, %xmm1, %xmm1
; AVX1-NEXT: vpsubw %xmm1, %xmm3, %xmm3
; AVX1-NEXT: vpsubw %xmm1, %xmm2, %xmm2
; AVX1-NEXT: vpsubw %xmm1, %xmm4, %xmm4
; AVX1-NEXT: vpsubw %xmm1, %xmm0, %xmm0
; AVX1-NEXT: vpsrlw $1, %xmm0, %xmm0
; AVX1-NEXT: vpsrlw $1, %xmm4, %xmm1
; AVX1-NEXT: vpsrlw $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrlw $1, %xmm3, %xmm3
; AVX1-NEXT: vmovdqa {{.*#+}} xmm4 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX1-NEXT: vpshufb %xmm4, %xmm3, %xmm3
; AVX1-NEXT: vpshufb %xmm4, %xmm2, %xmm2
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm3[0],xmm2[0]
; AVX1-NEXT: vpshufb %xmm4, %xmm1, %xmm1
; AVX1-NEXT: vpshufb %xmm4, %xmm0, %xmm0
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v32i8_3:
; AVX2: # BB#0:
; AVX2-NEXT: vpavgb %ymm1, %ymm0, %ymm0
; AVX2-NEXT: retq
;
; AVX512-LABEL: avg_v32i8_3:
; AVX512: # BB#0:
; AVX512-NEXT: vpavgb %ymm1, %ymm0, %ymm0
; AVX512-NEXT: retq
%za = zext <32 x i8> %a to <32 x i16>
%zb = zext <32 x i8> %b to <32 x i16>
%add = add nuw nsw <32 x i16> %za, %zb
%add1 = add nuw nsw <32 x i16> %add, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%lshr = lshr <32 x i16> %add1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%res = trunc <32 x i16> %lshr to <32 x i8>
ret <32 x i8> %res
}
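
; For <32 x i8> only the 256-bit integer runs above (AVX2 and AVX512) match
; the pattern to a single ymm vpavgb; the SSE2 and AVX1 runs instead widen to
; i16 lanes, materialize the rounding +1 by subtracting the all-ones register,
; shift right by one, and pack the bytes back together.
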
define <64 x i8> @avg_v64i8_3(<64 x i8> %a, <64 x i8> %b) nounwind {
; SSE2-LABEL: avg_v64i8_3:
; SSE2: # BB#0:
; SSE2-NEXT: pxor %xmm9, %xmm9
; SSE2-NEXT: movdqa %xmm0, %xmm10
; SSE2-NEXT: punpckhbw {{.*#+}} xmm10 = xmm10[8],xmm9[8],xmm10[9],xmm9[9],xmm10[10],xmm9[10],xmm10[11],xmm9[11],xmm10[12],xmm9[12],xmm10[13],xmm9[13],xmm10[14],xmm9[14],xmm10[15],xmm9[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm9[0],xmm0[1],xmm9[1],xmm0[2],xmm9[2],xmm0[3],xmm9[3],xmm0[4],xmm9[4],xmm0[5],xmm9[5],xmm0[6],xmm9[6],xmm0[7],xmm9[7]
; SSE2-NEXT: movdqa %xmm1, %xmm11
; SSE2-NEXT: punpckhbw {{.*#+}} xmm11 = xmm11[8],xmm9[8],xmm11[9],xmm9[9],xmm11[10],xmm9[10],xmm11[11],xmm9[11],xmm11[12],xmm9[12],xmm11[13],xmm9[13],xmm11[14],xmm9[14],xmm11[15],xmm9[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm9[0],xmm1[1],xmm9[1],xmm1[2],xmm9[2],xmm1[3],xmm9[3],xmm1[4],xmm9[4],xmm1[5],xmm9[5],xmm1[6],xmm9[6],xmm1[7],xmm9[7]
; SSE2-NEXT: movdqa %xmm2, %xmm12
; SSE2-NEXT: punpckhbw {{.*#+}} xmm12 = xmm12[8],xmm9[8],xmm12[9],xmm9[9],xmm12[10],xmm9[10],xmm12[11],xmm9[11],xmm12[12],xmm9[12],xmm12[13],xmm9[13],xmm12[14],xmm9[14],xmm12[15],xmm9[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm9[0],xmm2[1],xmm9[1],xmm2[2],xmm9[2],xmm2[3],xmm9[3],xmm2[4],xmm9[4],xmm2[5],xmm9[5],xmm2[6],xmm9[6],xmm2[7],xmm9[7]
; SSE2-NEXT: movdqa %xmm3, %xmm13
; SSE2-NEXT: punpckhbw {{.*#+}} xmm13 = xmm13[8],xmm9[8],xmm13[9],xmm9[9],xmm13[10],xmm9[10],xmm13[11],xmm9[11],xmm13[12],xmm9[12],xmm13[13],xmm9[13],xmm13[14],xmm9[14],xmm13[15],xmm9[15]
; SSE2-NEXT: punpcklbw {{.*#+}} xmm3 = xmm3[0],xmm9[0],xmm3[1],xmm9[1],xmm3[2],xmm9[2],xmm3[3],xmm9[3],xmm3[4],xmm9[4],xmm3[5],xmm9[5],xmm3[6],xmm9[6],xmm3[7],xmm9[7]
; SSE2-NEXT: movdqa %xmm4, %xmm8
; SSE2-NEXT: punpckhbw {{.*#+}} xmm8 = xmm8[8],xmm9[8],xmm8[9],xmm9[9],xmm8[10],xmm9[10],xmm8[11],xmm9[11],xmm8[12],xmm9[12],xmm8[13],xmm9[13],xmm8[14],xmm9[14],xmm8[15],xmm9[15]
; SSE2-NEXT: paddw %xmm10, %xmm8
; SSE2-NEXT: punpcklbw {{.*#+}} xmm4 = xmm4[0],xmm9[0],xmm4[1],xmm9[1],xmm4[2],xmm9[2],xmm4[3],xmm9[3],xmm4[4],xmm9[4],xmm4[5],xmm9[5],xmm4[6],xmm9[6],xmm4[7],xmm9[7]
; SSE2-NEXT: paddw %xmm4, %xmm0
; SSE2-NEXT: movdqa %xmm5, %xmm4
; SSE2-NEXT: punpckhbw {{.*#+}} xmm4 = xmm4[8],xmm9[8],xmm4[9],xmm9[9],xmm4[10],xmm9[10],xmm4[11],xmm9[11],xmm4[12],xmm9[12],xmm4[13],xmm9[13],xmm4[14],xmm9[14],xmm4[15],xmm9[15]
; SSE2-NEXT: paddw %xmm11, %xmm4
; SSE2-NEXT: punpcklbw {{.*#+}} xmm5 = xmm5[0],xmm9[0],xmm5[1],xmm9[1],xmm5[2],xmm9[2],xmm5[3],xmm9[3],xmm5[4],xmm9[4],xmm5[5],xmm9[5],xmm5[6],xmm9[6],xmm5[7],xmm9[7]
; SSE2-NEXT: paddw %xmm5, %xmm1
; SSE2-NEXT: movdqa %xmm6, %xmm5
; SSE2-NEXT: punpckhbw {{.*#+}} xmm5 = xmm5[8],xmm9[8],xmm5[9],xmm9[9],xmm5[10],xmm9[10],xmm5[11],xmm9[11],xmm5[12],xmm9[12],xmm5[13],xmm9[13],xmm5[14],xmm9[14],xmm5[15],xmm9[15]
; SSE2-NEXT: paddw %xmm12, %xmm5
; SSE2-NEXT: punpcklbw {{.*#+}} xmm6 = xmm6[0],xmm9[0],xmm6[1],xmm9[1],xmm6[2],xmm9[2],xmm6[3],xmm9[3],xmm6[4],xmm9[4],xmm6[5],xmm9[5],xmm6[6],xmm9[6],xmm6[7],xmm9[7]
; SSE2-NEXT: paddw %xmm6, %xmm2
; SSE2-NEXT: movdqa %xmm7, %xmm6
; SSE2-NEXT: punpckhbw {{.*#+}} xmm6 = xmm6[8],xmm9[8],xmm6[9],xmm9[9],xmm6[10],xmm9[10],xmm6[11],xmm9[11],xmm6[12],xmm9[12],xmm6[13],xmm9[13],xmm6[14],xmm9[14],xmm6[15],xmm9[15]
; SSE2-NEXT: paddw %xmm13, %xmm6
; SSE2-NEXT: punpcklbw {{.*#+}} xmm7 = xmm7[0],xmm9[0],xmm7[1],xmm9[1],xmm7[2],xmm9[2],xmm7[3],xmm9[3],xmm7[4],xmm9[4],xmm7[5],xmm9[5],xmm7[6],xmm9[6],xmm7[7],xmm9[7]
; SSE2-NEXT: paddw %xmm7, %xmm3
; SSE2-NEXT: pcmpeqd %xmm7, %xmm7
; SSE2-NEXT: psubw %xmm7, %xmm8
; SSE2-NEXT: psubw %xmm7, %xmm0
; SSE2-NEXT: psubw %xmm7, %xmm4
; SSE2-NEXT: psubw %xmm7, %xmm1
; SSE2-NEXT: psubw %xmm7, %xmm5
; SSE2-NEXT: psubw %xmm7, %xmm2
; SSE2-NEXT: psubw %xmm7, %xmm6
; SSE2-NEXT: psubw %xmm7, %xmm3
; SSE2-NEXT: psrlw $1, %xmm3
; SSE2-NEXT: psrlw $1, %xmm6
; SSE2-NEXT: psrlw $1, %xmm2
; SSE2-NEXT: psrlw $1, %xmm5
; SSE2-NEXT: psrlw $1, %xmm1
; SSE2-NEXT: psrlw $1, %xmm4
; SSE2-NEXT: psrlw $1, %xmm0
; SSE2-NEXT: psrlw $1, %xmm8
; SSE2-NEXT: movdqa {{.*#+}} xmm7 = [255,255,255,255,255,255,255,255]
; SSE2-NEXT: pand %xmm7, %xmm8
; SSE2-NEXT: pand %xmm7, %xmm0
; SSE2-NEXT: packuswb %xmm8, %xmm0
; SSE2-NEXT: pand %xmm7, %xmm4
; SSE2-NEXT: pand %xmm7, %xmm1
; SSE2-NEXT: packuswb %xmm4, %xmm1
; SSE2-NEXT: pand %xmm7, %xmm5
; SSE2-NEXT: pand %xmm7, %xmm2
; SSE2-NEXT: packuswb %xmm5, %xmm2
; SSE2-NEXT: pand %xmm7, %xmm6
; SSE2-NEXT: pand %xmm7, %xmm3
; SSE2-NEXT: packuswb %xmm6, %xmm3
; SSE2-NEXT: retq
;
; AVX1-LABEL: avg_v64i8_3:
; AVX1: # BB#0:
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm4
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm5 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero,xmm4[4],zero,xmm4[5],zero,xmm4[6],zero,xmm4[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm4 = xmm4[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero,xmm4[4],zero,xmm4[5],zero,xmm4[6],zero,xmm4[7],zero
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm6 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm7
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm8 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero,xmm7[4],zero,xmm7[5],zero,xmm7[6],zero,xmm7[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm7 = xmm7[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm11 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero,xmm7[4],zero,xmm7[5],zero,xmm7[6],zero,xmm7[7],zero
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm9 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm10 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm1
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm7 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vpaddw %xmm7, %xmm5, %xmm12
; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
; AVX1-NEXT: vpaddw %xmm1, %xmm4, %xmm13
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm4 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpaddw %xmm4, %xmm6, %xmm14
; AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpaddw %xmm2, %xmm0, %xmm15
; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm2
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm6 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpaddw %xmm6, %xmm8, %xmm6
; AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
; AVX1-NEXT: vpaddw %xmm2, %xmm11, %xmm2
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm7 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
; AVX1-NEXT: vpaddw %xmm7, %xmm9, %xmm7
; AVX1-NEXT: vpshufd {{.*#+}} xmm3 = xmm3[2,3,0,1]
; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
; AVX1-NEXT: vpaddw %xmm3, %xmm10, %xmm3
; AVX1-NEXT: vpcmpeqd %xmm5, %xmm5, %xmm5
; AVX1-NEXT: vpsubw %xmm5, %xmm12, %xmm8
; AVX1-NEXT: vpsubw %xmm5, %xmm13, %xmm4
; AVX1-NEXT: vpsubw %xmm5, %xmm14, %xmm0
; AVX1-NEXT: vpsubw %xmm5, %xmm15, %xmm1
; AVX1-NEXT: vpsubw %xmm5, %xmm6, %xmm6
; AVX1-NEXT: vpsubw %xmm5, %xmm2, %xmm2
; AVX1-NEXT: vpsubw %xmm5, %xmm7, %xmm7
; AVX1-NEXT: vpsubw %xmm5, %xmm3, %xmm3
; AVX1-NEXT: vpsrlw $1, %xmm3, %xmm9
; AVX1-NEXT: vpsrlw $1, %xmm7, %xmm5
; AVX1-NEXT: vpsrlw $1, %xmm2, %xmm2
; AVX1-NEXT: vpsrlw $1, %xmm6, %xmm6
; AVX1-NEXT: vpsrlw $1, %xmm1, %xmm1
; AVX1-NEXT: vpsrlw $1, %xmm0, %xmm0
; AVX1-NEXT: vpsrlw $1, %xmm4, %xmm4
; AVX1-NEXT: vpsrlw $1, %xmm8, %xmm7
; AVX1-NEXT: vmovdqa {{.*#+}} xmm3 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX1-NEXT: vpshufb %xmm3, %xmm7, %xmm7
; AVX1-NEXT: vpshufb %xmm3, %xmm4, %xmm4
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm4 = xmm7[0],xmm4[0]
; AVX1-NEXT: vpshufb %xmm3, %xmm0, %xmm0
; AVX1-NEXT: vpshufb %xmm3, %xmm1, %xmm1
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm0, %ymm0
; AVX1-NEXT: vpshufb %xmm3, %xmm6, %xmm1
; AVX1-NEXT: vpshufb %xmm3, %xmm2, %xmm2
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
; AVX1-NEXT: vpshufb %xmm3, %xmm5, %xmm2
; AVX1-NEXT: vpshufb %xmm3, %xmm9, %xmm3
; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; AVX1-NEXT: retq
;
; AVX2-LABEL: avg_v64i8_3:
; AVX2: # BB#0:
; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm4
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm4 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero,xmm4[4],zero,xmm4[5],zero,xmm4[6],zero,xmm4[7],zero,xmm4[8],zero,xmm4[9],zero,xmm4[10],zero,xmm4[11],zero,xmm4[12],zero,xmm4[13],zero,xmm4[14],zero,xmm4[15],zero
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm5
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm5 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero,xmm5[4],zero,xmm5[5],zero,xmm5[6],zero,xmm5[7],zero,xmm5[8],zero,xmm5[9],zero,xmm5[10],zero,xmm5[11],zero,xmm5[12],zero,xmm5[13],zero,xmm5[14],zero,xmm5[15],zero
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm6
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero,xmm6[4],zero,xmm6[5],zero,xmm6[6],zero,xmm6[7],zero,xmm6[8],zero,xmm6[9],zero,xmm6[10],zero,xmm6[11],zero,xmm6[12],zero,xmm6[13],zero,xmm6[14],zero,xmm6[15],zero
; AVX2-NEXT: vpaddw %ymm6, %ymm4, %ymm4
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero,xmm2[8],zero,xmm2[9],zero,xmm2[10],zero,xmm2[11],zero,xmm2[12],zero,xmm2[13],zero,xmm2[14],zero,xmm2[15],zero
; AVX2-NEXT: vpaddw %ymm2, %ymm0, %ymm0
; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm2
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero,xmm2[8],zero,xmm2[9],zero,xmm2[10],zero,xmm2[11],zero,xmm2[12],zero,xmm2[13],zero,xmm2[14],zero,xmm2[15],zero
; AVX2-NEXT: vpaddw %ymm2, %ymm5, %ymm2
; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero,xmm3[8],zero,xmm3[9],zero,xmm3[10],zero,xmm3[11],zero,xmm3[12],zero,xmm3[13],zero,xmm3[14],zero,xmm3[15],zero
; AVX2-NEXT: vpaddw %ymm3, %ymm1, %ymm1
; AVX2-NEXT: vpcmpeqd %ymm3, %ymm3, %ymm3
; AVX2-NEXT: vpsubw %ymm3, %ymm4, %ymm4
; AVX2-NEXT: vpsubw %ymm3, %ymm0, %ymm0
; AVX2-NEXT: vpsubw %ymm3, %ymm2, %ymm2
; AVX2-NEXT: vpsubw %ymm3, %ymm1, %ymm1
; AVX2-NEXT: vpsrlw $1, %ymm1, %ymm1
; AVX2-NEXT: vpsrlw $1, %ymm2, %ymm2
; AVX2-NEXT: vpsrlw $1, %ymm0, %ymm0
; AVX2-NEXT: vpsrlw $1, %ymm4, %ymm3
; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm4
; AVX2-NEXT: vmovdqa {{.*#+}} xmm5 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
; AVX2-NEXT: vpshufb %xmm5, %xmm4, %xmm4
; AVX2-NEXT: vpshufb %xmm5, %xmm3, %xmm3
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm3 = xmm3[0],xmm4[0]
; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm4
; AVX2-NEXT: vpshufb %xmm5, %xmm4, %xmm4
; AVX2-NEXT: vpshufb %xmm5, %xmm0, %xmm0
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm4[0]
; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm0
; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm3
; AVX2-NEXT: vpshufb %xmm5, %xmm3, %xmm3
; AVX2-NEXT: vpshufb %xmm5, %xmm2, %xmm2
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm3
; AVX2-NEXT: vpshufb %xmm5, %xmm3, %xmm3
; AVX2-NEXT: vpshufb %xmm5, %xmm1, %xmm1
; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm3[0]
; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
; AVX2-NEXT: retq
;
; AVX512F-LABEL: avg_v64i8_3:
; AVX512F: # BB#0:
; AVX512F-NEXT: vextracti128 $1, %ymm1, %xmm4
; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm5
; AVX512F-NEXT: vextracti128 $1, %ymm3, %xmm6
; AVX512F-NEXT: vpavgb %xmm6, %xmm4, %xmm4
; AVX512F-NEXT: vextracti128 $1, %ymm2, %xmm6
; AVX512F-NEXT: vpavgb %xmm6, %xmm5, %xmm5
; AVX512F-NEXT: vpavgb %xmm2, %xmm0, %xmm0
; AVX512F-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm0
; AVX512F-NEXT: vpavgb %xmm3, %xmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm4, %ymm1, %ymm1
; AVX512F-NEXT: retq
;
; AVX512BW-LABEL: avg_v64i8_3:
; AVX512BW: # BB#0:
; AVX512BW-NEXT: vpavgb %zmm1, %zmm0, %zmm0
; AVX512BW-NEXT: retq
%za = zext <64 x i8> %a to <64 x i16>
%zb = zext <64 x i8> %b to <64 x i16>
%add = add nuw nsw <64 x i16> %za, %zb
%add1 = add nuw nsw <64 x i16> %add, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%lshr = lshr <64 x i16> %add1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%res = trunc <64 x i16> %lshr to <64 x i8>
ret <64 x i8> %res
}
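
; For <64 x i8> only the AVX512BW run matches the pattern at full width with a
; single zmm vpavgb; AVX512F matches it on 128-bit halves with four xmm vpavgb
; ops, and the SSE2, AVX1 and AVX2 runs fall back to the widened i16 expansion
; shown above.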