2017-06-26 23:53:11 +08:00
|
|
|
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
|
|
|
|
; RUN: llc < %s -mtriple=i686-unknown-unknown | FileCheck %s --check-prefix=X86
|
|
|
|
; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefix=X64
|
|
|
|
|
|
|
|
@a = external global i32
|
|
|
|
@b = external global i32
|
|
|
|
@c = external global i32
|
|
|
|
|
|
|
|
define i32 @fn1(i32, i32) {
|
|
|
|
; X86-LABEL: fn1:
|
2017-12-05 01:18:51 +08:00
|
|
|
; X86: # %bb.0:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
|
|
|
|
; X86-NEXT: testl %eax, %eax
|
|
|
|
; X86-NEXT: je .LBB0_2
|
2017-12-05 01:18:51 +08:00
|
|
|
; X86-NEXT: # %bb.1:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
|
|
|
|
; X86-NEXT: .LBB0_2:
|
|
|
|
; X86-NEXT: retl
|
|
|
|
;
|
|
|
|
; X64-LABEL: fn1:
|
2017-12-05 01:18:51 +08:00
|
|
|
; X64: # %bb.0:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X64-NEXT: movl %edi, %eax
|
2018-09-20 02:59:08 +08:00
|
|
|
; X64-NEXT: testl %esi, %esi
|
|
|
|
; X64-NEXT: cmovel %esi, %eax
|
2017-06-26 23:53:11 +08:00
|
|
|
; X64-NEXT: retq
|
|
|
|
%3 = icmp ne i32 %1, 0
|
|
|
|
%4 = select i1 %3, i32 %0, i32 0
|
|
|
|
ret i32 %4
|
|
|
|
}
|
|
|
|
|
|
|
|
define void @fn2() {
|
|
|
|
; X86-LABEL: fn2:
|
2017-12-05 01:18:51 +08:00
|
|
|
; X86: # %bb.0:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X86-NEXT: movl b, %eax
|
|
|
|
; X86-NEXT: decl a
|
|
|
|
; X86-NEXT: jne .LBB1_2
|
2017-12-05 01:18:51 +08:00
|
|
|
; X86-NEXT: # %bb.1:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X86-NEXT: xorl %eax, %eax
|
|
|
|
; X86-NEXT: .LBB1_2:
|
|
|
|
; X86-NEXT: movl %eax, c
|
|
|
|
; X86-NEXT: retl
|
|
|
|
;
|
|
|
|
; X64-LABEL: fn2:
|
2017-12-05 01:18:51 +08:00
|
|
|
; X64: # %bb.0:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X64-NEXT: xorl %eax, %eax
|
|
|
|
; X64-NEXT: decl {{.*}}(%rip)
|
[x86] Teach the cmov converter to aggressively convert cmovs with memory
operands into control flow.
We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.
Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.
Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles Throughput Bottleneck: Port1
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | | 1.0 | | | | | CP | mov rcx, rdi
| 0* | | | | | | | | xor edi, edi
| 2^ | 0.1 | 0.6 | 0.5 0.5 | 0.5 0.5 | | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rax, qword ptr [rsi]
| 3 | 1.8 | 0.6 | | | | 0.6 | CP | cmovbe rax, rdi
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | cmp byte ptr [rcx+0xf], 0x10
| 0F | | | | | | | | jb 0xf
Total Num Of Uops: 9
```
After, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles Throughput Bottleneck: Port5
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 0.5 | 0.5 | | | | | | mov rax, rdi
| 0* | | | | | | | | xor edi, edi
| 2^ | 0.5 | 0.5 | 1.0 1.0 | | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | 0.5 | 0.5 | | | | | | mov ecx, 0x0
| 1 | | | | | | 1.0 | CP | jnbe 0x39
| 2^ | | | | 1.0 1.0 | | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
| 0F | | | | | | | | jnb 0x3c
Total Num Of Uops: 7
```
The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles Throughput Bottleneck: FrontEnd
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 0* | | | | | | | | | | mov rcx, rdi
| 0* | | | | | | | | | | xor edi, edi
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov rax, qword ptr [rsi]
| 3 | 1.0 | 1.0 | | | | | 1.0 | | | cmovbe rax, rdi
| 2^ | 0.5 | | 0.5 0.5 | 0.5 0.5 | | | 0.5 | | | cmp byte ptr [rcx+0xf], 0x10
| 0F | | | | | | | | | | jb 0xf
Total Num Of Uops: 8
```
After, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles Throughput Bottleneck: FrontEnd
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 0* | | | | | | | | | | mov rax, rdi
| 0* | | | | | | | | | | xor edi, edi
| 2^ | | | 1.0 1.0 | | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | | 1.0 | | | | | | | | mov ecx, 0x0
| 1 | | | | | | | 1.0 | | | jnbe 0x39
| 2^ | 1.0 | | | 1.0 1.0 | | | | | | cmp byte ptr [rax+0xf], 0x10
| 0F | | | | | | | | | | jnb 0x3c
Total Num Of Uops: 6
```
Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.
I have run a suite of internal benchmarks with this change. I saw a few
very significant improvements and a very few minor regressions,
but overall this change rarely has a significant effect. However, the
improvements were very significant, and in quite important routines
responsible for a great deal of our C++ CPU cycles. The gains pretty
clealy outweigh the regressions for us.
I also ran the test-suite and SPEC2006. Only 11 binaries changed at all
and none of them showed any regressions.
Amjad Aboud at Intel also ran this over their benchmarks and saw no
regressions.
Differential Revision: https://reviews.llvm.org/D36858
llvm-svn: 311226
2017-08-19 13:01:19 +08:00
|
|
|
; X64-NEXT: je .LBB1_2
|
2017-12-05 01:18:51 +08:00
|
|
|
; X64-NEXT: # %bb.1:
|
[x86] Teach the cmov converter to aggressively convert cmovs with memory
operands into control flow.
We have seen periodically performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this tends to be much better done
with a branch than cmov. We routinely see cmov stalling while the load
is completed rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.
Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.
Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this, and notice the
micro-op reduction provided.
Before, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.20 Cycles Throughput Bottleneck: Port1
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | | 1.0 | | | | | CP | mov rcx, rdi
| 0* | | | | | | | | xor edi, edi
| 2^ | 0.1 | 0.6 | 0.5 0.5 | 0.5 0.5 | | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rax, qword ptr [rsi]
| 3 | 1.8 | 0.6 | | | | 0.6 | CP | cmovbe rax, rdi
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | cmp byte ptr [rcx+0xf], 0x10
| 0F | | | | | | | | jb 0xf
Total Num Of Uops: 9
```
After, SNB:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles Throughput Bottleneck: Port5
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 0.5 | 0.5 | | | | | | mov rax, rdi
| 0* | | | | | | | | xor edi, edi
| 2^ | 0.5 | 0.5 | 1.0 1.0 | | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | 0.5 | 0.5 | | | | | | mov ecx, 0x0
| 1 | | | | | | 1.0 | CP | jnbe 0x39
| 2^ | | | | 1.0 1.0 | | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
| 0F | | | | | | | | jnb 0x3c
Total Num Of Uops: 7
```
The difference even manifests in a throughput cycle rate difference on Haswell.
Before, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles Throughput Bottleneck: FrontEnd
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 0* | | | | | | | | | | mov rcx, rdi
| 0* | | | | | | | | | | xor edi, edi
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | mov rax, qword ptr [rsi]
| 3 | 1.0 | 1.0 | | | | | 1.0 | | | cmovbe rax, rdi
| 2^ | 0.5 | | 0.5 0.5 | 0.5 0.5 | | | 0.5 | | | cmp byte ptr [rcx+0xf], 0x10
| 0F | | | | | | | | | | jb 0xf
Total Num Of Uops: 8
```
After, HSW:
```
Throughput Analysis Report
--------------------------
Block Throughput: 1.50 Cycles Throughput Bottleneck: FrontEnd
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 0* | | | | | | | | | | mov rax, rdi
| 0* | | | | | | | | | | xor edi, edi
| 2^ | | | 1.0 1.0 | | | 1.0 | | | | cmp byte ptr [rsi+0xf], 0xf
| 1 | | 1.0 | | | | | | | | mov ecx, 0x0
| 1 | | | | | | | 1.0 | | | jnbe 0x39
| 2^ | 1.0 | | | 1.0 1.0 | | | | | | cmp byte ptr [rax+0xf], 0x10
| 0F | | | | | | | | | | jnb 0x3c
Total Num Of Uops: 6
```
Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.
I have run a suite of internal benchmarks with this change. I saw a few
very significant improvements and a very few minor regressions,
but overall this change rarely has a significant effect. However, the
improvements were very significant, and in quite important routines
responsible for a great deal of our C++ CPU cycles. The gains pretty
clealy outweigh the regressions for us.
I also ran the test-suite and SPEC2006. Only 11 binaries changed at all
and none of them showed any regressions.
Amjad Aboud at Intel also ran this over their benchmarks and saw no
regressions.
Differential Revision: https://reviews.llvm.org/D36858
llvm-svn: 311226
2017-08-19 13:01:19 +08:00
|
|
|
; X64-NEXT: movl {{.*}}(%rip), %eax
|
|
|
|
; X64-NEXT: .LBB1_2:
|
2017-06-26 23:53:11 +08:00
|
|
|
; X64-NEXT: movl %eax, {{.*}}(%rip)
|
|
|
|
; X64-NEXT: retq
|
|
|
|
%1 = load volatile i32, i32* @b, align 4
|
|
|
|
%2 = load i32, i32* @a, align 4
|
|
|
|
%3 = add nsw i32 %2, -1
|
|
|
|
store i32 %3, i32* @a, align 4
|
|
|
|
%4 = icmp ne i32 %3, 0
|
|
|
|
%5 = select i1 %4, i32 %1, i32 0
|
|
|
|
store i32 %5, i32* @c, align 4
|
|
|
|
ret void
|
|
|
|
}
|
|
|
|
|