; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown -mcpu=bdver1 | FileCheck %s
; SHLD/SHRD are VectorPath (microcoded) instructions with poor latency on certain
; AMD architectures. Generating SHLD/SHRD is acceptable when optimizing for size,
; but when optimizing for speed on these platforms the same pattern should instead
; be lowered to a sequence of DirectPath instructions (add, adc, shr, shl, or,
; lea). Besides having lower latency, DirectPath instructions also improve decode
; bandwidth, since a third DirectPath instruction can be decoded in the same cycle.
; The AMD K7, K8, K10, K12, K15 and K16 processor families are known to have very
; slow SHLD/SHRD, and their optimization guides recommend alternative instruction
; sequences. For these processors, folding (x << c) | (y >> (64 - c)) into
; SHLD/SHRD is disabled unless we are optimizing for size.
; Disabling this folding might also benefit some Intel processors, but since no
; specific recommendation against SHLD/SHRD on Intel was found, the peephole
; remains enabled there.
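; As an illustration only (a sketch that mirrors the CHECK lines further down in
; this file), the funnel shift (a << 12) | (b >> 52) can be lowered either way on
; x86-64:
;
;   movq  %rdi, %rax           # size-optimized form (VectorPath on the CPUs above)
;   shldq $12, %rsi, %rax
;
;   shlq  $12, %rdi            # speed-optimized DirectPath alternative for bdver1
;   shrq  $52, %rsi
;   leaq  (%rsi,%rdi), %rax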
; clang -Oz -c test1.cpp -emit-llvm -S -o
; Verify that we generate the shld instruction when we are optimizing for size,
; even for X86_64 processors that are known to have poor latency double
; precision shift instructions.
; uint64_t lshift10(uint64_t a, uint64_t b)
; {
; return (a << 10) | (b >> 54);
; }
; Function Attrs: minsize nounwind readnone uwtable
define i64 @_Z8lshift10mm(i64 %a, i64 %b) #0 {
; CHECK-LABEL: _Z8lshift10mm:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: shldq $10, %rsi, %rax
; CHECK-NEXT: retq
entry:
%shl = shl i64 %a, 10
%shr = lshr i64 %b, 54
%or = or i64 %shr, %shl
ret i64 %or
}
attributes #0 = { minsize nounwind readnone uwtable "less-precise-fpmad"="false" "frame-pointer"="none" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
; clang -Os -c test2.cpp -emit-llvm -S
; Verify that we generate the shld instruction when we are optimizing for size,
; even for X86_64 processors that are known to have poor latency double
; precision shift instructions.
; uint64_t lshift11(uint64_t a, uint64_t b)
; {
; return (a << 11) | (b >> 53);
; }
; Function Attrs: nounwind optsize readnone uwtable
define i64 @_Z8lshift11mm(i64 %a, i64 %b) #1 {
; CHECK-LABEL: _Z8lshift11mm:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: shldq $11, %rsi, %rax
; CHECK-NEXT: retq
entry:
%shl = shl i64 %a, 11
%shr = lshr i64 %b, 53
%or = or i64 %shr, %shl
ret i64 %or
}
; PGSO: size optimizations in codegen / target passes are also enabled for code
; that profile data marks as cold (https://reviews.llvm.org/D71288), so this
; profile-cold copy of lshift11 should still use shld.
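; Note: the function's !prof !14 entry count of 0, evaluated against the
; ProfileSummary module metadata at the end of this file, is what classifies it
; as cold for PGSO.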
define i64 @_Z8lshift11mm_pgso(i64 %a, i64 %b) !prof !14 {
; CHECK-LABEL: _Z8lshift11mm_pgso:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: shldq $11, %rsi, %rax
; CHECK-NEXT: retq
entry:
%shl = shl i64 %a, 11
%shr = lshr i64 %b, 53
%or = or i64 %shr, %shl
ret i64 %or
}
attributes #1 = { nounwind optsize readnone uwtable "less-precise-fpmad"="false" "frame-pointer"="none" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
; clang -O2 -c test2.cpp -emit-llvm -S
; Verify that we do not generate the shld instruction when we are not optimizing
; for size for X86_64 processors that are known to have poor latency double
; precision shift instructions.
; uint64_t lshift12(uint64_t a, uint64_t b)
; {
; return (a << 12) | (b >> 52);
; }
; Function Attrs: nounwind readnone uwtable
define i64 @_Z8lshift12mm(i64 %a, i64 %b) #2 {
; CHECK-LABEL: _Z8lshift12mm:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: shlq $12, %rdi
; CHECK-NEXT: shrq $52, %rsi
; CHECK-NEXT: leaq (%rsi,%rdi), %rax
; CHECK-NEXT: retq
entry:
%shl = shl i64 %a, 12
%shr = lshr i64 %b, 52
%or = or i64 %shr, %shl
ret i64 %or
}
attributes #2 = { nounwind readnone uwtable "less-precise-fpmad"="false" "frame-pointer"="none" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.module.flags = !{!0}
!0 = !{i32 1, !"ProfileSummary", !1}
!1 = !{!2, !3, !4, !5, !6, !7, !8, !9}
!2 = !{!"ProfileFormat", !"InstrProf"}
!3 = !{!"TotalCount", i64 10000}
!4 = !{!"MaxCount", i64 10}
!5 = !{!"MaxInternalCount", i64 1}
!6 = !{!"MaxFunctionCount", i64 1000}
!7 = !{!"NumCounts", i64 3}
!8 = !{!"NumFunctions", i64 3}
!9 = !{!"DetailedSummary", !10}
!10 = !{!11, !12, !13}
!11 = !{i32 10000, i64 100, i32 1}
!12 = !{i32 999000, i64 100, i32 1}
!13 = !{i32 999999, i64 1, i32 2}
!14 = !{!"function_entry_count", i64 0}