#include <linux/linkage.h>
#include <asm/cpufeatures.h>
#include <asm/alternative-asm.h>
#include <asm/export.h>

/*
x86/lib/clear_page_64.S: Convert to ALTERNATIVE_2 macro
Move clear_page() up so that we can get 2-byte forward JMPs when
patching:
apply_alternatives: feat: 3*32+16, old: (ffffffff8130adb0, len: 5), repl: (ffffffff81d0b859, len: 5)
ffffffff8130adb0: alt_insn: 90 90 90 90 90
recompute_jump: new_displ: 0x0000003e
ffffffff81d0b859: rpl_insn: eb 3e 66 66 90
even though the compiler generated 5-byte JMPs which we padded with 5
NOPs.
Also, make the REP_GOOD version be the default as the majority of
machines set REP_GOOD. This way we get to save ourselves the JMP:
old insn VA: 0xffffffff813038b0, CPU feat: X86_FEATURE_REP_GOOD, size: 5, padlen: 0
clear_page:
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 0b 00 00 00 jmpq ffffffff813038c0
repl insn: 0xffffffff81cf0e92, size: 0
old insn VA: 0xffffffff813038b0, CPU feat: X86_FEATURE_ERMS, size: 5, padlen: 0
clear_page:
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 0b 00 00 00 jmpq ffffffff813038c0
repl insn: 0xffffffff81cf0e92, size: 5
ffffffff81cf0e92: e9 69 2a 61 ff jmpq ffffffff81303900
ffffffff813038b0 <clear_page>:
ffffffff813038b0: e9 69 2a 61 ff jmpq ffffffff8091631e
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-01-18 19:57:41 +08:00
 * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
 * recommended to use them when possible, and we do so by default.
 * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
 * Otherwise, use original.
 */
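
/*
 * Illustrative sketch only, not code from this file: the selection order
 * described above, written as plain run-time checks. In practice the
 * kernel is expected to patch the call site at boot through the
 * alternatives mechanism instead of branching like this; the checks
 * below are an assumption made for readability.
 *
 *	if (boot_cpu_has(X86_FEATURE_ERMS))
 *		clear_page_erms(page);
 *	else if (boot_cpu_has(X86_FEATURE_REP_GOOD))
 *		clear_page_rep(page);
 *	else
 *		clear_page_orig(page);
 */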

/*
 * Zero a page.
 * %rdi	- page
 */
x86/asm: Optimize clear_page()
Currently, we CALL clear_page() which then JMPs to the proper function
chosen by the alternatives.
What we should do instead is CALL the proper function directly. (This
was something Ingo suggested a while ago). So let's do that.
Measuring our favourite kernel build workload shows that there are no
significant changes in performance.
AMD
===
-- /tmp/before 2017-02-09 18:01:46.451961188 +0100
++ /tmp/after 2017-02-09 18:01:54.883961175 +0100
@@ -1,15 +1,15 @@
Performance counter stats for 'system wide' (5 runs):
- 1028960.373643 cpu-clock (msec) # 6.000 CPUs utilized ( +- 1.41% )
+ 1023086.018961 cpu-clock (msec) # 6.000 CPUs utilized ( +- 1.20% )
- 518,744 context-switches # 0.504 K/sec ( +- 1.04% )
+ 518,254 context-switches # 0.507 K/sec ( +- 1.01% )
- 38,112 cpu-migrations # 0.037 K/sec ( +- 1.95% )
+ 37,917 cpu-migrations # 0.037 K/sec ( +- 1.02% )
- 20,874,266 page-faults # 0.020 M/sec ( +- 0.07% )
+ 20,918,897 page-faults # 0.020 M/sec ( +- 0.18% )
- 2,043,646,230,667 cycles # 1.986 GHz ( +- 0.14% ) (66.67%)
+ 2,045,305,584,032 cycles # 1.999 GHz ( +- 0.16% ) (66.67%)
- 553,698,855,431 stalled-cycles-frontend # 27.09% frontend cycles idle ( +- 0.07% ) (66.67%)
+ 555,099,401,413 stalled-cycles-frontend # 27.14% frontend cycles idle ( +- 0.13% ) (66.67%)
- 621,544,286,390 stalled-cycles-backend # 30.41% backend cycles idle ( +- 0.39% ) (66.67%)
+ 621,371,430,254 stalled-cycles-backend # 30.38% backend cycles idle ( +- 0.32% ) (66.67%)
- 1,738,364,431,659 instructions # 0.85 insn per cycle
+ 1,739,895,771,901 instructions # 0.85 insn per cycle
- # 0.36 stalled cycles per insn ( +- 0.11% ) (66.67%)
+ # 0.36 stalled cycles per insn ( +- 0.13% ) (66.67%)
- 391,170,943,850 branches # 380.161 M/sec ( +- 0.13% ) (66.67%)
+ 391,398,551,757 branches # 382.567 M/sec ( +- 0.13% ) (66.67%)
- 22,567,810,411 branch-misses # 5.77% of all branches ( +- 0.11% ) (66.67%)
+ 22,574,726,683 branch-misses # 5.77% of all branches ( +- 0.13% ) (66.67%)
- 171.480741921 seconds time elapsed ( +- 1.41% )
+ 170.509229451 seconds time elapsed ( +- 1.20% )
Intel
=====
-- /tmp/before 2017-02-09 20:36:19.851947473 +0100
++ /tmp/after 2017-02-09 20:36:30.151947458 +0100
@@ -1,15 +1,15 @@
Performance counter stats for 'system wide' (5 runs):
- 2207248.598126 cpu-clock (msec) # 8.000 CPUs utilized ( +- 0.69% )
+ 2213300.106631 cpu-clock (msec) # 8.000 CPUs utilized ( +- 0.73% )
- 899,342 context-switches # 0.407 K/sec ( +- 0.68% )
+ 898,381 context-switches # 0.406 K/sec ( +- 0.79% )
- 80,553 cpu-migrations # 0.036 K/sec ( +- 1.13% )
+ 80,979 cpu-migrations # 0.037 K/sec ( +- 1.11% )
- 36,171,148 page-faults # 0.016 M/sec ( +- 0.02% )
+ 36,179,791 page-faults # 0.016 M/sec ( +- 0.02% )
- 6,665,288,826,484 cycles # 3.020 GHz ( +- 0.07% ) (83.33%)
+ 6,671,638,410,799 cycles # 3.014 GHz ( +- 0.06% ) (83.33%)
- 5,065,975,115,197 stalled-cycles-frontend # 76.01% frontend cycles idle ( +- 0.11% ) (83.33%)
+ 5,076,835,183,223 stalled-cycles-frontend # 76.10% frontend cycles idle ( +- 0.11% ) (83.33%)
- 3,841,556,350,614 stalled-cycles-backend # 57.64% backend cycles idle ( +- 0.13% ) (66.67%)
+ 3,852,823,974,333 stalled-cycles-backend # 57.75% backend cycles idle ( +- 0.12% ) (66.67%)
- 4,148,398,171,079 instructions # 0.62 insn per cycle
+ 4,148,997,156,059 instructions # 0.62 insn per cycle
- # 1.22 stalled cycles per insn ( +- 0.10% ) (83.33%)
+ # 1.22 stalled cycles per insn ( +- 0.11% ) (83.33%)
- 887,187,118,591 branches # 401.943 M/sec ( +- 0.09% ) (83.33%)
+ 887,271,341,121 branches # 400.882 M/sec ( +- 0.11% ) (83.33%)
- 30,139,439,034 branch-misses # 3.40% of all branches ( +- 0.09% ) (83.33%)
+ 30,134,864,997 branch-misses # 3.40% of all branches ( +- 0.06% ) (83.33%)
- 275.904405540 seconds time elapsed ( +- 0.69% )
+ 276.660352016 seconds time elapsed ( +- 0.73% )
allmodconfig vmlinux size grows by a ~1Kb but that's fine - we optimize
our calling of the clear_page variants.
text data bss dec hex filename
9051979 23067670 27009024 59128673 3863b61 vmlinux
9053000 23067670 27009024 59129694 3863f5e vmlinux.clear_page
Reported-by: kernel test robot <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170215111927.emdgxf2pide3kwro@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-09 08:34:49 +08:00

ENTRY(clear_page_rep)
	movl $4096/8,%ecx	/* 512 quadwords = one 4096-byte page */
	xorl %eax,%eax		/* fill value: zero */
	rep stosq
	ret
ENDPROC(clear_page_rep)
EXPORT_SYMBOL_GPL(clear_page_rep)

ENTRY(clear_page_orig)
	xorl   %eax,%eax
	movl   $4096/64,%ecx	/* 64 iterations, 64 bytes each */
	.p2align 4
.Lloop:
	decl	%ecx
#define PUT(x) movq %rax,x*8(%rdi)
	movq %rax,(%rdi)
	PUT(1)
	PUT(2)
	PUT(3)
	PUT(4)
	PUT(5)
	PUT(6)
	PUT(7)
	leaq	64(%rdi),%rdi	/* LEA leaves the flags set by DECL intact */
	jnz	.Lloop
	nop
	ret
ENDPROC(clear_page_orig)
EXPORT_SYMBOL_GPL(clear_page_orig)

ENTRY(clear_page_erms)
	movl $4096,%ecx		/* byte count: one 4096-byte page */
	xorl %eax,%eax		/* fill value: zero */
	rep stosb
	ret
ENDPROC(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)