License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2008-10-23 13:26:29 +08:00
|
|
|
#ifndef _ASM_X86_BITOPS_H
|
|
|
|
#define _ASM_X86_BITOPS_H
|
2008-01-30 20:30:55 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Copyright 1992, Linus Torvalds.
|
2009-01-13 06:01:15 +08:00
|
|
|
*
|
|
|
|
* Note: inlines with more than a single statement should be marked
|
|
|
|
* __always_inline to avoid problems with older gcc's inlining heuristics.
|
2008-01-30 20:30:55 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef _LINUX_BITOPS_H
|
|
|
|
#error only <linux/bitops.h> can be included directly
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#include <linux/compiler.h>
|
|
|
|
#include <asm/alternative.h>
|
2013-09-11 21:19:24 +08:00
|
|
|
#include <asm/rmwcc.h>
|
2014-03-14 02:00:35 +08:00
|
|
|
#include <asm/barrier.h>
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2013-07-17 06:20:14 +08:00
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
# define _BITOPS_LONG_SHIFT 5
|
|
|
|
#elif BITS_PER_LONG == 64
|
|
|
|
# define _BITOPS_LONG_SHIFT 6
|
|
|
|
#else
|
|
|
|
# error "Unexpected BITS_PER_LONG"
|
|
|
|
#endif
|
|
|
|
|
2012-05-22 18:53:45 +08:00
|
|
|
#define BIT_64(n) (U64_C(1) << (n))
|
|
|
|
|
2008-01-30 20:30:55 +08:00
|
|
|
/*
|
|
|
|
* These have to be done with inline assembly: that way the bit-setting
|
|
|
|
* is guaranteed to be atomic. All bit operations return 0 if the bit
|
|
|
|
* was cleared before the operation and != 0 if it was not.
|
|
|
|
*
|
|
|
|
* bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
|
|
|
|
*/
|
|
|
|
|
2019-04-02 19:28:13 +08:00
|
|
|
#define RLONG_ADDR(x) "m" (*(volatile long *) (x))
|
|
|
|
#define WBYTE_ADDR(x) "+m" (*(volatile char *) (x))
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2019-04-02 19:28:13 +08:00
|
|
|
#define ADDR RLONG_ADDR(addr)
|
x86, bitops: make constant-bit set/clear_bit ops faster
On Wed, 18 Jun 2008, Linus Torvalds wrote:
>
> And yes, the "lock andl" should be noticeably faster than the xchgl.
I dunno. Here's a untested (!!) patch that turns constant-bit
set/clear_bit ops into byte mask ops (lock orb/andb).
It's not exactly pretty. The reason for using the byte versions is that a
locked op is serialized in the memory pipeline anyway, so there are no
forwarding issues (that could slow down things when we access things with
different sizes), and the byte ops are a lot smaller than 32-bit and
particularly 64-bit ops (big constants, and the 64-bit ops need the REX
prefix byte too).
[ Side note: I wonder if we should turn the "test_bit()" C version into a
"char *" version too.. It could actually help with alias analysis, since
char pointers can alias anything. So it might be the RightThing(tm) to
do for multiple reasons. I dunno. It's a separate issue. ]
It does actually shrink the kernel image a bit (a couple of hundred bytes
on the text segment for my everything-compiled-in image), and while it's
totally untested the (admittedly few) code generation points I looked at
seemed sane. And "lock orb" should be noticeably faster than "lock bts".
If somebody wants to play with it, go wild. I didn't do "change_bit()",
because nobody sane uses that thing anyway. I guarantee nothing. And if it
breaks, nobody saw me do anything. You can't prove this email wasn't sent
by somebody who is good at forging smtp.
This does require a gcc that is recent enough for "__builtin_constant_p()"
to work in an inline function, but I suspect our kernel requirements are
already higher than that. And if you do have an old gcc that is supported,
the worst that would happen is that the optimization doesn't trigger.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 12:03:26 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We do the locked ops that don't return the old value as
|
|
|
|
* a mask operation on a byte.
|
|
|
|
*/
|
2019-04-02 19:28:13 +08:00
|
|
|
#define CONST_MASK_ADDR(nr, addr) WBYTE_ADDR((void *)(addr) + ((nr)>>3))
|
2008-06-20 13:28:24 +08:00
|
|
|
#define CONST_MASK(nr) (1 << ((nr) & 7))
|
x86, bitops: make constant-bit set/clear_bit ops faster
On Wed, 18 Jun 2008, Linus Torvalds wrote:
>
> And yes, the "lock andl" should be noticeably faster than the xchgl.
I dunno. Here's a untested (!!) patch that turns constant-bit
set/clear_bit ops into byte mask ops (lock orb/andb).
It's not exactly pretty. The reason for using the byte versions is that a
locked op is serialized in the memory pipeline anyway, so there are no
forwarding issues (that could slow down things when we access things with
different sizes), and the byte ops are a lot smaller than 32-bit and
particularly 64-bit ops (big constants, and the 64-bit ops need the REX
prefix byte too).
[ Side note: I wonder if we should turn the "test_bit()" C version into a
"char *" version too.. It could actually help with alias analysis, since
char pointers can alias anything. So it might be the RightThing(tm) to
do for multiple reasons. I dunno. It's a separate issue. ]
It does actually shrink the kernel image a bit (a couple of hundred bytes
on the text segment for my everything-compiled-in image), and while it's
totally untested the (admittedly few) code generation points I looked at
seemed sane. And "lock orb" should be noticeably faster than "lock bts".
If somebody wants to play with it, go wild. I didn't do "change_bit()",
because nobody sane uses that thing anyway. I guarantee nothing. And if it
breaks, nobody saw me do anything. You can't prove this email wasn't sent
by somebody who is good at forging smtp.
This does require a gcc that is recent enough for "__builtin_constant_p()"
to work in an inline function, but I suspect our kernel requirements are
already higher than that. And if you do have an old gcc that is supported,
the worst that would happen is that the optimization doesn't trigger.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 12:03:26 +08:00
|
|
|
|
2009-01-13 06:01:15 +08:00
|
|
|
static __always_inline void
|
2019-07-12 11:54:00 +08:00
|
|
|
arch_set_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-07-23 15:44:15 +08:00
|
|
|
if (__builtin_constant_p(nr)) {
|
2008-06-20 13:28:24 +08:00
|
|
|
asm volatile(LOCK_PREFIX "orb %1,%0"
|
|
|
|
: CONST_MASK_ADDR(nr, addr)
|
2020-03-11 06:17:46 +08:00
|
|
|
: "iq" (CONST_MASK(nr) & 0xff)
|
2008-06-20 13:28:24 +08:00
|
|
|
: "memory");
|
|
|
|
} else {
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(LOCK_PREFIX __ASM_SIZE(bts) " %1,%0"
|
2019-04-02 19:28:13 +08:00
|
|
|
: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
|
2008-06-20 13:28:24 +08:00
|
|
|
}
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch___set_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-04-02 19:28:13 +08:00
|
|
|
asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2009-01-13 06:01:15 +08:00
|
|
|
static __always_inline void
|
2019-07-12 11:54:00 +08:00
|
|
|
arch_clear_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-07-23 15:44:15 +08:00
|
|
|
if (__builtin_constant_p(nr)) {
|
2008-06-20 13:28:24 +08:00
|
|
|
asm volatile(LOCK_PREFIX "andb %1,%0"
|
|
|
|
: CONST_MASK_ADDR(nr, addr)
|
2020-03-11 06:17:46 +08:00
|
|
|
: "iq" (CONST_MASK(nr) ^ 0xff));
|
2008-06-20 13:28:24 +08:00
|
|
|
} else {
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(LOCK_PREFIX __ASM_SIZE(btr) " %1,%0"
|
2019-04-02 19:28:13 +08:00
|
|
|
: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
|
2008-06-20 13:28:24 +08:00
|
|
|
}
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch_clear_bit_unlock(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
|
|
|
barrier();
|
2019-07-12 11:54:00 +08:00
|
|
|
arch_clear_bit(nr, addr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch___clear_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-04-02 19:28:13 +08:00
|
|
|
asm volatile(__ASM_SIZE(btr) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch_clear_bit_unlock_is_negative_byte(long nr, volatile unsigned long *addr)
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
{
|
|
|
|
bool negative;
|
2017-09-06 23:18:08 +08:00
|
|
|
asm volatile(LOCK_PREFIX "andb %2,%1"
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
CC_SET(s)
|
2019-04-02 19:28:13 +08:00
|
|
|
: CC_OUT(s) (negative), WBYTE_ADDR(addr)
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
: "ir" ((char) ~(1 << nr)) : "memory");
|
|
|
|
return negative;
|
|
|
|
}
|
2019-07-12 11:54:00 +08:00
|
|
|
#define arch_clear_bit_unlock_is_negative_byte \
|
|
|
|
arch_clear_bit_unlock_is_negative_byte
|
mm: optimize PageWaiters bit use for unlock_page()
In commit 62906027091f ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.
However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.
On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd. The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.
On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result. However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.
So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too. And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.
This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.
The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.
So this introduces the new architecture primitive
clear_bit_unlock_is_negative_byte();
and adds the trivial implementation for x86. We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better. According to Nick, Power has the same hickup x86 has, for
example, but some other architectures may not even care.
All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test". After this, it's down to
0.66%, so just a quarter of the cost it used to be.
(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-28 03:40:38 +08:00
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch___clear_bit_unlock(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-07-12 11:54:00 +08:00
|
|
|
arch___clear_bit(nr, addr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch___change_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-04-02 19:28:13 +08:00
|
|
|
asm volatile(__ASM_SIZE(btc) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline void
|
|
|
|
arch_change_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-07-23 15:44:15 +08:00
|
|
|
if (__builtin_constant_p(nr)) {
|
2008-10-24 22:53:33 +08:00
|
|
|
asm volatile(LOCK_PREFIX "xorb %1,%0"
|
|
|
|
: CONST_MASK_ADDR(nr, addr)
|
|
|
|
: "iq" ((u8)CONST_MASK(nr)));
|
|
|
|
} else {
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(LOCK_PREFIX __ASM_SIZE(btc) " %1,%0"
|
2019-04-02 19:28:13 +08:00
|
|
|
: : RLONG_ADDR(addr), "Ir" (nr) : "memory");
|
2008-10-24 22:53:33 +08:00
|
|
|
}
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch_test_and_set_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
x86/asm: 'Simplify' GEN_*_RMWcc() macros
Currently the GEN_*_RMWcc() macros include a return statement, which
pretty much mandates we directly wrap them in a (inline) function.
Macros with return statements are tricky and, as per the above, limit
use, so remove the return statement and make them
statement-expressions. This allows them to be used more widely.
Also, shuffle the arguments a bit. Place the @cc argument as 3rd, this
makes it consistent between UNARY and BINARY, but more importantly, it
makes the @arg0 argument last.
Since the @arg0 argument is now last, we can do CPP trickery and make
it an optional argument, simplifying the users; 17 out of 18
occurences do not need this argument.
Finally, change to asm symbolic names, instead of the numeric ordering
of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get
cleaner code overall.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: JBeulich@suse.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: hpa@linux.intel.com
Link: https://lkml.kernel.org/r/20181003130957.108960094@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03 18:34:10 +08:00
|
|
|
return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2016-06-09 03:38:38 +08:00
|
|
|
static __always_inline bool
|
2019-07-12 11:54:00 +08:00
|
|
|
arch_test_and_set_bit_lock(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2019-07-12 11:54:00 +08:00
|
|
|
return arch_test_and_set_bit(nr, addr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch___test_and_set_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2016-06-09 03:38:38 +08:00
|
|
|
bool oldbit;
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2018-02-26 19:11:51 +08:00
|
|
|
asm(__ASM_SIZE(bts) " %2,%1"
|
2016-06-09 03:38:42 +08:00
|
|
|
CC_SET(c)
|
2019-04-02 19:28:13 +08:00
|
|
|
: CC_OUT(c) (oldbit)
|
|
|
|
: ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
return oldbit;
|
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch_test_and_clear_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
x86/asm: 'Simplify' GEN_*_RMWcc() macros
Currently the GEN_*_RMWcc() macros include a return statement, which
pretty much mandates we directly wrap them in a (inline) function.
Macros with return statements are tricky and, as per the above, limit
use, so remove the return statement and make them
statement-expressions. This allows them to be used more widely.
Also, shuffle the arguments a bit. Place the @cc argument as 3rd, this
makes it consistent between UNARY and BINARY, but more importantly, it
makes the @arg0 argument last.
Since the @arg0 argument is now last, we can do CPP trickery and make
it an optional argument, simplifying the users; 17 out of 18
occurences do not need this argument.
Finally, change to asm symbolic names, instead of the numeric ordering
of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get
cleaner code overall.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: JBeulich@suse.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: hpa@linux.intel.com
Link: https://lkml.kernel.org/r/20181003130957.108960094@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03 18:34:10 +08:00
|
|
|
return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btr), *addr, c, "Ir", nr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
/*
|
2012-06-25 00:24:42 +08:00
|
|
|
* Note: the operation is performed atomically with respect to
|
|
|
|
* the local CPU, but not other CPUs. Portable code should not
|
|
|
|
* rely on this behaviour.
|
|
|
|
* KVM relies on this behaviour on x86 for modifying memory that is also
|
|
|
|
* accessed from a hypervisor on the same CPU if running in a VM: don't change
|
|
|
|
* this without also updating arch/x86/kernel/kvm.c
|
2008-01-30 20:30:55 +08:00
|
|
|
*/
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch___test_and_clear_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2016-06-09 03:38:38 +08:00
|
|
|
bool oldbit;
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(__ASM_SIZE(btr) " %2,%1"
|
2016-06-09 03:38:42 +08:00
|
|
|
CC_SET(c)
|
2019-04-02 19:28:13 +08:00
|
|
|
: CC_OUT(c) (oldbit)
|
|
|
|
: ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
return oldbit;
|
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch___test_and_change_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2016-06-09 03:38:38 +08:00
|
|
|
bool oldbit;
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(__ASM_SIZE(btc) " %2,%1"
|
2016-06-09 03:38:42 +08:00
|
|
|
CC_SET(c)
|
2019-04-02 19:28:13 +08:00
|
|
|
: CC_OUT(c) (oldbit)
|
|
|
|
: ADDR, "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
|
|
|
|
return oldbit;
|
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
static __always_inline bool
|
|
|
|
arch_test_and_change_bit(long nr, volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
x86/asm: 'Simplify' GEN_*_RMWcc() macros
Currently the GEN_*_RMWcc() macros include a return statement, which
pretty much mandates we directly wrap them in a (inline) function.
Macros with return statements are tricky and, as per the above, limit
use, so remove the return statement and make them
statement-expressions. This allows them to be used more widely.
Also, shuffle the arguments a bit. Place the @cc argument as 3rd, this
makes it consistent between UNARY and BINARY, but more importantly, it
makes the @arg0 argument last.
Since the @arg0 argument is now last, we can do CPP trickery and make
it an optional argument, simplifying the users; 17 out of 18
occurences do not need this argument.
Finally, change to asm symbolic names, instead of the numeric ordering
of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get
cleaner code overall.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: JBeulich@suse.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: hpa@linux.intel.com
Link: https://lkml.kernel.org/r/20181003130957.108960094@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-03 18:34:10 +08:00
|
|
|
return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(btc), *addr, c, "Ir", nr);
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2016-06-09 03:38:38 +08:00
|
|
|
static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2013-07-17 06:20:14 +08:00
|
|
|
return ((1UL << (nr & (BITS_PER_LONG-1))) &
|
|
|
|
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
|
2008-01-30 20:30:55 +08:00
|
|
|
}
|
|
|
|
|
2016-06-09 03:38:38 +08:00
|
|
|
static __always_inline bool variable_test_bit(long nr, volatile const unsigned long *addr)
|
2008-01-30 20:30:55 +08:00
|
|
|
{
|
2016-06-09 03:38:38 +08:00
|
|
|
bool oldbit;
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2018-02-26 19:11:51 +08:00
|
|
|
asm volatile(__ASM_SIZE(bt) " %2,%1"
|
2016-06-09 03:38:42 +08:00
|
|
|
CC_SET(c)
|
|
|
|
: CC_OUT(c) (oldbit)
|
2019-04-02 19:28:13 +08:00
|
|
|
: "m" (*(unsigned long *)addr), "Ir" (nr) : "memory");
|
2008-01-30 20:30:55 +08:00
|
|
|
|
|
|
|
return oldbit;
|
|
|
|
}
|
|
|
|
|
2019-07-12 11:54:00 +08:00
|
|
|
#define arch_test_bit(nr, addr) \
|
2008-03-23 16:03:07 +08:00
|
|
|
(__builtin_constant_p((nr)) \
|
|
|
|
? constant_test_bit((nr), (addr)) \
|
|
|
|
: variable_test_bit((nr), (addr)))
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2008-03-15 20:04:42 +08:00
|
|
|
/**
|
|
|
|
* __ffs - find first set bit in word
|
|
|
|
* @word: The word to search
|
|
|
|
*
|
|
|
|
* Undefined if no bit exists, so code should check against 0 first.
|
|
|
|
*/
|
x86/asm/bitops: Force inlining of test_and_set_bit and friends
Sometimes GCC mysteriously doesn't inline very small functions
we expect to be inlined, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, GCC should do better, but GCC people aren't willing
to invest time into it and are asking to use __always_inline
instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
here's an example of functions getting deinlined many times:
test_and_set_bit (166 copies, ~1260 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f ab 3e lock bts %rdi,(%rsi)
72 04 jb <test_and_set_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_set_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
test_and_clear_bit (124 copies, ~1000 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
72 04 jb <test_and_clear_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_clear_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
change_bit (3 copies, 8 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f bb 3e lock btc %rdi,(%rsi)
5d pop %rbp
c3 retq
clear_bit_unlock (2 copies, 11 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
5d pop %rbp
c3 retq
This patch works it around via s/inline/__always_inline/.
Code size decrease by ~13.5k after the patch:
text data bss dec filename
92110727 20826144 36417536 149354407 vmlinux.before
92097234 20826176 36417536 149340946 vmlinux.after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Graf <tgraf@suug.ch>
Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-08 05:51:27 +08:00
|
|
|
static __always_inline unsigned long __ffs(unsigned long word)
|
2008-03-15 20:04:42 +08:00
|
|
|
{
|
2012-09-18 19:16:14 +08:00
|
|
|
asm("rep; bsf %1,%0"
|
2008-03-23 16:03:07 +08:00
|
|
|
: "=r" (word)
|
|
|
|
: "rm" (word));
|
2008-03-15 20:04:42 +08:00
|
|
|
return word;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* ffz - find first zero bit in word
|
|
|
|
* @word: The word to search
|
|
|
|
*
|
|
|
|
* Undefined if no zero exists, so code should check against ~0UL first.
|
|
|
|
*/
|
x86/asm/bitops: Force inlining of test_and_set_bit and friends
Sometimes GCC mysteriously doesn't inline very small functions
we expect to be inlined, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, GCC should do better, but GCC people aren't willing
to invest time into it and are asking to use __always_inline
instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
here's an example of functions getting deinlined many times:
test_and_set_bit (166 copies, ~1260 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f ab 3e lock bts %rdi,(%rsi)
72 04 jb <test_and_set_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_set_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
test_and_clear_bit (124 copies, ~1000 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
72 04 jb <test_and_clear_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_clear_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
change_bit (3 copies, 8 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f bb 3e lock btc %rdi,(%rsi)
5d pop %rbp
c3 retq
clear_bit_unlock (2 copies, 11 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
5d pop %rbp
c3 retq
This patch works it around via s/inline/__always_inline/.
Code size decrease by ~13.5k after the patch:
text data bss dec filename
92110727 20826144 36417536 149354407 vmlinux.before
92097234 20826176 36417536 149340946 vmlinux.after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Graf <tgraf@suug.ch>
Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-08 05:51:27 +08:00
|
|
|
static __always_inline unsigned long ffz(unsigned long word)
|
2008-03-15 20:04:42 +08:00
|
|
|
{
|
2012-09-18 19:16:14 +08:00
|
|
|
asm("rep; bsf %1,%0"
|
2008-03-23 16:03:07 +08:00
|
|
|
: "=r" (word)
|
|
|
|
: "r" (~word));
|
2008-03-15 20:04:42 +08:00
|
|
|
return word;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* __fls: find last set bit in word
|
|
|
|
* @word: The word to search
|
|
|
|
*
|
2008-07-06 01:53:46 +08:00
|
|
|
* Undefined if no set bit exists, so code should check against 0 first.
|
2008-03-15 20:04:42 +08:00
|
|
|
*/
|
x86/asm/bitops: Force inlining of test_and_set_bit and friends
Sometimes GCC mysteriously doesn't inline very small functions
we expect to be inlined, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, GCC should do better, but GCC people aren't willing
to invest time into it and are asking to use __always_inline
instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
here's an example of functions getting deinlined many times:
test_and_set_bit (166 copies, ~1260 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f ab 3e lock bts %rdi,(%rsi)
72 04 jb <test_and_set_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_set_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
test_and_clear_bit (124 copies, ~1000 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
72 04 jb <test_and_clear_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_clear_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
change_bit (3 copies, 8 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f bb 3e lock btc %rdi,(%rsi)
5d pop %rbp
c3 retq
clear_bit_unlock (2 copies, 11 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
5d pop %rbp
c3 retq
This patch works it around via s/inline/__always_inline/.
Code size decrease by ~13.5k after the patch:
text data bss dec filename
92110727 20826144 36417536 149354407 vmlinux.before
92097234 20826176 36417536 149340946 vmlinux.after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Graf <tgraf@suug.ch>
Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-08 05:51:27 +08:00
|
|
|
static __always_inline unsigned long __fls(unsigned long word)
|
2008-03-15 20:04:42 +08:00
|
|
|
{
|
2008-03-23 16:03:07 +08:00
|
|
|
asm("bsr %1,%0"
|
|
|
|
: "=r" (word)
|
|
|
|
: "rm" (word));
|
2008-03-15 20:04:42 +08:00
|
|
|
return word;
|
|
|
|
}
|
|
|
|
|
2011-12-16 06:55:53 +08:00
|
|
|
#undef ADDR
|
|
|
|
|
2008-03-15 20:04:42 +08:00
|
|
|
#ifdef __KERNEL__
|
|
|
|
/**
|
|
|
|
* ffs - find first set bit in word
|
|
|
|
* @x: the word to search
|
|
|
|
*
|
|
|
|
* This is defined the same way as the libc and compiler builtin ffs
|
|
|
|
* routines, therefore differs in spirit from the other bitops.
|
|
|
|
*
|
|
|
|
* ffs(value) returns 0 if value is 0 or the position of the first
|
|
|
|
* set bit if value is nonzero. The first (least significant) bit
|
|
|
|
* is at position 1.
|
|
|
|
*/
|
x86/asm/bitops: Force inlining of test_and_set_bit and friends
Sometimes GCC mysteriously doesn't inline very small functions
we expect to be inlined, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, GCC should do better, but GCC people aren't willing
to invest time into it and are asking to use __always_inline
instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os
here's an example of functions getting deinlined many times:
test_and_set_bit (166 copies, ~1260 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f ab 3e lock bts %rdi,(%rsi)
72 04 jb <test_and_set_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_set_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
test_and_clear_bit (124 copies, ~1000 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
72 04 jb <test_and_clear_bit+0xf>
31 c0 xor %eax,%eax
eb 05 jmp <test_and_clear_bit+0x14>
b8 01 00 00 00 mov $0x1,%eax
5d pop %rbp
c3 retq
change_bit (3 copies, 8 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f bb 3e lock btc %rdi,(%rsi)
5d pop %rbp
c3 retq
clear_bit_unlock (2 copies, 11 calls)
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 48 0f b3 3e lock btr %rdi,(%rsi)
5d pop %rbp
c3 retq
This patch works it around via s/inline/__always_inline/.
Code size decrease by ~13.5k after the patch:
text data bss dec filename
92110727 20826144 36417536 149354407 vmlinux.before
92097234 20826176 36417536 149340946 vmlinux.after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Graf <tgraf@suug.ch>
Link: http://lkml.kernel.org/r/1454881887-1367-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-08 05:51:27 +08:00
|
|
|
static __always_inline int ffs(int x)
|
2008-03-15 20:04:42 +08:00
|
|
|
{
|
|
|
|
int r;
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
/*
|
|
|
|
* AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
|
|
|
|
* dest reg is undefined if x==0, but their CPU architect says its
|
|
|
|
* value is written to set it to the same as before, except that the
|
|
|
|
* top 32 bits will be cleared.
|
|
|
|
*
|
|
|
|
* We cannot do this on 32 bits because at the very least some
|
|
|
|
* 486 CPUs did not behave this way.
|
|
|
|
*/
|
|
|
|
asm("bsfl %1,%0"
|
|
|
|
: "=r" (r)
|
2012-09-10 19:04:16 +08:00
|
|
|
: "rm" (x), "0" (-1));
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
#elif defined(CONFIG_X86_CMOV)
|
2008-03-23 16:03:07 +08:00
|
|
|
asm("bsfl %1,%0\n\t"
|
|
|
|
"cmovzl %2,%0"
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
: "=&r" (r) : "rm" (x), "r" (-1));
|
2008-03-15 20:04:42 +08:00
|
|
|
#else
|
2008-03-23 16:03:07 +08:00
|
|
|
asm("bsfl %1,%0\n\t"
|
|
|
|
"jnz 1f\n\t"
|
|
|
|
"movl $-1,%0\n"
|
|
|
|
"1:" : "=r" (r) : "rm" (x));
|
2008-03-15 20:04:42 +08:00
|
|
|
#endif
|
|
|
|
return r + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* fls - find last set bit in word
|
|
|
|
* @x: the word to search
|
|
|
|
*
|
|
|
|
* This is defined in a similar way as the libc and compiler builtin
|
|
|
|
* ffs, but returns the position of the most significant set bit.
|
|
|
|
*
|
|
|
|
* fls(value) returns 0 if value is 0 or the position of the last
|
|
|
|
* set bit if value is nonzero. The last (most significant) bit is
|
|
|
|
* at position 32.
|
|
|
|
*/
|
2019-01-04 07:26:41 +08:00
|
|
|
static __always_inline int fls(unsigned int x)
|
2008-03-15 20:04:42 +08:00
|
|
|
{
|
|
|
|
int r;
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
/*
|
|
|
|
* AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
|
|
|
|
* dest reg is undefined if x==0, but their CPU architect says its
|
|
|
|
* value is written to set it to the same as before, except that the
|
|
|
|
* top 32 bits will be cleared.
|
|
|
|
*
|
|
|
|
* We cannot do this on 32 bits because at the very least some
|
|
|
|
* 486 CPUs did not behave this way.
|
|
|
|
*/
|
|
|
|
asm("bsrl %1,%0"
|
|
|
|
: "=r" (r)
|
2012-09-10 19:04:16 +08:00
|
|
|
: "rm" (x), "0" (-1));
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
#elif defined(CONFIG_X86_CMOV)
|
2008-03-23 16:03:07 +08:00
|
|
|
asm("bsrl %1,%0\n\t"
|
|
|
|
"cmovzl %2,%0"
|
|
|
|
: "=&r" (r) : "rm" (x), "rm" (-1));
|
2008-03-15 20:04:42 +08:00
|
|
|
#else
|
2008-03-23 16:03:07 +08:00
|
|
|
asm("bsrl %1,%0\n\t"
|
|
|
|
"jnz 1f\n\t"
|
|
|
|
"movl $-1,%0\n"
|
|
|
|
"1:" : "=r" (r) : "rm" (x));
|
2008-03-15 20:04:42 +08:00
|
|
|
#endif
|
|
|
|
return r + 1;
|
|
|
|
}
|
2008-04-05 02:49:30 +08:00
|
|
|
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
/**
|
|
|
|
* fls64 - find last set bit in a 64-bit word
|
|
|
|
* @x: the word to search
|
|
|
|
*
|
|
|
|
* This is defined in a similar way as the libc and compiler builtin
|
|
|
|
* ffsll, but returns the position of the most significant set bit.
|
|
|
|
*
|
|
|
|
* fls64(value) returns 0 if value is 0 or the position of the last
|
|
|
|
* set bit if value is nonzero. The last (most significant) bit is
|
|
|
|
* at position 64.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
static __always_inline int fls64(__u64 x)
|
|
|
|
{
|
2012-09-10 19:04:16 +08:00
|
|
|
int bitpos = -1;
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
/*
|
|
|
|
* AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
|
|
|
|
* dest reg is undefined if x==0, but their CPU architect says its
|
|
|
|
* value is written to set it to the same as before.
|
|
|
|
*/
|
2012-09-10 19:04:16 +08:00
|
|
|
asm("bsrq %1,%q0"
|
x86_64, asm: Optimise fls(), ffs() and fls64()
fls(N), ffs(N) and fls64(N) can be optimised on x86_64. Currently they use a
CMOV instruction after the BSR/BSF to set the destination register to -1 if the
value to be scanned was 0 (in which case BSR/BSF set the Z flag).
Instead, according to the AMD64 specification, we can make use of the fact that
BSR/BSF doesn't modify its output register if its input is 0. By preloading
the output with -1 and incrementing the result, we achieve the desired result
without the need for a conditional check.
The Intel x86_64 specification, however, says that the result of BSR/BSF in
such a case is undefined. That said, when queried, one of the Intel CPU
architects said that the behaviour on all Intel CPUs is that:
(1) with BSRQ/BSFQ, the 64-bit destination register is written with its
original value if the source is 0, thus, in essence, giving the effect we
want. And,
(2) with BSRL/BSFL, the lower half of the 64-bit destination register is
written with its original value if the source is 0, and the upper half is
cleared, thus giving us the effect we want (we return a 4-byte int).
Further, it was indicated that they (Intel) are unlikely to get away with
changing the behaviour.
It might be possible to optimise the 32-bit versions of these functions, but
there's a lot more variation, and so the effective non-destructive property of
BSRL/BSRF cannot be relied on.
[ hpa: specifically, some 486 chips are known to NOT have this property. ]
I have benchmarked these functions on my Core2 Duo test machine using the
following program:
#include <stdlib.h>
#include <stdio.h>
#ifndef __x86_64__
#error
#endif
#define PAGE_SHIFT 12
typedef unsigned long long __u64, u64;
typedef unsigned int __u32, u32;
#define noinline __attribute__((noinline))
static __always_inline int fls64(__u64 x)
{
long bitpos = -1;
asm("bsrq %1,%0"
: "+r" (bitpos)
: "rm" (x));
return bitpos + 1;
}
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
static __always_inline int old_fls64(__u64 x)
{
if (x == 0)
return 0;
return __fls(x) + 1;
}
static noinline // __attribute__((const))
int old_get_order(unsigned long size)
{
int order;
size = (size - 1) >> (PAGE_SHIFT - 1);
order = -1;
do {
size >>= 1;
order++;
} while (size);
return order;
}
static inline __attribute__((const))
int get_order_old_fls64(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = old_fls64(size);
return order;
}
static inline __attribute__((const))
int get_order(unsigned long size)
{
int order;
size--;
size >>= PAGE_SHIFT;
order = fls64(size);
return order;
}
unsigned long prevent_optimise_out;
static noinline unsigned long test_old_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += old_get_order(n);
}
}
return total;
}
static noinline unsigned long test_get_order_old_fls64(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order_old_fls64(n);
}
}
return total;
}
static noinline unsigned long test_get_order(void)
{
unsigned long n, total = 0;
long rep, loop;
for (rep = 1000000; rep > 0; rep--) {
for (loop = 0; loop <= 16384; loop += 4) {
n = 1UL << loop;
total += get_order(n);
}
}
return total;
}
int main(int argc, char **argv)
{
unsigned long total;
switch (argc) {
case 1: total = test_old_get_order(); break;
case 2: total = test_get_order_old_fls64(); break;
default: total = test_get_order(); break;
}
prevent_optimise_out = total;
return 0;
}
This allows me to test the use of the old fls64() implementation and the new
fls64() implementation and also to contrast these to the out-of-line loop-based
implementation of get_order(). The results were:
warthog>time ./get_order
real 1m37.191s
user 1m36.313s
sys 0m0.861s
warthog>time ./get_order x
real 0m16.892s
user 0m16.586s
sys 0m0.287s
warthog>time ./get_order x x
real 0m7.731s
user 0m7.727s
sys 0m0.002s
Using the current upstream fls64() as a basis for an inlined get_order() [the
second result above] is much faster than using the current out-of-line
loop-based get_order() [the first result above].
Using my optimised inline fls64()-based get_order() [the third result above]
is even faster still.
[ hpa: changed the selection of 32 vs 64 bits to use CONFIG_X86_64
instead of comparing BITS_PER_LONG, updated comments, rebased manually
on top of 83d99df7c4bf x86, bitops: Move fls64.h inside __KERNEL__ ]
Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@warthog.procyon.org.uk
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-12-13 22:56:54 +08:00
|
|
|
: "+r" (bitpos)
|
|
|
|
: "rm" (x));
|
|
|
|
return bitpos + 1;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
#include <asm-generic/bitops/fls64.h>
|
|
|
|
#endif
|
|
|
|
|
2010-09-29 17:08:50 +08:00
|
|
|
#include <asm-generic/bitops/find.h>
|
|
|
|
|
2008-04-05 02:49:30 +08:00
|
|
|
#include <asm-generic/bitops/sched.h>
|
|
|
|
|
2010-03-06 00:34:46 +08:00
|
|
|
#include <asm/arch_hweight.h>
|
|
|
|
|
|
|
|
#include <asm-generic/bitops/const_hweight.h>
|
2008-01-30 20:30:55 +08:00
|
|
|
|
2019-08-20 10:49:40 +08:00
|
|
|
#include <asm-generic/bitops/instrumented-atomic.h>
|
|
|
|
#include <asm-generic/bitops/instrumented-non-atomic.h>
|
|
|
|
#include <asm-generic/bitops/instrumented-lock.h>
|
2019-07-12 11:54:00 +08:00
|
|
|
|
bitops: introduce little-endian bitops for most architectures
Introduce little-endian bit operations to the big-endian architectures
which do not have native little-endian bit operations and the
little-endian architectures. (alpha, avr32, blackfin, cris, frv, h8300,
ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa)
These architectures can just include generic implementation
(asm-generic/bitops/le.h).
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24 07:42:02 +08:00
|
|
|
#include <asm-generic/bitops/le.h>
|
2008-04-05 02:49:30 +08:00
|
|
|
|
2011-07-27 07:09:04 +08:00
|
|
|
#include <asm-generic/bitops/ext2-atomic-setbit.h>
|
2008-04-05 02:49:30 +08:00
|
|
|
|
|
|
|
#endif /* __KERNEL__ */
|
2008-10-23 13:26:29 +08:00
|
|
|
#endif /* _ASM_X86_BITOPS_H */
|