Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.
This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.
The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).
This improves the performance of page lock intensive microbenchmarks by
2-3%.
Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Summary of modules changes for the 4.10 merge window:
* The rodata= cmdline parameter has been extended to additionally
apply to module mappings
* Fix a hard to hit race between module loader error/clean up
handling and ftrace registration
* Some code cleanups, notably panic.c and modules code use a
unified taint_flags table now. This is much cleaner than
duplicating the taint flag code in modules.c
Signed-off-by: Jessica Yu <jeyu@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYUf6/AAoJEMBFfjjOO8Fy5NoP+gOIus26yWWGymI495jVnX7n
wCga5JgwOL0SLBIPmiDVI7K+jz4eoQZb94eJcwkWDuw2/IvOdF1kB8ha1EOBRMSg
nb9HfIDlWiAPKkyUxe+k6XDb+BMPN3FUSYmBAKD3utsQkD1JWBLY8Id4e234y8Fo
sb3a6rLJbvIEXANrMeU7zO4/y1bVxQAeQPQbVPwlid5s76RKYH6JdGXoo6FKK0uE
Z3I8uQjqjmJ5U4vpjjWl0w+Qa7hIm/x05GpirtNxN6ztxjR+98c/4uRIry8oOX+I
KqRXDOnJ1l/rCwhp+pGLwPfCoDds+V3bknyOwYoxK3hqVVUAd8H0qd1JQ8XClwyJ
jnE0+EQpTt9brOO1Oq2XC+EDjpiuyYm3u91TFwE2VFmP98daBZsX6qY7bm03/GQq
ZLRthWPILNX9glGj4nbHQgdAKmRvYDO3SzWjFZNA75Mr2hbRKLJoWNvfgupDgjsF
giawxV/OcWXvEX92fzkwoUszpfWwoDhGsbimG2SCKYB87vNniG7wrgdjp5aWHhOL
qCUpUhCvE9/dO7kPRinqk5tnpAUGY2jMZ0QgVbpToF6FiHJJSyDjWHR9n0Bl1QTX
uAEZB/Hoav9frZ+MQC/1Yzhq5ejDbEm1ByjolJgbjl6YHBlQceL6NQpFmyEkrn7c
Tx+Q/PvG7/gfxFGMirf1
=bhCS
-----END PGP SIGNATURE-----
Merge tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.10 merge window:
- The rodata= cmdline parameter has been extended to additionally
apply to module mappings
- Fix a hard to hit race between module loader error/clean up
handling and ftrace registration
- Some code cleanups, notably panic.c and modules code use a unified
taint_flags table now. This is much cleaner than duplicating the
taint flag code in modules.c"
* tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: fix DEBUG_SET_MODULE_RONX typo
module: extend 'rodata=off' boot cmdline parameter to module mappings
module: Fix a comment above strong_try_module_get()
module: When modifying a module's text ignore modules which are going away too
module: Ensure a module's state is set accordingly during module coming cleanup code
module: remove trailing whitespace
taint/module: Clean up global and module taint flags handling
modpost: free allocated memory
Pull workqueue updates from Tejun Heo:
"Mostly patches to initialize workqueue subsystem earlier and get rid
of keventd_up().
The patches were headed for the last merge cycle but got delayed due
to a bug found late minute, which is fixed now.
Also, to help debugging, destroy_workqueue() is more chatty now on a
sanity check failure."
* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: move wq_numa_init() to workqueue_init()
workqueue: remove keventd_up()
debugobj, workqueue: remove keventd_up() usage
slab, workqueue: remove keventd_up() usage
power, workqueue: remove keventd_up() usage
tty, workqueue: remove keventd_up() usage
mce, workqueue: remove keventd_up() usage
workqueue: make workqueue available early during boot
workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but should be
more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to go...
Lots of plain-text files have also been converted and integrated.
- Images in binary formats have been replaced with more source-friendly
versions.
- Various bits of organizational work, including the renaming of various
files discussed at the kernel summit.
- New documentation for the device_link mechanism.
...and, of course, lots of typo fixes and small updates.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
pmSJR0hVeRUmA4uw6+Su
=A0EV
-----END PGP SIGNATURE-----
Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but
should be more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.
- Images in binary formats have been replaced with more
source-friendly versions.
- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.
- New documentation for the device_link mechanism.
... and, of course, lots of typo fixes and small updates"
* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...
AMD CPUs affected by the E400 erratum suffer from the issue that the
local APIC timer stops when the CPU goes into C1E. Unfortunately there
is no way to detect the affected CPUs on early boot. It's only possible
to determine the range of possibly affected CPUs from the family/model
range.
The actual decision whether to enter C1E and thus cause the bug is done
by the firmware and we need to detect that case late, after ACPI has
been initialized.
The current solution is to check in the idle routine whether the CPU is
affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the
K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is
affected and the system is switched into forced broadcast mode.
This is ineffective and on non-affected CPUs every entry to idle does
the extra RDMSR.
After doing some research it turns out that the bits are visible on the
boot CPU right after the ACPI subsystem is initialized in the early
boot process. So instead of polling for the bits in the idle loop, add
a detection function after acpi_subsystem_init() and check for the MSR
bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and
the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it
will stop in C1E state as well.
The switch to broadcast mode cannot be done at this point because the
boot CPU still uses HPET as a clockevent device and the local APIC timer
is not yet calibrated and installed. The switch to broadcast mode on the
affected CPUs needs to be done when the local APIC timer is actually set
up.
This allows to cleanup the amd_e400_idle() function in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The newly added 'rodata_enabled' global variable is protected by
the wrong #ifdef, leading to a link error when CONFIG_DEBUG_SET_MODULE_RONX
is turned on:
kernel/module.o: In function `disable_ro_nx':
module.c:(.text.unlikely.disable_ro_nx+0x88): undefined reference to `rodata_enabled'
kernel/module.o: In function `module_disable_ro':
module.c:(.text.module_disable_ro+0x8c): undefined reference to `rodata_enabled'
kernel/module.o: In function `module_enable_ro':
module.c:(.text.module_enable_ro+0xb0): undefined reference to `rodata_enabled'
CONFIG_SET_MODULE_RONX does not exist, so use the correct one instead.
Fixes: 39290b389e ("module: extend 'rodata=off' boot cmdline parameter to module mappings")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
The current "rodata=off" parameter disables read-only kernel mappings
under CONFIG_DEBUG_RODATA:
commit d2aa1acad2 ("mm/init: Add 'rodata=off' boot cmdline parameter
to disable read-only kernel mappings")
This patch is a logical extension to module mappings ie. read-only mappings
at module loading can be disabled even if CONFIG_DEBUG_SET_MODULE_RONX
(mainly for debug use). Please note, however, that it only affects RO/RW
permissions, keeping NX set.
This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory
(always-on) in the future as CONFIG_DEBUG_RODATA on x86 and arm64.
Suggested-by: and Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org
Signed-off-by: Jessica Yu <jeyu@redhat.com>
The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).
To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).
Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.
Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.
Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
Workqueue is currently initialized in an early init call; however,
there are cases where early boot code has to be split and reordered to
come after workqueue initialization or the same code path which makes
use of workqueues is used both before workqueue initailization and
after. The latter cases have to gate workqueue usages with
keventd_up() tests, which is nasty and easy to get wrong.
Workqueue usages have become widespread and it'd be a lot more
convenient if it can be used very early from boot. This patch splits
workqueue initialization into two steps. workqueue_init_early() which
sets up the basic data structures so that workqueues can be created
and work items queued, and workqueue_init() which actually brings up
workqueues online and starts executing queued work items. The former
step can be done very early during boot once memory allocation,
cpumasks and idr are initialized. The latter right after kthreads
become available.
This allows work item queueing and canceling from very early boot
which is what most of these use cases want.
* As systemd_wq being initialized doesn't indicate that workqueue is
fully online anymore, update keventd_up() to test wq_online instead.
The follow-up patches will get rid of all its usages and the
function itself.
* Flushing doesn't make sense before workqueue is fully initialized.
The flush functions trigger WARN and return immediately before fully
online.
* Work items are never in-flight before fully online. Canceling can
always succeed by skipping the flush step.
* Some code paths can no longer assume to be called with irq enabled
as irq is disabled during early boot. Use irqsave/restore
operations instead.
v2: Watchdog init, which requires timer to be running, moved from
workqueue_init_early() to workqueue_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
sprint_symbol_no_offset() returns the string "function_name
[module_name]" where [module_name] is not printed for built in kernel
functions. This means that the blacklisting code will fail when
comparing module function names with the extended string.
This patch adds the functionality to block a module's module_init()
function by finding the space in the string and truncating the
comparison to that length.
Link: http://lkml.kernel.org/r/1466124387-20446-1-git-send-email-prarit@redhat.com
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was only one use of __initdata_refok and __exit_refok
__init_refok was used 46 times against 82 for __ref.
Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")
This patch removes the following compatibility definitions and replaces
them treewide.
/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref
I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
When I replaced kasprintf("%pf") with a direct call to
sprint_symbol_no_offset I must have broken the initcall blacklisting
feature on the arches where dereference_function_descriptor() is
non-trivial.
Fixes: c8cdd2be21 (init/main.c: simplify initcall_blacklisted())
Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.
But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.
Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like
ti = alloc_thread_info_node(tsk, node);
...
tsk->stack = ti;
which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.
So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.
This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.
The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_ext_init() checks suitable pages with pfn_to_nid(), but
pfn_to_nid() depends on memmap which will not be setup fully until
page_alloc_init_late() is done. Use early_pfn_to_nid() instead of
pfn_to_nid() so that page extension could be still used early even
though CONFIG_ DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page
allocation call sites.
Suggested by Joonsoo Kim [1], this fix basically undoes the change
introduced by commit b8f1a75d61 ("mm: call page_ext_init() after all
struct pages are initialized") and fixes the same problem with a better
approach.
[1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com
Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using kasprintf to get the function name makes us look up the name
twice, along with all the vsnprintf overhead of parsing the format
string etc. It also means there is an allocation failure case to deal
with. Since symbol_string in vsprintf.c would anyway allocate an array
of size KSYM_SYMBOL_LEN on the stack, that might as well be done up
here.
Moreover, since this is a debug feature and the blacklisted_initcalls
list is usually empty, we might as well test that and thus avoid looking
up the symbol name even once in the common case.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk() takes some locks and could not be used a safe way in NMI
context.
The chance of a deadlock is real especially when printing stacks from
all CPUs. This particular problem has been addressed on x86 by the
commit a9edc88093 ("x86/nmi: Perform a safe NMI stack trace on all
CPUs").
The patchset brings two big advantages. First, it makes the NMI
backtraces safe on all architectures for free. Second, it makes all NMI
messages almost safe on all architectures (the temporary buffer is
limited. We still should keep the number of messages in NMI context at
minimum).
Note that there already are several messages printed in NMI context:
WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
handlers. These are not easy to avoid.
This patch reuses most of the code and makes it generic. It is useful
for all messages and architectures that support NMI.
The alternative printk_func is set when entering and is reseted when
leaving NMI context. It queues IRQ work to copy the messages into the
main ring buffer in a safe context.
__printk_nmi_flush() copies all available messages and reset the buffer.
Then we could use a simple cmpxchg operations to get synchronized with
writers. There is also used a spinlock to get synchronized with other
flushers.
We do not longer use seq_buf because it depends on external lock. It
would be hard to make all supported operations safe for a lockless use.
It would be confusing and error prone to make only some operations safe.
The code is put into separate printk/nmi.c as suggested by Steven
Rostedt. It needs a per-CPU buffer and is compiled only on
architectures that call nmi_enter(). This is achieved by the new
HAVE_NMI Kconfig flag.
The are MN10300 and Xtensa architectures. We need to clean up NMI
handling there first. Let's do it separately.
The patch is heavily based on the draft from Peter Zijlstra, see
https://lkml.org/lkml/2015/6/10/327
[arnd@arndb.de: printk-nmi: use %zu format string for size_t]
[akpm@linux-foundation.org: min_t->min - all types are size_t here]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> [arm part]
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patch-bomb from Andrew Morton:
- some misc things
- ofs2 updates
- about half of MM
- checkpatch updates
- autofs4 update
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
autofs4: fix string.h include in auto_dev-ioctl.h
autofs4: use pr_xxx() macros directly for logging
autofs4: change log print macros to not insert newline
autofs4: make autofs log prints consistent
autofs4: fix some white space errors
autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
autofs4: fix coding style line length in autofs4_wait()
autofs4: fix coding style problem in autofs4_get_set_timeout()
autofs4: coding style fixes
autofs: show pipe inode in mount options
kallsyms: add support for relative offsets in kallsyms address table
kallsyms: don't overload absolute symbol type for percpu symbols
x86: kallsyms: disable absolute percpu symbols on !SMP
checkpatch: fix another left brace warning
checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
checkpatch: warn on bare unsigned or signed declarations without int
checkpatch: exclude asm volatile from complex macro check
mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
mm: migrate: consolidate mem_cgroup_migrate() calls
mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
...
Use list_for_each_entry() instead of list_for_each() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cpu hotplug updates from Thomas Gleixner:
"This is the first part of the ongoing cpu hotplug rework:
- Initial implementation of the state machine
- Runs all online and prepare down callbacks on the plugged cpu and
not on some random processor
- Replaces busy loop waiting with completions
- Adds tracepoints so the states can be followed"
More detailed commentary on this work from an earlier email:
"What's wrong with the current cpu hotplug infrastructure?
- Asymmetry
The hotplug notifier mechanism is asymmetric versus the bringup and
teardown. This is mostly caused by the notifier mechanism.
- Largely undocumented dependencies
While some notifiers use explicitely defined notifier priorities,
we have quite some notifiers which use numerical priorities to
express dependencies without any documentation why.
- Control processor driven
Most of the bringup/teardown of a cpu is driven by a control
processor. While it is understandable, that preperatory steps,
like idle thread creation, memory allocation for and initialization
of essential facilities needs to be done before a cpu can boot,
there is no reason why everything else must run on a control
processor. Before this patch series, bringup looks like this:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring the rest up
- All or nothing approach
There is no way to do partial bringups. That's something which is
really desired because we waste e.g. at boot substantial amount of
time just busy waiting that the cpu comes to life. That's stupid
as we could very well do preparatory steps and the initial IPI for
other cpus and then go back and do the necessary low level
synchronization with the freshly booted cpu.
- Minimal debuggability
Due to the notifier based design, it's impossible to switch between
two stages of the bringup/teardown back and forth in order to test
the correctness. So in many hotplug notifiers the cancel
mechanisms are either not existant or completely untested.
- Notifier [un]registering is tedious
To [un]register notifiers we need to protect against hotplug at
every callsite. There is no mechanism that bringup/teardown
callbacks are issued on the online cpus, so every caller needs to
do it itself. That also includes error rollback.
What's the new design?
The base of the new design is a symmetric state machine, where both
the control processor and the booting/dying cpu execute a well
defined set of states. Each state is symmetric in the end, except
for some well defined exceptions, and the bringup/teardown can be
stopped and reversed at almost all states.
So the bringup of a cpu will look like this in the future:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring itself up
The synchronization step does not require the control cpu to wait.
That mechanism can be done asynchronously via a worker or some
other mechanism.
The teardown can be made very similar, so that the dying cpu cleans
up and brings itself down. Cleanups which need to be done after
the cpu is gone, can be scheduled asynchronously as well.
There is a long way to this, as we need to refactor the notion when a
cpu is available. Today we set the cpu online right after it comes
out of the low level bringup, which is not really correct.
The proper mechanism is to set it to available, i.e. cpu local
threads, like softirqd, hotplug thread etc. can be scheduled on that
cpu, and once it finished all booting steps, it's set to online, so
general workloads can be scheduled on it. The reverse happens on
teardown. First thing to do is to forbid scheduling of general
workloads, then teardown all the per cpu resources and finally shut it
off completely.
This patch series implements the basic infrastructure for this at the
core level. This includes the following:
- Basic state machine implementation with well defined states, so
ordering and prioritization can be expressed.
- Interfaces to [un]register state callbacks
This invokes the bringup/teardown callback on all online cpus with
the proper protection in place and [un]installs the callbacks in
the state machine array.
For callbacks which have no particular ordering requirement we have
a dynamic state space, so that drivers don't have to register an
explicit hotplug state.
If a callback fails, the code automatically does a rollback to the
previous state.
- Sysfs interface to drive the state machine to a particular step.
This is only partially functional today. Full functionality and
therefor testability will be achieved once we converted all
existing hotplug notifiers over to the new scheme.
- Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
processor:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
wait for boot
bring itself up
Signal completion to control cpu
In a previous step of this work we've done a full tree mechanical
conversion of all hotplug notifiers to the new scheme. The balance
is a net removal of about 4000 lines of code.
This is not included in this series, as we decided to take a
different approach. Instead of mechanically converting everything
over, we will do a proper overhaul of the usage sites one by one so
they nicely fit into the symmetric callback scheme.
I decided to do that after I looked at the ugliness of some of the
converted sites and figured out that their hotplug mechanism is
completely buggered anyway. So there is no point to do a
mechanical conversion first as we need to go through the usage
sites one by one again in order to achieve a full symmetric and
testable behaviour"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
cpu/hotplug: Document states better
cpu/hotplug: Fix smpboot thread ordering
cpu/hotplug: Remove redundant state check
cpu/hotplug: Plug death reporting race
rcu: Make CPU_DYING_IDLE an explicit call
cpu/hotplug: Make wait for dead cpu completion based
cpu/hotplug: Let upcoming cpu bring itself fully up
arch/hotplug: Call into idle with a proper state
cpu/hotplug: Move online calls to hotplugged cpu
cpu/hotplug: Create hotplug threads
cpu/hotplug: Split out the state walk into functions
cpu/hotplug: Unpark smpboot threads from the state machine
cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
cpu/hotplug: Implement setup/removal interface
cpu/hotplug: Make target state writeable
cpu/hotplug: Add sysfs state interface
cpu/hotplug: Hand in target state to _cpu_up/down
cpu/hotplug: Convert the hotplugged cpu work to a state machine
cpu/hotplug: Convert to a state machine for the control processor
cpu/hotplug: Add tracepoints
...
Pull read-only kernel memory updates from Ingo Molnar:
"This tree adds two (security related) enhancements to the kernel's
handling of read-only kernel memory:
- extend read-only kernel memory to a new class of formerly writable
kernel data: 'post-init read-only memory' via the __ro_after_init
attribute, and mark the ARM and x86 vDSO as such read-only memory.
This kind of attribute can be used for data that requires a once
per bootup initialization sequence, but is otherwise never modified
after that point.
This feature was based on the work by PaX Team and Brad Spengler.
(by Kees Cook, the ARM vDSO bits by David Brown.)
- make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
Kconfig option. This simplifies the kernel and also signals that
read-only memory is the default model and a first-class citizen.
(Kees Cook)"
* 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ARM/vdso: Mark the vDSO code read-only after init
x86/vdso: Mark the vDSO code read-only after init
lkdtm: Verify that '__ro_after_init' works correctly
arch: Introduce post-init read-only memory
x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
asm-generic: Consolidate mark_rodata_ro()
Handle the smpboot threads in the state machine.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the split out steps into a callback array and let the cpu_up/down
code iterate through the array functions. For now most of the
callbacks are asymmetric to resemble the current hotplug maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It may be useful to debug writes to the readonly sections of memory,
so provide a cmdline "rodata=off" to allow for this. This can be
expanded in the future to support "log" and "write" modes, but that
will need to be architecture-specific.
This also makes KDB software breakpoints more usable, as read-only
mappings can now be disabled on any kernel.
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/1455748879-21872-3-git-send-email-keescook@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep is initialized at compile time now. Get rid of lockdep_init().
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Krinkin <krinkin.m.u@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: mm-commits@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make obsolete_checksetup() return bool due to this particular function
only using either one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit adds the invocation of rcu_end_inkernel_boot() just before
init is invoked. This allows the CONFIG_RCU_EXPEDITE_BOOT Kconfig
option to do something useful and prepares for the upcoming
rcupdate.rcu_normal_after_boot kernel parameter.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.
Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need about ordering as all
we do is creating kernel threads. This can be done concurrently. And
singlethread is a useless limitation as well.
The workqueue engine already proposes generic unbound workqueues that
don't share these useless properties and handle well parallel jobs.
The only worrysome specific is their affinity to the node of the current
CPU. It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.
This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen reported the following;
My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:
VFS: file-max limit 8192 reached
The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.
4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467
Small differences with the patch applied and 4.1 but not enough to matter.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long reported that 24TB machines hit OOM during basic setup when
struct page initialisation was deferred. One approach is to initialise
memory on demand but it interferes with page allocator paths. This patch
creates dedicated threads to initialise memory before basic setup. It
then blocks on a rw_semaphore until completion as a wait_queue and counter
is overkill. This may be slower to boot but it's simplier overall and
also gets rid of a section mangling which existed so kswapd could do the
initialisation.
[akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Waiman Long <waiman.long@hp.com
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Scott Norton <scott.norton@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the driver core / firmware changes for 4.2-rc1.
A number of small changes all over the place in the driver core, and in
the firmware subsystem. Nothing really major, full details in the
shortlog. Some of it is a bit of churn, given that the platform driver
probing changes was found to not work well, so they were reverted.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlWNoCQACgkQMUfUDdst+ym4JACdFrrXoMt2pb8nl5gMidGyM9/D
jg8AnRgdW8ArDA/xOarULd/X43eA3J3C
=Al2B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the driver core / firmware changes for 4.2-rc1.
A number of small changes all over the place in the driver core, and
in the firmware subsystem. Nothing really major, full details in the
shortlog. Some of it is a bit of churn, given that the platform
driver probing changes was found to not work well, so they were
reverted.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
Revert "base/platform: Only insert MEM and IO resources"
Revert "base/platform: Continue on insert_resource() error"
Revert "of/platform: Use platform_device interface"
Revert "base/platform: Remove code duplication"
firmware: add missing kfree for work on async call
fs: sysfs: don't pass count == 0 to bin file readers
base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
base/platform: Remove code duplication
of/platform: Use platform_device interface
base/platform: Continue on insert_resource() error
base/platform: Only insert MEM and IO resources
firmware: use const for remaining firmware names
firmware: fix possible use after free on name on asynchronous request
firmware: check for file truncation on direct firmware loading
firmware: fix __getname() missing failure check
drivers: of/base: move of_init to driver_init
drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
sysfs: disambiguate between "error code" and "failure" in comments
driver-core: fix build for !CONFIG_MODULES
driver-core: make __device_attach() static
...
Commit 73f7d1ca32 "ACPI / init: Run acpi_early_init() before
timekeeping_init()" moved the ACPI subsystem initialization,
including the ACPI mode enabling, to an earlier point in the
initialization sequence, to allow the timekeeping subsystem
use ACPI early. Unfortunately, that resulted in boot regressions
on some systems and the early ACPI initialization was moved toward
its original position in the kernel initialization code by commit
c4e1acbb35 "ACPI / init: Invoke early ACPI initialization later".
However, that turns out to be insufficient, as boot is still broken
on the Tyan S8812 mainboard.
To fix that issue, split the ACPI early initialization code into
two pieces so the majority of it still located in acpi_early_init()
and the part switching over the platform into the ACPI mode goes into
a new function, acpi_subsystem_init(), executed at the original early
ACPI initialization spot.
That fixes the Tyan S8812 boot problem, but still allows ACPI
tables to be loaded earlier which is useful to the EFI code in
efi_enter_virtual_mode().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141
Fixes: 73f7d1ca32 "ACPI / init: Run acpi_early_init() before timekeeping_init()"
Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
THREAD_SIZE.
E.g. architecture hexagon may have page size 1M and thread size 4096.
This would lead to a division by zero in the calculation of max_threads.
With this patch the buggy code is moved to a separate function
set_max_threads. The error is not fixed.
After fixing the problem in a separate patch the new function can be
reused to adjust max_threads after adding or removing memory.
Argument mempages of function fork_init() is removed as totalram_pages is
an exported symbol.
The creation of separate patches for refactoring to a new function and for
fixing the logic was suggested by Ingo Molnar.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patchbomb from Andrew Morton:
- arch/sh updates
- ocfs2 updates
- kernel/watchdog feature
- about half of mm/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
Documentation: update arch list in the 'memtest' entry
Kconfig: memtest: update number of test patterns up to 17
arm: add support for memtest
arm64: add support for memtest
memtest: use phys_addr_t for physical addresses
mm: move memtest under mm
mm, hugetlb: abort __get_user_pages if current has been oom killed
mm, mempool: do not allow atomic resizing
memcg: print cgroup information when system panics due to panic_on_oom
mm: numa: remove migrate_ratelimited
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
s390: standardize mmap_rnd() usage
powerpc: standardize mmap_rnd() usage
mips: extract logic for mmap_rnd()
arm64: standardize mmap_rnd() usage
x86: standardize mmap_rnd() usage
arm: factor out mmap ASLR into mmap_rnd
...
Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which return 1 when
I/O mappings with pud/pmd are enabled on the kernel.
ioremap_huge_init() calls arch_ioremap_pud_supported() and
arch_ioremap_pmd_supported() to initialize the capabilities at boot-time.
A new kernel option "nohugeiomap" is also added, so that user can disable
the huge I/O map capabilities when necessary.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull RCU changes from Ingo Molnar:
"The main changes in this cycle were:
- changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.
- add in-kernel API to enable and disable expediting of normal RCU
grace periods.
- improve RCU's handling of (hotplug-) outgoing CPUs.
- NO_HZ_FULL_SYSIDLE fixes.
- tiny-RCU updates to make it more tiny.
- documentation updates.
- miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
cpu: Defer smpboot kthread unparking until CPU known to scheduler
rcu: Associate quiescent-state reports with grace period
rcu: Yet another fix for preemption and CPU hotplug
rcu: Add diagnostics to grace-period cleanup
rcutorture: Default to grace-period-initialization delays
rcu: Handle outgoing CPUs on exit from idle loop
cpu: Make CPU-offline idle-loop transition point more precise
rcu: Eliminate ->onoff_mutex from rcu_node structure
rcu: Process offlining and onlining only at grace-period start
rcu: Move rcu_report_unblock_qs_rnp() to common code
rcu: Rework preemptible expedited bitmask handling
rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
rcutorture: Enable slow grace-period initializations
rcu: Provide diagnostic option to slow down grace-period initialization
rcu: Detect stalls caused by failure to propagate up rcu_node tree
rcu: Eliminate empty HOTPLUG_CPU ifdef
rcu: Simplify sync_rcu_preempt_exp_init()
rcu: Put all orphan-callback-related code under same comment
rcu: Consolidate offline-CPU callback initialization
...
Pull trivial tree from Jiri Kosina:
"Usual trivial tree updates. Nothing outstanding -- mostly printk()
and comment fixes and unused identifier removals"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
goldfish: goldfish_tty_probe() is not using 'i' any more
powerpc: Fix comment in smu.h
qla2xxx: Fix printks in ql_log message
lib: correct link to the original source for div64_u64
si2168, tda10071, m88ds3103: Fix firmware wording
usb: storage: Fix printk in isd200_log_config()
qla2xxx: Fix printk in qla25xx_setup_mode
init/main: fix reset_device comment
ipwireless: missing assignment
goldfish: remove unreachable line of code
coredump: Fix do_coredump() comment
stacktrace.h: remove duplicate declaration task_struct
smpboot.h: Remove unused function prototype
treewide: Fix typo in printk messages
treewide: Fix typo in printk messages
mod_devicetable: fix comment for match_flags
Currently, smpboot_unpark_threads() is invoked before the incoming CPU
has been added to the scheduler's runqueue structures. This might
potentially cause the unparked kthread to run on the wrong CPU, since the
correct CPU isn't fully set up yet.
That causes a sporadic, hard to debug boot crash triggering on some
systems, reported by Borislav Petkov, and bisected down to:
2a442c9c64 ("x86: Use common outgoing-CPU-notification code")
This patch places smpboot_unpark_threads() in a CPU hotplug
notifier with priority set so that these kthreads are unparked just after
the CPU has been added to the runqueues.
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now we call ss->bind() in cgroup_init(), so cgroup_init() will
call cpuset_bind() and then the latter will access top_cpuset's
cpumask, which is NULL, because cpuset_init() is called after
cgroup_init()
The simplest fix is to swap cgroup_init() and cpuset_init().
Cc: Vladimir Davydov <vdavydov@parallels.com>
Fixes: 295458e672 ("cgroup: call cgroup_subsys->bind on cgroup subsys initialization")
Reported by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
CONFIG_INIT_FALLBACK adds config bloat without an obvious use case that
makes it worth keeping around. Delete it.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: Shuah Khan <shuah.kh@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The UP local API support can be set up from an early initcall. No need
for horrible hackery in the init code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20150115211703.827943883@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).
The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it allows to (finally) make nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns
as I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init(). By passing
in the trace_event= kernel command line parameter, tracepoints can be
enabled at boot up. For debugging things like the initialization of
interrupts, it is needed to have tracepoints enabled very early. People
have asked about this before and this has been on my todo list. As it
can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
pushing this now. This way he can add tracepoints into the IRQ set up
and have users enable them when things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered. If the irq code locks up or reboots the box, having the
tracepoint output go into the kernel ring buffer is useless for
debugging. But being able to add the tp_printk kernel command line
option along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas tried
them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjv3kAAoJEEjnJuOKh9ldLNsIANAe5EmDCBw0WjR72n+G3qOH
NC8calXfkjqHU0bv8Q3dRv20KH4MHOy6l4+EiV9/ovt71LOF3NEyUJ3HuShf9a8b
sWcUhYbX3D1hViQe5sOzv9AWhBCFlKQGoNmQnydX9xa8ivRsBaTGJIGktWlHcwBE
jF1i3fj3l3vRQSS8qZFXp3bzreunlGyPoSHcT6eWQeos+utj4sKwQWTLXTLQeM+6
oQtFKRx7E5yX04qO1qFczS8qIEC6JH2C2jIRYEKUGepaELlnGkb8O7jQV/RaLF4/
6P8VhZFG9YLS7fn7vWu0SnAN+Zwz5LzgjXAZt0FhGtIhLc18Oj8ouHH1UORsdQM=
=Z4Un
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"As the merge window is still open, and this code was not as complex as
I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init().
By passing in the trace_event= kernel command line parameter,
tracepoints can be enabled at boot up. For debugging things like
the initialization of interrupts, it is needed to have tracepoints
enabled very early. People have asked about this before and this
has been on my todo list. As it can be helpful for Thomas to debug
his upcoming 3.20 IRQ work, I'm pushing this now. This way he can
add tracepoints into the IRQ set up and have users enable them when
things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered.
If the irq code locks up or reboots the box, having the tracepoint
output go into the kernel ring buffer is useless for debugging.
But being able to add the tp_printk kernel command line option
along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas
tried them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"
* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add tp_printk cmdline to have tracepoints go to printk()
tracing: Move enabling tracepoints to just after rcu_init()
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().
This also allows trace_printk() to be used early too.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull security layer updates from James Morris:
"In terms of changes, there's general maintenance to the Smack,
SELinux, and integrity code.
The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT,
which allows IMA appraisal to require signatures. Support for reading
keys from rootfs before init is call is also added"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
selinux: Remove security_ops extern
security: smack: fix out-of-bounds access in smk_parse_smack()
VFS: refactor vfs_read()
ima: require signature based appraisal
integrity: provide a hook to load keys when rootfs is ready
ima: load x509 certificate from the kernel
integrity: provide a function to load x509 certificate from the kernel
integrity: define a new function integrity_read_file()
Security: smack: replace kzalloc with kmem_cache for inode_smack
Smack: Lock mode for the floor and hat labels
ima: added support for new kernel cmdline parameter ima_template_fmt
ima: allocate field pointers array on demand in template_desc_init_fields()
ima: don't allocate a copy of template_fmt in template_desc_init_fields()
ima: display template format in meas. list if template name length is zero
ima: added error messages to template-related functions
ima: use atomic bit operations to protect policy update interface
ima: ignore empty and with whitespaces policy lines
ima: no need to allocate entry for comment
ima: report policy load status
ima: use path names cache
...