Rework the generic version of the numa_node_id() function to use the new
generic percpu variable infrastructure.
Guard the new implementation with a new config option:
CONFIG_USE_PERCPU_NUMA_NODE_ID.
Archs which support this new implemention will default this option to 'y'
when NUMA is configured. This config option could be removed if/when all
archs switch over to the generic percpu implementation of numa_node_id().
Arch support involves:
1) converting any existing per cpu variable implementations to use
this implementation. x86_64 is an instance of such an arch.
2) archs that don't use a per cpu variable for numa_node_id() will
need to initialize the new per cpu variable "numa_node" as cpus
are brought on-line. ia64 is an example.
3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
when NUMA is configured. This is required because I have
retained the old implementation by default to allow archs to
be modified incrementally, as desired.
Subsequent patches will convert x86_64 and ia64 to use this implemenation.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an implementation of ->llseek useable for the rare special case
when userspace expects the seek to succeed but the (device) file is
actually not able to perform the seek. In this case you use noop_llseek()
instead of falling back to the default implementation of ->llseek.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are more architectures that don't support ARCH_HAS_SG_CHAIN than
those that support it. This removes removes ARCH_HAS_SG_CHAIN in
asm-generic/scatterlist.h and lets arhictectures to define it.
It's clearer than defining ARCH_HAS_SG_CHAIN asm-generic/scatterlist.h and
undefing it in arhictectures that don't support it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are only two ways to define sg_dma_len(); use sg->dma_length or
sg->length. This patch introduces NEED_SG_DMA_LENGTH that enables
architectures to choose sg->dma_length or sg->length.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first half of the attempt to use asm-generic/scatterlist.h
on every architecture.
There are only two ways to define scatterlist structure. So it's easy
to convert every architecture to use asm-generic/scatterlist.h.
This patch:
The trick for ISA_DMA_THRESHOLD in asm-generic/scatterlist.h doesn't work
for powerpc. This lets architectures defin ISA_DMA_THRESHOLD.
Hopefully, we can remove ISA_DMA_THRESHOLD in the future; we can do better
to decide if the bouncing is necessary or not.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The aio compat code was not converting the struct iovecs from 32bit to
64bit pointers, causing either EINVAL to be returned from io_getevents, or
EFAULT as the result of the I/O. This patch passes a compat flag to
io_submit to signal that pointer conversion is necessary for a given iocb
array.
A variant of this was tested by Michael Tokarev. I have also updated the
libaio test harness to exercise this code path with good success.
Further, I grabbed a copy of ltp and ran the
testcases/kernel/syscall/readv and writev tests there (compiled with -m32
on my 64bit system). All seems happy, but extra eyes on this would be
welcome.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org> [2.6.35.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and
writev AIO operations were not functioning properly. It turns out that
the code to convert the 32bit io vectors to 64 bits was never written.
The results of that can be pretty bad, but in my testing, it mostly ended
up in generating EFAULT as we walked off the list of I/O vectors provided.
This patch set fixes the problem in my environment. are greatly
appreciated.
This patch:
Factor out code that will be used by both compat_do_readv_writev and the
compat aio submission code paths.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org> [2.6.35.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 2.6.5, it had been commented, 'for backwards compatibility,
removed in 2.7.x'. Since 2.6.31, it have been marked as __deprecated.
I think that we can remove the API safely now.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_single_range_for_cpu and sync_single_range_for_device hooks are
unnecessary because sync_single_for_cpu and sync_single_for_device can
be used instead.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swiotlb_sync_single_range_for_cpu and swiotlb_sync_single_range_for_device
are unnecessary because swiotlb_sync_single_for_cpu and
swiotlb_sync_single_for_device can be used instead.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the definition of struct rnd_state and the inline
__seed() function to linux/random.h. It renames the static __random32()
function to prandom32() and exports it for use in modules.
prandom32() is useful as a privately-seeded pseudo random number generator
that can give the same result every time it is initialized.
For FCoE FC-BB-6 VN2VN mode self-selected unique FC address generation, we
need an pseudo-random number generator seeded with the 64-bit world-wide
port name. A truly random generator or one seeded with randomness won't
do because the same sequence of numbers should be generated each time we
boot or the link comes up.
A prandom32_seed() inline function is added to the header file. It is
inlined not for speed, but so the function won't be expanded in the base
kernel, but only in the module that uses it.
Signed-off-by: Joe Eykholt <jeykholt@cisco.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cosmetic, no changes in the compiled code. Just s/NULL/SIG_DFL/ to make
it more readable and grep-friendly.
Note: probably SIG_IGN makes more sense, we could kill ignore_signals().
But then kernel_init() should do flush_signal_handlers() before exec().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Mathias Krause <Mathias.Krause@secunet.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"statically initialize struct pid for swapper" commit 820e45db says:
Statically initialize a struct pid for the swapper process (pid_t == 0)
and attach it to init_task. This is needed so task_pid(), task_pgrp()
and task_session() interfaces work on the swapper process also.
OK, but:
- it doesn't make sense to add init_task.pids[].node into
init_struct_pid.tasks[], and in fact this just wrong.
idle threads are special, they shouldn't be visible on any
global list. In particular do_each_pid_task(init_struct_pid)
shouldn't see swapper.
This is the actual reason why kill(0, SIGKILL) from /sbin/init
(which starts with 0,0 special pids) crashes the kernel. The
signal sent to pgid/sid == 0 must never see idle threads, even
if the previous patch fixed the crash itself.
- we have other idle threads running on the non-boot CPUs, see
the next patch.
Change INIT_STRUCT_PID/INIT_PID_LINK to create the empty/unhashed
hlist_head/hlist_node. Like any other idle thread swapper can never exit,
so detach_pid()->__hlist_del() is not possible, but we could change
INIT_PID_LINK() to set pprev = &next if needed.
All we need is the valid swapper->pids[].pid == &init_struct_pid.
Reported-by: Mathias Krause <mathias.krause@secunet.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Mathias Krause <Mathias.Krause@secunet.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The trivial /sbin/init doing
int main(void)
{
kill(0, SIGKILL)
}
crashes the kernel.
This happens because __kill_pgrp_info(init_struct_pid) also sends SIGKILL
to the swapper process which runs with the uninitialized ->thread_group.
Change INIT_TASK() to initialize ->thread_group properly.
Note: the real problem is that the swapper process must not be visible to
signals, see the next patch. But this change is right anyway and fixes
the crash.
Reported-and-tested-by: Mathias Krause <mathias.krause@secunet.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Mathias Krause <Mathias.Krause@secunet.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On a system with a substantial number of processors, the early default
pid_max of 32k will not be enough. A system with 1664 CPU's, there are
25163 processes started before the login prompt. It's estimated that with
2048 CPU's we will pass the 32k limit. With 4096, we'll reach that limit
very early during the boot cycle, and processes would stall waiting for an
available pid.
This patch increases the early maximum number of pids available, and
increases the minimum number of pids that can be set during runtime.
[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Greg KH <gregkh@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: John Stoffel <john@stoffel.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add switch specific domain routines required for 16-bit routing support in
switches with hierarchical implementation of routing tables.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Modify the way how RapidIO switch operations are declared. Multiple
assignments through the linker script replaced by single initialization
call.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the functionality to enable Input receiver and Output transmitter of
every port, to allow non-maintenance traffic.
Signed-off-by: Thomas Moll <thomas.moll@sysgo.com>
Signed-off-by: Alexandre Bounine <abounine@tundra.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add RapidIO Port-Write message handling in the context of Error
Management Extensions Specification Rev.1.3.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Tested-by: Thomas Moll <thomas.moll@sysgo.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extentions to RapidIO switch support:
1. modify switch route operation declarations to allow using single
switch-specific file for family of switches that share the same route
table operations.
2. add standard route table operations for switches that that support
route table manipulation registers as defined in the Rev.1.3 of RapidIO
specification.
3. add clear-route-table operation for switches
4. add CPSxx and TSIxxx families of RapidIO switches
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Tested-by: Thomas Moll <thomas.moll@sysgo.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cacheline align the spinlock for sysv semaphores. Without the patch, the
spinlock and sem_otime [written by every semop that modified the array]
and sem_base [read in the hot path of try_atomic_semop()] can be in the
same cacheline.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes notifier_from_errno(0) to be NOTIFY_OK instead of
NOTIFY_STOP_MASK | NOTIFY_OK.
Currently, the notifiers which return encapsulated errno value have to
do something like this:
err = do_something(); // returns -errno
if (err)
return notifier_from_errno(err);
else
return NOTIFY_OK;
This change makes the above code simple:
err = do_something(); // returns -errno
return return notifier_from_errno(err);
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes, just s/atomic_t count/int nr_threads/.
With the recent changes this counter has a single user, get_nr_threads()
And, none of its callers need the really accurate number of threads, not
to mention each caller obviously races with fork/exit. It is only used to
report this value to the user-space, except first_tid() uses it to avoid
the unnecessary while_each_thread() loop in the unlikely case.
It is a bit sad we need a word in struct signal_struct for this, perhaps
we can change get_nr_threads() to approximate the number of threads using
signal->live and kill ->nr_threads later.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that task->signal can't go away get_nr_threads() doesn't need
->siglock to read signal->count.
Also, make it inline, move into sched.h, and convert 2 other proc users of
signal->count to use this (now trivial) helper.
Henceforth get_nr_threads() is the only valid user of signal->count, we
are ready to turn it into "int nr_threads" or, perhaps, kill it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the empty thread_group_cputime_free() helper. It was needed to free
the per-cpu data which we no longer have.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that task->signal can't go away we can revert the horrible hack added
by ad474caca3 ("fix for
account_group_exec_runtime(), make sure ->signal can't be freed under
rq->lock").
And we can do more cleanups sched_stats.h/posix-cpu-timers.c later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a lot of problems with accessing task_struct->signal, it can
"disappear" at any moment. Even current can't use its ->signal safely
after exit_notify(). ->siglock helps, but it is not convenient, not
always possible, and sometimes it makes sense to use task->signal even
after this task has already dead.
This patch adds the reference counter, sigcnt, into signal_struct. This
reference is owned by task_struct and it is dropped in
__put_task_struct(). Perhaps it makes sense to export
get/put_signal_struct() later, but currently I don't see the immediate
reason.
Rename __cleanup_signal() to free_signal_struct() and unexport it. With
the previous changes it does nothing except kmem_cache_free().
Change __exit_signal() to not clear/free ->signal, it will be freed when
the last reference to any thread in the thread group goes away.
Note:
- when the last thead exits signal->tty can point to nowhere, see
the next patch.
- with or without this patch signal_struct->count should go away,
or at least it should be "int nr_threads" for fs/proc. This will
be addressed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zap_other_threads() to return the number of other sub-threads found
on ->thread_group list.
Other changes are cosmetic:
- change the code to use while_each_thread() helper
- remove the obsolete comment about SIGKILL/SIGSTOP
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that nobody ever changes subprocess_info->cred we can kill this member
and related code. ____call_usermodehelper() always runs in the context of
freshly forked kernel thread, it has the proper ->cred copied from its
parent kthread, keventd.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
subprocess_info->cred in advance. Now that we have info->init() we can
change this code to set tgcred->session_keyring in context of execing
kernel thread.
Note: since currently call_usermodehelper_keys() is never called with
UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
are not really needed, we could rely on install_session_keyring_to_cred()
which does key_get() on success.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first patch in this series introduced an init function to the
call_usermodehelper api so that processes could be customized by caller.
This patch takes advantage of that fact, by customizing the helper in
do_coredump to create the pipe and set its core limit to one (for our
recusrsion check). This lets us clean up the previous uglyness in the
usermodehelper internals and factor call_usermodehelper out entirely.
While I'm at it, we can also modify the helper setup to look for a core
limit value of 1 rather than zero for our recursion check
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
feature in the kernel works. We had reports of several races, including
some reports of apps bypassing our recursion check so that a process that
was forked as part of a core_pattern setup could infinitely crash and
refork until the system crashed.
We fixed those by improving our recursion checks. The new check basically
refuses to fork a process if its core limit is zero, which works well.
Unfortunately, I've been getting grief from maintainer of user space
programs that are inserted as the forked process of core_pattern. They
contend that in order for their programs (such as abrt and apport) to
work, all the running processes in a system must have their core limits
set to a non-zero value, to which I say 'yes'. I did this by design, and
think thats the right way to do things.
But I've been asked to ease this burden on user space enough times that I
thought I would take a look at it. The first suggestion was to make the
recursion check fail on a non-zero 'special' number, like one. That way
the core collector process could set its core size ulimit to 1, and enable
the kernel's recursion detection. This isn't a bad idea on the surface,
but I don't like it since its opt-in, in that if a program like abrt or
apport has a bug and fails to set such a core limit, we're left with a
recursively crashing system again.
So I've come up with this. What I've done is modify the
call_usermodehelper api such that an extra parameter is added, a function
pointer which will be called by the user helper task, after it forks, but
before it exec's the required process. This will give the caller the
opportunity to get a call back in the processes context, allowing it to do
whatever it needs to to the process in the kernel prior to exec-ing the
user space code. In the case of do_coredump, this callback is ues to set
the core ulimit of the helper process to 1. This elimnates the opt-in
problem that I had above, as it allows the ulimit for core sizes to be set
to the value of 1, which is what the recursion check looks for in
do_coredump.
This patch:
Create new function call_usermodehelper_fns() and allow it to assign both
an init and cleanup function, as we'll as arbitrary data.
The init function is called from the context of the forked process and
allows for customization of the helper process prior to calling exec. Its
return code gates the continuation of the process, or causes its exit.
Also add an arbitrary data pointer to the subprocess_info struct allowing
for data to be passed from the caller to the new process, and the
subsequent cleanup process
Also, use this patch to cleanup the cleanup function. It currently takes
an argp and envp pointer for freeing, which is ugly. Lets instead just
make the subprocess_info structure public, and pass that to the cleanup
and init routines
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.
This patch changes the rotor to be initialized to a random node number of
the cpuset.
[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system. There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().
For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
allocates on odd nodes & skips even nodes).
An example is shown below. The program "lfile" writes a file consisting
of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:
# ./lfile
allocated on nodes: 2 4 6 0 1 2 6 0 2
There is a single rotor that is used for allocating both file pages & slab
pages. Writing the file allocates both a data page & a slab page
(buffer_head). This advances the RR rotor 2 nodes for each page
allocated.
A quick confirmation seems to confirm this is the cause of the uneven
allocation:
# echo 0 >/dev/cpuset/memory_spread_slab
# ./lfile
allocated on nodes: 6 7 8 9 0 1 2 3 4 5
This patch introduces a second rotor that is used for slab allocations.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we are unable to handle an error returned by
cftype.unregister_event() properly, let's make the callback
void-returning.
mem_cgroup_unregister_event() has been rewritten to be a "never fail"
function. On mem_cgroup_usage_register_event() we save old buffer for
thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
avoid allocation.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.
Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.
Now, at migrating mapped file, events happen in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit for the charge.
(IOW, PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark the new page's page_cgroup
as PCG_USED.
Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", because we should avoid to trust page_cgroup->mem_cgroup.
if PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
not on LRU or page_cgroup->mem_cgroup is NULL.)
We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.
BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.
New migration sequence with memcg is:
1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, new page's page_cgroup is now makrked as "USED"
We can catch 0->1 event and FILE_MAPPED will be properly updated.
And we can catch SWAPOUT event after unlock this and freeing this
page by unmap() can be caught.
7. Clear PageCgroupMigration of the old page.
So, FILE_MAPPED will be correctly updated.
Then, for what MIGRATION flag is ?
Without it, at migration failure, we may have to charge old page again
because it may be fully unmapped. "charge" means that we have to dive into
memory reclaim or something complated. So, it's better to avoid
charge it again. Before this patch, __commit_charge() was working for
both of the old/new page and fixed up all. But this technique has some
racy condtion around FILE_MAPPED and SWAPOUT etc...
Now, the kernel use MIGRATION flag and don't uncharge old page until
the end of migration.
I hope this change will make memcg's page migration much simpler. This
page migration has caused several troubles. Worth to add a flag for
simplification.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds support for moving charge of file pages, which include
normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
Unlike the case of anonymous pages, file pages(and swaps) in the range
mmapped by the task will be moved even if the task hasn't done page fault,
i.e. they might not be the task's "RSS", but other task's "RSS" that maps
the same file. And mapcount of the page is ignored(the page can be moved
even if page_mapcount(page) > 1). So, conditions that the page/swap
should be met to be moved is that it must be in the range mmapped by the
target task and it must be charged to the old cgroup.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A few architectures, like OMAP, allow you to set a debouncing time for the
gpio before generating the IRQ. Teach gpiolib about that.
Mark said:
: This would be generally useful for embedded systems, especially where
: the interrupt concerned is a wake source. It allows drivers to avoid
: spurious interrupts from noisy sources so if the hardware supports it
: the driver can avoid having to explicitly wait for the signal to become
: stable and software has to cope with fewer events. We've lived without
: it for quite some time, though.
David said:
: I looked at adding debounce support to the generic GPIO calls (and thus
: gpiolib) some time back, but decided against it. I forget why at this
: time (check list archives) but it wasn't because of lack of utility in
: certain contexts.
:
: One thing to watch out for is just how variable the hardware capabilities
: are. Atmel GPIOs have something like a fixed number of 32K clock cycles
: for debounce, twl4030 had something odd, OMAPs were more like the Atmel
: chips but with a different clock. In some cases debouncing had to be
: ganged, not per-GPIO. And so forth.
Signed-off-by: Felipe Balbi <felipe.balbi@nokia.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: David Brownell <david-b@pacbell.net>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gpiolib doesn't need to modify the names and I assume most initializers
use string constants that shouldn't be modified anyhow.
[akpm@linux-foundation.org: fix drivers/gpio/cs5535-gpio.c]
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Kevin Wells <kevin.wells@nxp.com>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the GPIO expanders supported by the max732x driver have interrupt
generation capability by reporting changes on input pins through an INT#
pin. This patch implements the irq_chip functionnality (edge detection
only).
Signed-off-by: Marc Zyngier <maz@misterjones.org>
Cc: Eric Miao <eric.y.miao@gmail.com>
Cc: Jebediah Huang <jebediah.huang@gmail.com>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a glue layer to support the sdhci driver on the ST SPEAr platform.
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Cc: <shiraz.hashim@st.com>
Cc: Linus Walleij <linus.ml.walleij@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SDIO specification allows RAW (Read after Write) operation using
IO_RW_DIRECT command (CMD52) by setting the RAW bit. This operation is
similar to ordinary read/write commands, except that both write and read
are performed using single command/response pair. The Linux SDIO layer
already supports this internaly, only external function is missing for
drivers to make use, which is added by this patch.
This type of command is required to implement proper power save mode
support in wl1251 wifi driver.
Android has similar patch for G1 in it's tree for the same reason:
http://android.git.kernel.org/?p=kernel/common.git;a=commitdiff;h=74a47786f6ecbe6c1cf9fb15efe6a968451deb52
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Acked-by: Kalle Valo <kalle.valo@iki.fi>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even though many mmc host drivers pass a pm_message_t argument to
mmc_suspend_host() that argument isn't used the by MMC core. As host
drivers are converted to dev_pm_ops they'll have to construct
pm_message_t's (as they won't be passed by the PM subsystem any more) just
to appease the mmc suspend interface.
We might as well just delete the unused paramter.
Signed-off-by: Matt Fleming <matt@console-pimps.org>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>ZZ
Acked-by: Sascha Sommer <saschasommer@freenet.de>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MMCIF is the MMC Host Interface in SuperH.
Signed-off-by: Yusuke Goda <yusuke.goda.sx@renesas.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This includes platform ops, quirks and (de)initialization callbacks.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Cc: Richard Röjfors <richard.rojfors@pelagicore.com>
Cc: David Vrabel <david.vrabel@csr.com>
Cc: Pierre Ossman <pierre@ossman.eu>
Cc: Ben Dooks <ben@simtec.co.uk>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UAC2 devices have their information about pitch control stored in a
different field. Parse it, and emulate the bits for a v1 device.
A new struct uac2_iso_endpoint_descriptor is added.
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh
This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.
After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register a device newly added pcf50633-backlight driver as a child device in
the pcf50633 core driver.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
This patch adds a backlight driver for controlling the pcf50633 LED module.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
This is S6E63M0 AMOLED LCD Panel(480x800) driver using 3-wired SPI
interface also almost features for lcd panel driver has been implemented
in here. and I added new structure common for all the lcd panel drivers
to include/linux/lcd.h file.
LCD Panel driver needs interfaces for controlling device power such as
power on/off and reset. these interfaces are device specific so it should
be implemented to machine code at this time, we should create new
structure for registering these functions as callbacks and also a header
file for that structure and finally registered callback functions would be
called by lcd panel driver. such header file(including new structure for
lcd panel) would be added for all the lcd panel drivers.
If anyone provides common structure for registering such callback
functions then we could reduce unnecessary header files for lcd panel. I
thought that suitable anyone could be include/linux/lcd.h so a new
lcd_platform_data structure was added there.
[akpm@linux-foundation.org: coding-style fixes]
[randy.dunlap@oracle.com: fix s6e63m0 kconfig]
[randy.dunlap@oracle.com: fix device attribute functions return types]
Signed-off-by: InKi Dae <inki.dae@samsung.com>
Reviewed-by: KyungMin Park <kyungmin.park.samsung.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
This reverts commit b3b77c8cae, which was
also totally broken (see commit 0d2daf5cc8 that reverted the crc32
version of it). As reported by Stephen Rothwell, it causes problems on
big-endian machines:
> In file included from fs/jfs/jfs_types.h:33,
> from fs/jfs/jfs_incore.h:26,
> from fs/jfs/file.c:22:
> fs/jfs/endian24.h:36:101: warning: "__LITTLE_ENDIAN" is not defined
The kernel has never had that crazy "__BYTE_ORDER == __LITTLE_ENDIAN"
model. It's not how we do things, and it isn't how we _should_ do
things. So don't go there.
Requested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
The ADP8860 combines a programmable backlight LED charge pump driver with
automatic phototransistor control.
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
This add basic led support for Freescale MC13783 PMIC.
Signed-off-by: Philippe Rétornaz <philippe.retornaz@epfl.ch>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
The leds-gpio blink_set() callback follows the same prototype as the
main leds subsystem blink_set() one.
The problem is that to stop blink, normally, a leds driver does it
in the brightness_set() callback when asked to set a new fixed value.
However, with leds-gpio, the platform has no hook to do so, as this
later callback results in a standard GPIO manipulation.
This changes the leds-gpio specific callback to take a new argument
that indicates whether the LED should be blinking or not and in what
state it should be set if not. We also update the dns323 platform
which seems to be the only user of this so far.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Richard Purdie <rpurdie@linux.intel.com>
Sparse complains because these one-bit bitfields are signed.
include/net/sctp/structs.h:879:24: error: dubious one-bit signed bitfield
include/net/sctp/structs.h:889:31: error: dubious one-bit signed bitfield
include/net/sctp/structs.h:895:26: error: dubious one-bit signed bitfield
include/net/sctp/structs.h:898:31: error: dubious one-bit signed bitfield
include/net/sctp/structs.h:901:27: error: dubious one-bit signed bitfield
It doesn't cause a problem in the current code, but it would be better
to clean it up. This was introduced by c0058a35aacc7: "sctp: Save some
room in the sctp_transport by using bitfields".
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the cls_cgroup module is not loaded, task_cls_classid will
return an uninitialised classid instead of zero.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
drivers/net/usb/asix.c: Fix pointer cast.
be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
proc_dointvec: write a single value
hso: add support for new products
Phonet: fix potential use-after-free in pep_sock_close()
ath9k: remove VEOL support for ad-hoc
ath9k: change beacon allocation to prefer the first beacon slot
sock.h: fix kernel-doc warning
cls_cgroup: Fix build error when built-in
macvlan: do proper cleanup in macvlan_common_newlink() V2
be2net: Bug fix in init code in probe
net/dccp: expansion of error code size
ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
wireless: fix sta_info.h kernel-doc warnings
wireless: fix mac80211.h kernel-doc warnings
iwlwifi: testing the wrong variable in iwl_add_bssid_station()
ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
ath9k_htc: dereferencing before check in hif_usb_tx_cb()
rt2x00: Fix rt2800usb TX descriptor writing.
rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
...
Add dump_id libata.force parameter. If specified, libata dumps full
IDENTIFY data during device configuration. This is to aid debugging.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Larry Baker <baker@usgs.gov>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Make BMDMA optional depending on new config variable CONFIG_ATA_BMDMA.
In Kconfig, drivers are grouped into five groups - non-SFF native, SFF
w/ custom DMA interface, SFF w/ BMDMA, PIO-only SFF, and generic
fallback / legacy ones. Kconfig and Makefile are reorganized
according to the groups and ordered alphabetically inside each group.
ata_ioports.bmdma_addr and ata_port.bmdma_prd[_dma] are put into
CONFIG_ATA_BMDMA, as are all bmdma related ops, variables and
functions.
This increase the binary size slightly when BMDMA is enabled but on
both native-only and PIO-only configurations the size is slightly
reduced. Either way, the size difference is insignificant. This
change is more meaningful to signify the separation between SFF and
BMDMA and as a tool to verify the separation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Separate out ata_pci_bmdma_prepare_host() and ata_pci_bmdma_init_one()
from their SFF counterparts. SFF ones no longer try to initialize
BMDMA or set PCI master.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Separate out BMDMA irq handler from SFF irq handler. The misnamed
host_intr() functions are renamed to ata_sff_port_intr() and
ata_bmdma_port_intr(). Common parts are factored into
__ata_sff_port_intr() and __ata_sff_interrupt() and used by sff and
bmdma interrupt routines.
All BMDMA drivers now use ata_bmdma_interrupt() or
ata_bmdma_port_intr() while all non-BMDMA SFF ones use
ata_sff_interrupt() or ata_sff_port_intr().
For now, ata_pci_sff_init_one() uses ata_bmdma_interrupt() as it's
used by both SFF and BMDMA drivers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
ata_sff_irq_clear() is BMDMA specific. Rename it to
ata_bmdma_irq_clear(), move it to ata_bmdma_port_ops and make
->sff_irq_clear() optional.
Note: ata_bmdma_irq_clear() is actually only needed by ata_piix and
possibly by sata_sil. This should be moved to respective low
level drivers later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/nes: Fix incorrect unlock in nes_process_mac_intr()
RDMA/nes: Async event for closed QP causes crash
RDMA/nes: Have ethtool read hardware registers for rx/tx stats
RDMA/cxgb4: Only insert sq qid in lookup table
RDMA/cxgb4: Support IB_WR_READ_WITH_INV opcode
RDMA/cxgb4: Set fence flag for inv-local-stag work requests
RDMA/cxgb4: Update some HW limits
RDMA/cxgb4: Don't limit fastreg page list depth
RDMA/cxgb4: Return proper errors in fastreg mr/pbl allocation
RDMA/cxgb4: Fix overflow bug in CQ arm
RDMA/cxgb4: Optimize CQ overflow detection
RDMA/cxgb4: CQ size must be IQ size - 2
RDMA/cxgb4: Register RDMA provider based on LLD state_change events
RDMA/cxgb4: Detach from the LLD after unregistering RDMA device
IB/ipath: Remove support for QLogic PCIe QLE devices
IB/qib: Add new qib driver for QLogic PCIe InfiniBand adapters
IB/mad: Make needlessly global mad_sendq_size/mad_recvq_size static
IB/core: Allow device-specific per-port sysfs files
mlx4_core: Clean up mlx4_alloc_icm() a bit
mlx4_core: Fix possible chunk sg list overflow in mlx4_alloc_icm()
* 'next-spi' of git://git.secretlab.ca/git/linux-2.6:
spi/xilinx: Fix compile error
spi/davinci: Fix clock prescale factor computation
spi: move bitbang txrx utility functions to private header
spi/mpc5121: Add SPI master driver for MPC5121 PSC
powerpc/mpc5121: move PSC FIFO memory init to platform code
spi/ep93xx: implemented driver for Cirrus EP93xx SPI controller
Documentation/spi/* compile warning fix
spi/omap2_mcspi: Check params before dereference or use
spi/omap2_mcspi: add turbo mode support
spi/omap2_mcspi: change default DMA_MIN_BYTES value to 160
spi/pl022: fix stop queue procedure
spi/pl022: add support for the PL023 derivate
spi/pl022: fix up differences between ARM and ST versions
spi/spi_mpc8xxx: Do not use map_tx_dma to unmap rx_dma
spi/spi_mpc8xxx: Fix QE mode Litte Endian
spi/spi_mpc8xxx: fix potential memory corruption.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6:
regulator: return set_mode is same mode is requested
Regulators: ab3100/bq24022: add a missing .owner field in regulator_desc
twl6030: regulator: Remove vsel tables and use formula for calculation
mc13783-regulator: fix vaild voltage range checking for mc13783_fixed_regulator_set_voltage
regulator: use voltage number array in 88pm860x
regulator: make 88pm860x sharing one driver structure
regulator: simplify regulator_register() error handling
regulator: fix unset_regulator_supplies() to remove all matches
regulator: prevent registration of matching regulator consumer supplies
regulator: Allow regulator-regulator supplies to be specified by name
There are 2 major changes from v0.81 to v0.7:
1. Consolidating the SPIB/I2CB tables into a new DEVS table,
which is more expandable and can support other bus types
than spi/i2c.
2. Creating a new GPIO table, which list all the GPIO pins
used in the platform.
However, to avoid breaking current platforms who use SFI v0.7
version firmware, the definitions for SPIB/I2CB will still
be kept for a while
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
FBIO_WAITFORVSYNC is currently implemented by matroxfb, atyfb, intelfb and
more. All of them keep redefining the same FBIO_WAITFORVSYNC macro over
and over again, so move it to linux/fb.h and clean up those duplicate
defines.
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Cc: Ville Syrjala <syrjala@sci.fi>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Maik Broemme <mbroemme@plusserver.de>
Cc: Petr Vandrovec <vandrove@vc.cvut.cz>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: "Hiremath, Vaibhav" <hvaibhav@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This work includes the following:
- Implement handler for FBIO_WAITFORVSYNC ioctl.
- Allocate the data and palette buffers separately. A consequence of
this is that the palette and data loading is now done in different
phases. And that the LCD must be disabled temporarily after the palette
is loaded but this will only happen once after init and each time the
palette is changed. I think this is OK.
- Allocate two (ping and pong) framebuffers from memory.
- Add pan_display handler which toggles the LCDC DMA registers between
the ping and pong buffers.
Signed-off-by: Martin Ambrose <martin@ti.com>
Cc: Chaithrika U S <chaithrika@ti.com>
Cc: Sudhakar Rajashekhara <sudhakar.raj@ti.com>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Content for the 8bit device threaded interrupt handlers. Depending on the
interrupt line and chip configuration, either click or wakeup / freefall
handler is called. In case of click, BTN_ event is sent via input device.
In case of wakeup or freefall, input device ABS_ events are updated
immediatelly.
It is still possible to configure interrupt line 1 for fast freefall
detection and use the second line either for click or threshold based
interrupts. Or both lines can be used for click / threshold interrupts.
Polled input device can be set to stopped state and still get coordinate
updates via input device using interrupt based method. Polled mode and
interrupt mode can also be used parallel.
BTN_ events are remapped based on existing axis remapping information.
Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com>
Acked-by: Eric Piel <eric.piel@tremplin-utc.net>
Cc: Daniel Mack <daniel@caiaq.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Original lis3 driver didn't provide interrupt handler(s) for click or
threshold event handling. This patch adds threaded handlers for one or
two interrupt lines for 8 bit device. Actual content for interrupt
handling is provided in the separate patch.
Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com>
Tested-by: Daniel Mack <daniel@caiaq.de>
Acked-by: Eric Piel <eric.piel@tremplin-utc.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8 bit device has two wakeup / free fall units. It was not possible to
configure the second unit. This patch introduces configuration entry to
the platform data and also corresponding changes to the 8 bit setup
function.
High pass filters were enabled by default. Patch introduces configuration
option for high pass filter cut off frequency and also possibility to
disable or enable the filter via platform data. Since the control is a
new one and default state was filter enabled, new option is used to
disable the filter. This way old platform data is still compatible with
the change.
Signed-off-by: Samu Onkalo <samu.p.onkalo@nokia.com>
Acked-by: Eric Piel <eric.piel@tremplin-utc.net>
Tested-by: Daniel Mack <daniel@caiaq.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hex_to_bin() is a little method which converts hex digit to its actual
value. There are plenty of places where such functionality is needed.
[akpm@linux-foundation.org: use tolower(), saving 3 bytes, test the more common case first - it's quicker]
[akpm@linux-foundation.org: relocate tolower to make it even faster! (Joe)]
Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com>
Cc: Tilman Schmidt <tilman@imap.cc>
Cc: Duncan Sands <duncan.sands@free.fr>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org>
Cc: John W. Linville <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, all users of ratelimit_state allocates it statically, so
DEFINE_RATELIMIT_STATE() is enough. But, I want to use ratelimit_state
for fs, i.e. per super_block to suppress too many error reports.
So, this adds ratelimit_state_init() to initialize ratelimite_state
which is dynamically allocated, instead of opencoding.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ratelimit_state initialization of printk_ratelimited() seems broken. This
fixes it by using DEFINE_RATELIMIT_STATE() to initialize spinlock
properly.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
32-bit Sparc used to only allow usage of 24-bit of it's atomic_t type.
This was corrected with linux 2.6.3 when Keith M Wesolowski changed the
implementation to use the parisc approach of having an array of spinlocks
to protect the atomic_t.
These warnings were also removed from the sparc implementation when the
new implementation was merged in BKrev:402e4949VThdc6D3iaosSFUgabMfvw, but
the warning still remained in some other places without any 24-bit-only
atomic_t implementation inside the kernel.
We should remove these warnings to allow users to rely on the full 32-bit
range of atomic_t.
Signed-off-by: Peter Fritzsche <peter.fritzsche@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.
- Make SHRT_MIN of type s16, not int, for consistency.
[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux does not define __BYTE_ORDER in its endian header files which makes
some header files bend backwards to get at the current endian. Lets
#define __BYTE_ORDER in big_endian.h/litte_endian.h to make it easier for
header files that are used in user space too.
In userspace the convention is that
1. _both_ __LITTLE_ENDIAN and __BIG_ENDIAN are defined,
2. you have to test for e.g. __BYTE_ORDER == __BIG_ENDIAN.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __must_check to error pointer handlers to have the compiler warn about
mistakes like:
if (err)
ERR_PTR(err);
It found two bugs:
Mar 12 Nikula Jani [PATCH] enclosure: fix error path - actually return ERR_PTR() on error
Mar 12 Nikula Jani [PATCH] sunrpc: fix error path - actually return ERR_PTR() on error
Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This ensures that platforms with lowmem PAs above 32 bits work correctly
by avoiding truncating the PA during a left shift.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Barry Song <21cnbao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add global mutex zonelists_mutex to fix the possible race:
CPU0 CPU1 CPU2
(1) zone->present_pages += online_pages;
(2) build_all_zonelists();
(3) alloc_page();
(4) free_page();
(5) build_all_zonelists();
(6) __build_all_zonelists();
(7) zone->pageset = alloc_percpu();
In step (3,4), zone->pageset still points to boot_pageset, so bad
things may happen if 2+ nodes are in this state. Even if only 1 node
is accessing the boot_pageset, (3) may still consume too much memory
to fail the memory allocations in step (7).
Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
since there is a new fresh memory block added in step(6).
[haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For each new populated zone of hotadded node, need to update its pagesets
with dynamically allocated per_cpu_pageset struct for all possible CPUs:
1) Detach zone->pageset from the shared boot_pageset
at end of __build_all_zonelists().
2) Use mutex to protect zone->pageset when it's still
shared in onlined_pages()
Otherwises, multiple zones of different nodes would share same boot strapping
boot_pageset for same CPU, which will finally cause below kernel panic:
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:1239!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
[<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
[<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
[<ffffffff81128407>] __page_cache_alloc+0x67/0x70
[<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
[<ffffffff81132751>] ra_submit+0x21/0x30
[<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
[<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
[<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
[<ffffffff81266cfa>] nfs_file_read+0xca/0x130
[<ffffffff8117b20a>] do_sync_read+0xfa/0x140
[<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
[<ffffffff8117c151>] sys_read+0x51/0x80
[<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
RSP <ffff88000d1e78a8>
---[ end trace 4bda28328b9990db ]
[akpm@linux-foundation.org: merge fix]
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Got this while compiling for ARM/SA1100:
mm/sparse.c: In function '__section_nr':
mm/sparse.c:135: warning: 'root' is used uninitialized in this function
This patch follows Russell King's suggestion for a new calculation for
NR_SECTION_ROOTS. Thanks also to Sergei Shtylyov for pointing out the
existence of the macro DIV_ROUND_UP.
Atsushi Nemoto observed:
: This fix doesn't just silence the warning - it fixes a real problem.
:
: Without this fix, mem_section[] might have 0 size so mem_section[0]
: will share other variable area. For example, I got:
:
: c030c700 b __warned.16478
: c030c700 B mem_section
: c030c701 b __warned.16483
:
: This might cause very strange behavior. Your patch actually fixes it.
Signed-off-by: Marcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br>
Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Sergei Shtylyov <sshtylyov@mvista.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In f4112de6b6 ("mm: introduce
debug_kmap_atomic") I said that debug_kmap_atomic() needs
CONFIG_TRACE_IRQFLAGS_SUPPORT.
It was wrong. (I thought irqs_disabled() is only available when the
architecture has CONFIG_TRACE_IRQFLAGS_SUPPORT)
Remove the #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT check to enable
kmap_atomic() debugging for the architectures which do not have
CONFIG_TRACE_IRQFLAGS_SUPPORT.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add parenthesis in a define. This doesn't change functionality.
checkpatch errors:
1) white space fixes
2) add spaces after comas
Signed-off-by: matt mooney <mfm@muteddisk.com>
Cc: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix minor spelling errors in a few comments; no code changes.
Signed-off-by: matt mooney <mfm@muteddisk.com>
Cc: Dan Carpenter <error27@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable users to online CPUs even if the CPUs belongs to a numa node which
doesn't have onlined local memory.
The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either
in system boot/init period, or at the time of local memory online. For a
numa node without onlined local memory, its zonelists are not initialized
at present. As a result, any memory allocation operations executed by
CPUs within this node will fail. In fact, an out-of-memory error is
triggered when attempt to online CPUs before memory comes to online.
This patch tries to create zonelists for such numa nodes, so that the
memory allocation for this node can be fallback'ed to other nodes.
[akpm@linux-foundation.org: remove unneeded export]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: minskey guo<chaohong.guo@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, we have global isolation vs. memory control group isolation, do
not allow the reclaim entry function to set an arbitrary page isolation
callback, we do not need that flexibility.
And since we already pass around the group descriptor for the memory
control group isolation case, just use it to decide which one of the two
isolator functions to use.
The decisions can be merged into nearby branches, so no extra cost there.
In fact, we save the indirect calls.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons as to why
o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met
In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6. If compaction succeeds, the defer
counters are reset again.
The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone. To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache. This would impact the fast-paths and is not justified at
this time.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation. If the index is below 500, memory will not
be compacted. This choice is arbitrary and not based on data. To help
optimise the system and set a sensible default for this value, this patch
adds a sysctl extfrag_threshold. The kernel will only compact memory if
the fragmentation index is above the extfrag_threshold.
[randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation. With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed. Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap. If compaction fails to free a page of a suitable size, then
reclaim will still occur.
Direct compaction returns as soon as possible. As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.
[akpm@linux-foundation.org: Fix build errors]
[aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a per-node sysfs file called compact. When the file is written to,
each zone in that node is compacted. The intention that this would be
used by something like a job scheduler in a batch system before a job
starts so that the job can allocate the maximum number of hugepages
without significant start-up cost.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>