This was a temporary thing for 2.6.16.
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have noticed lockups during boot when stress testing kexec on ppc64.
Two cpus would deadlock in scheduler code trying to grab already taken
spinlocks.
The double_rq_lock code uses the address of the runqueue to order the
taking of multiple locks. This address is a per cpu variable:
if (rq1 < rq2) {
spin_lock(&rq1->lock);
spin_lock(&rq2->lock);
} else {
spin_lock(&rq2->lock);
spin_lock(&rq1->lock);
}
On the other hand, the code in wake_sleeping_dependent uses the cpu id
order to grab locks:
for_each_cpu_mask(i, sibling_map)
spin_lock(&cpu_rq(i)->lock);
This means we rely on the address of per cpu data increasing as cpu ids
increase. While this will be true for the generic percpu implementation it
may not be true for arch specific implementations.
One way to solve this is to always take runqueues in cpu id order. To do
this we add a cpu variable to the runqueue and check it in the
double runqueue locking functions.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When on_each_cpu() runs the callback on other CPUs, it runs with local
interrupts disabled. So we should run the function with local interrupts
disabled on this CPU, too.
And do the same for UP, so the callback is run in the same environment on both
UP and SMP. (strictly it should do preempt_disable() too, but I think
local_irq_disable is sufficiently equivalent).
Also uninlines on_each_cpu(). softirq.c was the most appropriate file I could
find, but it doesn't seem to justify creating a new file.
Oh, and fix up that comment over (under?) x86's smp_call_function(). It
drives me nuts.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A bare bones trivial patch to ensure we always get -EINVAL on the
unsupported cases for sys_unshare. If this goes in before 2.6.16 it allows
us to forward compatible with future applications using sys_unshare.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: JANAK DESAI <janak@us.ibm.com>
Cc: <stable@kerenl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the sleep_avg multiplier. This multiplier was necessary back when
we had 10 seconds of dynamic range in sleep_avg, but now that we only have
one second, it causes that one second to be compressed down to 100ms in
some cases. This is particularly noticeable when compiling a kernel in a
slow NFS mount, and I believe it to be a very likely candidate for other
recently reported network related interactivity problems.
In testing, I can detect no negative impact of this removal.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The module files, refcnt, version, and srcversion did not properly
increment the owner's module reference count, allowing the modules to
be removed while the files were open, causing oopses.
This patch fixes this, and also fixes the problem that the version and
srcversion files were not showing up, unless CONFIG_MODULE_UNLOAD was
enabled, which is not correct.
Cc: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As the RCU symbols are going to be changed to GPL in the near future,
lets warn users that this is going to happen.
Cc: Paul McKenney <paulmck@us.ibm.com>
Acked-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds the ability to mark symbols that will be changed in the
future, so that kernel modules that don't include MODULE_LICENSE("GPL")
and use the symbols, will be flagged and printed out to the system log.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Moving uevent_seqnum and uevent_helper to kobject_uevent.c
because they are used even if CONFIG_SYSFS=n
while kernel/ksysfs.c is built only if CONFIG_SYSFS=y,
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
testcase:
mount /dev/sdb10 /mnt
touch /mnt/tmp/b
umount /mnt
mount /dev/sdb10 /mnt
rm /mnt/tmp/b </mnt/tmp/b
umount /mnt
and watch blkdev_ioc line in /proc/slabinfo. Vanilla kernel leaks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sys_unshare() does mmput(new_mm). This is not enough if we have
mm->core_waiters.
This patch is a temporary fix for soon to be released 2.6.16.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
[ Checked with Uli: "I'm not planning to use unshare(CLONE_VM). It's
not needed for any functionality planned so far. What we (as in Red
Hat) need unshare() for now is the filesystem side." ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the posix-timer signal is ignored then the timer is rearmed by the
callback function. The requeue pending accounting has to be fixed up else
the state might be wrong.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pointer to the current time interpolator and the current list of time
interpolators are typically only changed during bootup. Adding
__read_mostly takes them away from possibly hot cachelines.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sighand pointer only needs the rcu_read_lock on the
read side. So only depending on task_lock protection
when setting this pointer is not enough. We also need
a memory barrier to ensure the initialization is seen first.
Use rcu_assign_pointer as it does this for us, and clearly
documents that we are setting an rcu readable pointer.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes alternate signal stack corruption among cloned threads
with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6.
The value of alternate signal stack is currently inherited after a call of
clone(... CLONE_SIGHAND | CLONE_VM). But if sigaltstack is set by a
parent thread, and then if multiple cloned child threads (+ parent threads)
call signal handler at the same time, some threads may be conflicted -
because they share to use the same alternative signal stack region.
Finally they get sigsegv. It's an undesirable race condition. Note that
child threads created from NPTL pthread_create() also hit this conflict
when the parent thread uses sigaltstack, without my patch.
To fix this problem, this patch clears the child threads' sigaltstack
information like exec(). This behavior follows the SUSv3 specification.
In SUSv3, pthread_create() says "The alternate stack shall not be inherited
(when new threads are initialized)". It means that sigaltstack should be
cleared when sigaltstack memory space is shared by cloned threads with
CLONE_SIGHAND.
Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because:
- If clone_flags line is not existed, fork() does not inherit sigaltstack.
- CLONE_VM is another choice, but vfork() does not inherit sigaltstack.
- CLONE_SIGHAND implies CLONE_VM, and it looks suitable.
- CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM,
but this flag has a bit different semantics.
I decided to use CLONE_SIGHAND.
[ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ]
Signed-off-by: GOTO Masanori <gotom@sanori.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Linus Torvalds <torvalds@osdl.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch '[PATCH] RCU signal handling' [1] added an export for
__put_task_struct_cb, a put_task_struct helper newly introduced in that
patch. But the put_task_struct couldn't be used modular previously as
__put_task_struct wasn't exported. There are not callers of it in modular
code, and it shouldn't be exported because we don't want drivers to hold
references to task_structs.
This patch removes the export and folds __put_task_struct into
__put_task_struct_cb as there's no other caller.
[1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.
The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -
llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844
At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.
This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.
Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds new tunables for RCU queue and finished batches. There are
two types of controls - number of completed RCU updates invoked in a batch
(blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark,
qlowmark).
By default, the per-cpu batch limit is set to a small value. If the input
RCU rate exceeds the high watermark, we do two things - force quiescent
state on all cpus and set the batch limit of the CPU to INTMAX. Setting
batch limit to INTMAX forces all finished RCUs to be processed in one shot.
If we have more than INTMAX RCUs queued up, then we have bigger problems
anyway. Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Idle threads should have a sane ->timestamp value, to avoid init kernel
thread(s) from inheriting it and causing miscalculations in
try_to_wake_up().
Reported-by: Mike Galbraith <efault@gmx.de>.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a compiler barrier so that we don't read jiffies before updating
jiffies_64.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Also from Thomas Gleixner <tglx@linutronix.de>
Function next_timer_interrupt() got broken with a recent patch
6ba1b91213 as sys_nanosleep() was moved to
hrtimer. This broke things as next_timer_interrupt() did not check hrtimer
tree for next event.
Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
VST) implementations, as the system can be in idle when next hrtimer event
was supposed to happen. At least ARM and S390 currently use
next_timer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Just to be safe, we should not trigger a conditional reschedule during
the early boot sequence. We've historically done some questionable
early on, and the safety warnings in __might_sleep() are generally
turned off during that period, so there might be problems lurking.
This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to
cause a voluntary conditional reschedule.
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some platforms readq performs additional work to make sure I/O is done
in a coherent way. This is not needed for time retrieval as done by the
time interpolator. So we can use readq_relaxed instead which will improve
performance.
It affects sparc64 and ia64 only. Apparently it makes a significant
difference on ia64.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
acpi_video_flags variable is unsigned long, so it should be set as such.
This actually matters on x86-64.
Signed-off-by: Stefan Seyfried <seife@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow sysadmin to disable all warnings about userland apps
making unaligned accesses by using:
# echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap
Rather than having to use prctl on a process by process basis.
Default behaivour leaves the warnings enabled.
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
We have several points in the SCSI stack (primarily for our device
functions) where we need to guarantee process context, but (given the
place where the last reference was released) we cannot guarantee this.
This API gets around the issue by executing the function directly if
the caller has process context, but scheduling a workqueue to execute
in process context if the caller doesn't have it.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
In daemonize() a new thread gets cleaned up and 'merged' with init_task.
The current fs_struct is handled there, but not the current namespace.
This adds the namespace part.
[ Eric Biederman pointed out the namespace wrappers, and also notes that
we can't ever count on using our parents namespace because we already
have called exit_fs(), which is the only way to the namespace from a
process. ]
Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The compat syscalls are added to sys_ni.c since they are not defined if the
above CONFIG options are off. Also, nfs would not build with CONFIG_SYSCTL
off.
Noticed by Arthur Othieno.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Luke Yang <luke.adi@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, acpi video options can only be set on kernel command line. That's
little inflexible; I'd like userland s2ram application that just works, and
modifying kernel command line according to whitelist is not fun. It is better
to just allow s2ram application to set video options just before suspend
(according to the whitelist).
This implements sysctl to allow setting suspend video options without reboot.
(akpm: Documentation updates for this new sysctl are pending..)
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
audit_log_exit() is called from atomic contexts and gets explicit
gfp_mask argument; it should use it for all allocations rather
than doing some with gfp_mask and some with GFP_KERNEL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Restore the compatibility with the older code and make it possible to
suspend if the kernel command line doesn't contain the "resume=" argument
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
The boot sequence on s390 sometimes takes ages and we spend a very long
time (up to one or two minutes) in calibrate_migration_costs. The time
spent there differs from boot to boot. Also the calculated costs differ
a lot. I've seen differences by up to a factor of 15 (yes, factor not
percent). Also I doubt that making these measurements make much sense on
a completely virtualized architecture where you cannot tell how much cpu
time you will get anyway.
So introduce the CONFIG_DEFAULT_MIGRATION_COST method for an architecture
to set the scheduler migration costs. This turns off automatic detection
of migration costs. Makes sense on virtual platforms, where migration
costs are hard to measure accurately.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This provides an interface for arch code to find out how many
nanoseconds are going to be added on to xtime by the next call to
do_timer. The value returned is a fixed-point number in 52.12 format
in nanoseconds. The reason for this format is that it gives the
full precision that the timekeeping code is using internally.
The motivation for this is to fix a problem that has arisen on 32-bit
powerpc in that the value returned by do_gettimeofday drifts apart
from xtime if NTP is being used. PowerPC is now using a lockless
do_gettimeofday based on reading the timebase register and performing
some simple arithmetic. (This method of getting the time is also
exported to userspace via the VDSO.) However, the factor and offset
it uses were calculated based on the nominal tick length and weren't
being adjusted when NTP varied the tick length.
Note that 64-bit powerpc has had the lockless do_gettimeofday for a
long time now. It also had an extremely hairy routine that got called
from the 32-bit compat routine for adjtimex, which adjusted the
factor and offset according to what it thought the timekeeping code
was going to do. Not only was this only called if a 32-bit task did
adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
duplicating computations from kernel/timer.c and it wasn't clear that
it was (still) correct.
The simple solution is to ask the timekeeping code how long the
current jiffy will be on each timer interrupt, after calling
do_timer. If this jiffy will be a different length from the last one,
we then need to compute new values for the factor and offset used in
the lockless do_gettimeofday. In this way we can keep xtime and
do_gettimeofday in sync, even when NTP is varying the tick length.
Note that when adjtimex varies the tick length, it almost always
introduces the variation from the next tick on. The only case I could
see where adjtimex would vary the length of the current tick is when
an old-style adjtime adjustment is being cancelled. (It's not clear
to me why the adjustment has to be cancelled immediately rather than
from the next tick on.) Thus I don't see any real need for a hook in
adjtimex; the rare case of an old-style adjustment being cancelled can
be fixed up at the next tick.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
installation it's easiest if it's a boot time option.
Also I moved the variable to a more appropiate place and make
it independent from sysctl
And marked __read_mostly which it is.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I get about 88 squillion of these when suspending an old ad450nx server.
Cc: Pavel Roskin <proski@gnu.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a latent bug in cpuset_exit() handling. If a task tried to allocate
memory after calling cpuset_exit(), it oops'd in
cpuset_update_task_memory_state() on a NULL cpuset pointer.
So set the exiting tasks cpuset to the root cpuset instead of to NULL.
A distro kernel hit this with an added kernel package that had just such a
hook (allocating memory) in the exit code path.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. The tracee can go from ptrace_stop() to do_signal_stop()
after __ptrace_unlink(p).
2. It is unsafe to __ptrace_unlink(p) while p->parent may wait
for tasklist_lock in ptrace_detach().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
copy_process:
attach_pid(p, PIDTYPE_PID, p->pid);
attach_pid(p, PIDTYPE_TGID, p->tgid);
What if kill_proc_info(p->pid) happens in between?
copy_process() holds current->sighand.siglock, so we are safe
in CLONE_THREAD case, because current->sighand == p->sighand.
Otherwise, p->sighand is unlocked, the new process is already
visible to the find_task_by_pid(), but have a copy of parent's
'struct pid' in ->pids[PIDTYPE_TGID].
This means that __group_complete_signal() may hang while doing
do ... while (next_thread() != p)
We can solve this problem if we reverse these 2 attach_pid()s:
attach_pid() does wmb()
group_send_sig_info() calls spin_lock(), which
provides a read barrier. // Yes ?
I don't think we can hit this race in practice, but still.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a window after copy_process() unlocks ->sighand.siglock
and before it adds the new thread to the thread list.
In that window __group_complete_signal(SIGKILL) will not see the
new thread yet, so this thread will start running while the whole
thread group was supposed to exit.
I beleive we have another good reason to place attach_pid(PID/TGID)
under ->sighand.siglock. We can do the same for
release_task()->__unhash_process()
de_thread()->switch_exec_pids()
After that we don't need tasklist_lock to iterate over the thread
list, and we can simplify things, see for example do_sigaction()
or sys_times().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CONFIG_TIME_LOW_RES is a temporary way for architectures to signal that
they simply return xtime in do_gettimeoffset(). In this corner-case we
want to round up by resolution when starting a relative timer, to avoid
short timeouts. This will go away with the GTOD framework.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6:
[PATCH] sched: filter affine wakeups
Apparently caused more than 10% performance regression for aim7 benchmark.
The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144
disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who
supplied benchmark results).
The problem is, for aim7, the wake-up pattern is random, but it still needs
load balancing action in the wake-up path to achieve best performance. With
the above commit, lack of load balancing hurts that workload.
However, for workloads like database transaction processing, the requirement
is exactly opposite. In the wake up path, best performance is achieved with
absolutely zero load balancing. We simply wake up the process on the CPU that
it was previously run. Worst performance is obtained when we do load
balancing at wake up.
There isn't an easy way to auto detect the workload characteristics. Ingo's
earlier patch that detects idle CPU and decide whether to load balance or not
doesn't perform with aim7 either since all CPUs are busy (it causes even
bigger perf. regression).
Revert commit d7102e95b7, which causes more
than 10% performance regression with aim7.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PageCompound check before access_process_vm's set_page_dirty_lock is no
longer necessary, so remove it. But leave the PageCompound checks in
bio_set_pages_dirty, dio_bio_complete and nfs_free_user_pages: at least some
of those were introduced as a little optimization on hugetlb pages.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When panic_timeout is zero, suppress triggering a nested panic due to soft
lockup detection.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I don't think the code is quite ready, which is why I asked for Peter's
additions to also be merged before I acked it (although it turned out that
it still isn't quite ready with his additions either).
Basically I have had similar observations to Suresh in that it does not
play nicely with the rest of the balancing infrastructure (and raised
similar concerns in my review).
The samples (group of 4) I got for "maximum recorded imbalance" on a 2x2
SMP+HT Xeon are as follows:
| Following boot | hackbench 20 | hackbench 40
-----------+----------------+---------------------+---------------------
2.6.16-rc2 | 30,37,100,112 | 5600,5530,6020,6090 | 6390,7090,8760,8470
+nosmpnice | 3, 2, 4, 2 | 28, 150, 294, 132 | 348, 348, 294, 347
Hackbench raw performance is down around 15% with smpnice (but that in
itself isn't a huge deal because it is just a benchmark). However, the
samples show that the imbalance passed into move_tasks is increased by
about a factor of 10-30. I think this would also go some way to explaining
latency blips turning up in the balancing code (though I haven't actually
measured that).
We'll probably have to revert this in the SUSE kernel.
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: "Martin J. Bligh" <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes all self references and fixes references to files
in the now defunct arch/ppc64 tree. I think this accomplises
everything wanted, though there might be a few references I missed.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Pointed out by Linus Torvalds.
sys_signal() forgets to initialize ->sa_mask.
( I suspect arch/ia64/ia32/ia32_signal.c:sys32_signal()
also needs this fix )
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A bunch of asm/bug.h includes are both not needed (since it will get
pulled anyway) and bogus (since they are done too early). Removed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the file descriptor structure is being shared, allocate a new one and copy
information from the current, shared, structure.
Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If vm structure is being shared, allocate a new one and copy information from
the current, shared, structure.
Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the namespace structure is being shared, allocate a new one and copy
information from the current, shared, structure.
Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If filesystem structure is being shared, allocate a new one and copy
information from the current, shared, structure.
Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_unshare system call handler function accepts the same flags as clone
system call, checks constraints on each of the flags and invokes corresponding
unshare functions to disassociate respective process context if it was being
shared with another task.
Here is the link to a program for testing unshare system call.
http://prdownloads.sourceforge.net/audit/unshare_test.c?download
Please note that because of a problem in rmdir associated with bind mounts and
clone with CLONE_NEWNS, the test fails while trying to remove temporary test
directory. You can remove that temporary directory by doing rmdir, twice,
from the command line. The first will fail with EBUSY, but the second will
succeed. I have reported the problem to Ram Pai and Al Viro with a small
program which reproduces the problem. Al told us yesterday that he will be
looking at the problem soon. I have tried multiple rmdirs from the
unshare_test program itself, but for some reason that is not working. Doing
two rmdirs from command line does seem to remove the directory.
Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix compilation problem in PM headers.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Remove unneeded bio_get() which would cause a bio leak
- Writing doesn't dirty pages. Reading dirties pages.
- We should dirty the pages after the IO completion, not before
(Busy-waiting for disk I/O completion isn't very polite.)
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It may suck something awful, but it shouldn't taint the kernel.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
migration_cost prints after every CPU hotplug event. Make it print only
once at boot.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.
As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().
(The above only applies to users of asm-generic/percpu.h. powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Five callsites. I dunno how all this crap got back in there :(
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kernel/cpuset.c:644:38: warning: non-ANSI function declaration of function 'cpuset_update_task_memory_state'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- In case of a negative nsec value the result of the division must be
normalized.
- Remove inline from an exported function.
Signed-off-by: George Anzinger <george@wildturkeyranch.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When one module exports a function symbol and another module uses that
symbol then kallsyms shows the symbol twice. Once from the consumer with a
type of 'U' and once from the provider with a type of 't' or 'T'. On most
architectures, both entries have the same address so it does not matter
which one is returned by kallsyms_lookup_name(). But on architectures with
function descriptors, the 'U' entry points to the descriptor, not to the
code body, which is not what we want.
IA64 # grep -w qla2x00_remove_one /proc/kallsyms
a000000208c25ef8 U qla2x00_remove_one [qla2300] <= descriptor
a000000208bf44c0 t qla2x00_remove_one [qla2xxx] <= function body
Tell kallsyms_lookup_name() to ignore type U entries in modules.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When two function-return probes are inserted on kfree()[1] and the second
on say, sys_link()[2], and later [2] is unregistered, we have a deadlock as
kfree is called with the kretprobe_lock held and the function-return probe
on kfree will also try to grab the same lock.
However, we can move the kfree() during unregistration to outside the
spinlock as we are sure that no instances from the free list will be used
after synchronized_sched() returns during the unregistration process.
Thanks to Masami Hiramatsu for spotting this.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kernel/kprobes.c:353: warning: 'pre_handler_kretprobe' defined but not used
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently the zone_reclaim code has a fixed window of 30 seconds of off node
allocations should a local zone have no unused pagecache pages left. Reclaim
will be attempted again after this timeout period to avoid repeated useless
scans for memory. This is also useful to established sufficiently large off
node allocation chunks to relieve the local node.
It may be beneficial to adjust that time period for some special situations.
For example if memory use was exceeding node capacity one may want to give up
for longer periods of time. If memory spikes intermittendly then one may want
to shorten the time period to reduce the number of off node allocations.
This patch allows just that....
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- If we only reclaim nr_pages then its okay to stay on node.
Switch from > to >= for the comparison.
- vm_table[] entry for zone_reclaim_mode is a bit screwed up.
- Add empty lines around shrink_zone to show that this is the
central function to be called.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Prevent the kernel from setting the log level to 10 unconditionally during
suspend/resume which was needed in the past for debugging, but generally is
undesirable.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change sched_getaffinity() so that it returns a bitmap that indicates the
legally schedulable cpus that a task is allowed to run on.
Without this patch, if CONFIG_HOTPLUG_CPU is enabled, sched_getaffinity()
unconditionally returns (at least on IA64) a mask with NR_CPUS bits set.
This conveys no useful infornmation except for a kernel compile option.
This fixes a breakage we obseved running recent kernels. We have MPI jobs
that use sched_getaffinity() to determine where to place their threads.
Placing them on non-existant cpus is problematic :-)
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This function is neither used nor has any real contents.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The expiry time for relative timers with SIGEV_NONE set was never
updated to the correct value.
Pointed out by George Anzinger.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At some point we added credits to people who actively helped to bring
k/hr-timers along. This was lost in the big code revamp. Add it back.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up the interface to hrtimers by changing the init code to pass the mode
as well as the clock. This allow the init code to select the correct base and
eliminates extra timer re-init code in posix-timers. We also simplify the
restart interface nanosleep use.
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Steven Rostedtrostedt@goodmis.org <rostedt@goodmis.org>
CPU0 expires a posix-timer and runs the callback function. The signal is
queued.
After releasing the posix-timer lock and before returning to hrtimer_run_queue
CPU0 gets interrupted. CPU1 delivers the queued signal and rearms the timer.
CPU0 comes back to hrtimer_run_queue and sets the timer state to expired.
The next modification of the timer can result in an oops, because the state
information is wrong.
Keep track of state = RUNNING and check if the state has been in the return
path of hrtimer_run_queue. In case the state has been changed, ignore a
restart request and do not touch the state variable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This resolves bugzilla bug#5617. The oldvalue of the timer was read after the
timer was cancelled, so the remaining time was always zero.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixup the conversion of posix-timers to hrtimers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The itimer conversion removed the locking which protects the timer and
variables in the shared signal structure. Steven Rostedt found the problem in
the latest -rt patches.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make swsusp use bytes as the image size units, which is needed for future
compatibility.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I get storms of warnings from local_bh_enable(). Better-tested patches,
please.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
rcu_torture_lock is used in a softirq-unsafe manner, but it is also
taken by rcu_torture_cb(), which may execute in softirq-context,
resulting in potential deadlocks.
The fix is to acquire rcu_torture_lock in a softirq-safe manner. With
this fix applied, the rcu-torture code passes validation.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RCU task-struct freeing can call free_uid(), which is taking
uidhash_lock - while other users of uidhash_lock are softirq-unsafe.
The fix is to always take the uidhash_spinlock in a softirq-safe manner.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reduces the amount of time the migration cost calculations cost
during bootup. Based on numbers by Tony Luck <tony.luck@intel.com>.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
EDAC requires a way to scrub memory if an ECC error is found and the chipset
does not do the work automatically. That means rewriting memory locations
atomically with respect to all CPUs _and_ bus masters. That means we can't
use atomic_add(foo, 0) as it gets optimised for non-SMP
This adds a function to include/asm-foo/atomic.h for the platforms currently
supported which implements a scrub of a mapped block.
It also adjusts a few other files include order where atomic.h is included
before types.h as this now causes an error as atomic_scrub uses u32.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The TIF_RESTORE_SIGMASK flag allows us to have a generic implementation of
sys_rt_sigsuspend() instead of duplicating it for each architecture. This
provides such an implementation and makes arch/powerpc use it.
It also tidies up the ppc32 sys_sigsuspend() to use TIF_RESTORE_SIGMASK.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, a negative policy argument passed into the
'sys_sched_setscheduler()' system call, will return with success. However,
the manpage for 'sys_sched_setscheduler' says:
EINVAL The scheduling policy is not one of the recognized policies, or the
parameter p does not make sense for the policy.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>