SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
the arches. It would be great if it could be the default so that we can get
rid of various forms of DISCONTIG and other variations on memory maps. So far
what has hindered this are the additional lookups that SPARSEMEM introduces
for virt_to_page and page_address. This goes so far that the code to do this
has to be kept in a separate function and cannot be used inline.
This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
is mapped into a virtually contigious area, only the active sections are
physically backed. This allows virt_to_page page_address and cohorts become
simple shift/add operations. No page flag fields, no table lookups, nothing
involving memory is required.
The two key operations pfn_to_page and page_to_page become:
#define __pfn_to_page(pfn) (vmemmap + (pfn))
#define __page_to_pfn(page) ((page) - vmemmap)
By having a virtual mapping for the memmap we allow simple access without
wasting physical memory. As kernel memory is typically already mapped 1:1
this introduces no additional overhead. The virtual mapping must be big
enough to allow a struct page to be allocated and mapped for all valid
physical pages. This vill make a virtual memmap difficult to use on 32 bit
platforms that support 36 address bits.
However, if there is enough virtual space available and the arch already maps
its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this
technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
FLATMEM needs to read the contents of the mem_map variable to get the start of
the memmap and then add the offset to the required entry. vmemmap is a
constant to which we can simply add the offset.
This patch has the potential to allow us to make SPARSMEM the default (and
even the only) option for most systems. It should be optimal on UP, SMP and
NUMA on most platforms. Then we may even be able to remove the other memory
models: FLATMEM, DISCONTIG etc.
[apw@shadowen.org: config cleanups, resplit code etc]
[kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
[apw@shadowen.org: vmemmap: remove excess debugging]
[apw@shadowen.org: simplify initialisation code and reduce duplication]
[apw@shadowen.org: pull out the vmemmap code into its own file]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 4665079cbb ("[NETNS]: Move some
code into __init section when CONFIG_NET_NS=n") we got a new section -
.exit.text.refok (more of 'let's tell modpost that some bogus calls are
not bogus', a-la text.init.refok).
Unfortunately, the commit in question forgot to add it to TEXT_TEXT,
with rather amusing results.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the PCI layer properly handling legacy IDE and the kernel now using
it these can go
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Commit f629307c85 introduced uses of
kernel_termios_to_user_termios_1 and user_termios_to_kernel_termios_1
on all architectures. However, powerpc, s390, avr32 and frv don't
currently define those functions since their termios struct didn't
need to be changed when the arbitrary baud rate stuff was added, and
thus the kernel won't currently build on those architectures.
This adds definitions of kernel_termios_to_user_termios_1 and
user_termios_to_kernel_termios_1 to include/asm-generic/termios.h
which are identical to kernel_termios_to_user_termios and
user_termios_to_kernel_termios respectively. The definitions are the
same because the "old" termios and "new" termios are in fact the same
on these architectures (which are the same ones that use
asm-generic/termios.h).
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Cox <alan@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are some parts of include/asm-generic/pgtable.h that are relevant to
the non-mmu architectures. To make it easier to include this from them I
would like to ifdef the relevant parts.
Without this there is a handful of functions that are referenced in here
that are not defined on many non-mmu architectures. They could be defined
out of course, as an alternative approach.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan noticed that the new WARN_ON() semantics that were
introduced by commit 684f978347 (to also
return the value to be warned on) didn't compile when given a bitfield,
because the typeof doesn't work for bitfields.
So instead of the typeof trick, use an "int" variable together with a
"!!(x)" expression, as suggested by Al Viro.
To make matters more interesting, Paul Mackerras points out that that is
sub-optimal on Power, but the old asm-coded comparison seems to be buggy
anyway on 32-bit Power if the conditional was 64-bit, so I think there
are more problems there.
Regardless, the new WARN_ON() semantics may have been a bad idea. But
this at least avoids the more serious complications.
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use "__val" rather than "val" in the __get_unaligned macro in
asm-generic/unaligned.h. This way gcc wont warn if you happen to also name
something in the same scope "val".
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the i386 linker script and the asm-generic macro it uses so that
ELF note sections with SHF_ALLOC set are linked into the kernel image along
with other read-only data. The PT_NOTE also points to their location.
This paves the way for putting useful build-time information into ELF notes
that can be found easily later in a kernel memory dump.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
per cpu data section contains two types of data. One set which is
exclusively accessed by the local cpu and the other set which is per cpu,
but also shared by remote cpus. In the current kernel, these two sets are
not clearely separated out. This can potentially cause the same data
cacheline shared between the two sets of data, which will result in
unnecessary bouncing of the cacheline between cpus.
One way to fix the problem is to cacheline align the remotely accessed per
cpu data, both at the beginning and at the end. Because of the padding at
both ends, this will likely cause some memory wastage and also the
interface to achieve this is not clean.
This patch:
Moves the remotely accessed per cpu data (which is currently marked
as ____cacheline_aligned_in_smp) into a different section, where all the data
elements are cacheline aligned. And as such, this differentiates the local
only data and remotely accessed data cleanly.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Verify that types would match for assignment (under sizeof, so we are safe from
side effects or any code actually getting generated), then explicitly cast
everywhere to the fixed-sized types. Kills a bunch of bogus warnings about
constants being truncated (gcc, sparse), finds a pile of endianness problems
hidden by old noise (sparse).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse now warns if one compares pointers with integers. However, there are
false positives, like:
fs/filesystems.c:72:2: warning: Using plain integer as NULL pointer
Every time BUG_ON(ptr) is used, ptr is checked against integer zero. Avoid
that and save ~70 false positives from allyesconfig run.
mentioned by Al.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody is using ptep_test_and_clear_dirty and ptep_clear_flush_dirty. Remove
the functions from all architectures.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last user of ptep_establish in mm/ is long gone. Remove the architecture
primitive as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The problem is as follows: in multi-threaded code (or more correctly: all
code using clone() with CLONE_FILES) we have a race when exec'ing.
thread #1 thread #2
fd=open()
fork + exec
fcntl(fd,F_SETFD,FD_CLOEXEC)
In some applications this can happen frequently. Take a web browser. One
thread opens a file and another thread starts, say, an external PDF viewer.
The result can even be a security issue if that open file descriptor
refers to a sensitive file and the external program can somehow be tricked
into using that descriptor.
Just adding O_CLOEXEC support to open() doesn't solve the whole set of
problems. There are other ways to create file descriptors (socket,
epoll_create, Unix domain socket transfer, etc). These can and should be
addressed separately though. open() is such an easy case that it makes not
much sense putting the fix off.
The test program:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#ifndef O_CLOEXEC
# define O_CLOEXEC 02000000
#endif
int
main (int argc, char *argv[])
{
int fd;
if (argc > 1)
{
fd = atol (argv[1]);
printf ("child: fd = %d\n", fd);
if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
{
puts ("file descriptor valid in child");
return 1;
}
return 0;
}
fd = open ("/proc/self/exe", O_RDONLY | O_CLOEXEC);
printf ("in parent: new fd = %d\n", fd);
char buf[20];
snprintf (buf, sizeof (buf), "%d", fd);
execl ("/proc/self/exe", argv[0], buf, NULL);
puts ("execl failed");
return 1;
}
[kyle@parisc-linux.org: parisc fix]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Continuing the work started in 411f0f3edc ...
This enables code with a dma path, that compiles away, to build without
requiring additional code factoring. It also prevents code that calls
dma_alloc_coherent and dma_free_coherent from linking whereas previously
the code would hit a BUG() at run time. Finally, it allows archs that set
!HAS_DMA to delete their asm/dma-mapping.h file.
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: John W. Linville <linville@tuxdriver.com>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: <geert@linux-m68k.org>
Cc: <zippel@linux-m68k.org>
Cc: <spyro@f2s.com>
Cc: <ysato@users.sourceforge.jp>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some changes done a while ago to avoid pounding on ptep_set_access_flags and
update_mmu_cache in some race situations break sun4c which requires
update_mmu_cache() to always be called on minor faults.
This patch reworks ptep_set_access_flags() semantics, implementations and
callers so that it's now responsible for returning whether an update is
necessary or not (basically whether the PTE actually changed). This allow
fixing the sparc implementation to always return 1 on sun4c.
[akpm@linux-foundation.org: fixes, cleanups]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Miller <davem@davemloft.net>
Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The RO_DATA section were hardcoded to a specific
alignment in include/asm-generic/vmlinux.h.
But for sparc64 this did not match the PAGE_SIZE.
Introduce a new section definition named:
RO_DATA that takes actual alignment as parameter.
RODATA are provided for backward compatibility.
On top of this avoid hardcoding alignment for
sparc64 in reset of the script
Fix is build-tested on sparc64 + x86_64.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Change the default printout message for WARN_ON() to say what it is, not
something else. I'm tired of having people get all aflutter about a warning.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Throughout the kernel there are a few legitimite references
to init or exit sections. Most of these are covered by the
patterns included in modpost but a few nees special attention.
To avoid hardcoding a lot of function names in modpost introduce
a marker so relevant function/data can be marked.
When modpost see a reference to a init/exit function from
a function/data marked no warning will be issued.
Idea from: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
* 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] Abnormal End of Processes
[PATCH] match audit name data
[PATCH] complete message queue auditing
[PATCH] audit inode for all xattr syscalls
[PATCH] initialize name osid
[PATCH] audit signal recipients
[PATCH] add SIGNAL syscall class (v3)
[PATCH] auditing ptrace
These files are almost all the same.
This patch could be made even simpler if we don't mind POLLREMOVE turning
up in a few architectures that didn't have it previously (which should be
OK as POLLREMOVE is not used anywhere in the current tree).
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the misspellings of "propogate", "writting" and (oh, the shame
:-) "kenrel" in the source tree.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
This series extena and standardises local_t operations on each architecture,
allowing a rich set of atomic operations to be done on per-cpu data with
minimal performance impact. On architectures where there seems to be no
difference between the SMP and UP operation (same memory barriers, same
LOCKing), local.h simply includes asm-generic/local.h, which removes
duplicated code from the current kernel tree.
This patch:
local_t: architecture independent extension
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
atomic_add_unless as inline. Remove system.h atomic.h circular dependency.
I agree (with Andi Kleen) this typeof is not needed and more error
prone. All the original atomic.h code that uses cmpxchg (which includes
the atomic_add_unless) uses defines instead of inline functions,
probably to circumvent a circular dependency between system.h and
atomic.h on powerpc (which my patch addresses). Therefore, it makes
sense to use inline functions that will provide type checking.
atomic_add_unless as inline. Remove system.h atomic.h circular dependency.
Digging into the FRV architecture shows me that it is also affected by
such a circular dependency. Here is the diff applying this against the
rest of my atomic.h patches.
It applies over the atomic.h standardization patches.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the die notifier handling to common code. Previous
various architectures had exactly the same code for it. Note that the new
code is compiled unconditionally, this should be understood as an appel to
the other architecture maintainer to implement support for it aswell (aka
sprinkling a notify_die or two in the proper place)
arm had a notifiy_die that did something totally different, I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used at. avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Three cleanups:
1: ELF notes are never mapped, so there's no need to have any access
flags in their phdr.
2: When generating them from asm, tell the assembler to use a SHT_NOTE
section type. There doesn't seem to be a way to do this from C.
3: Use ANSI rather than traditional cpp behaviour to stringify the
macro argument.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Add hooks to allow a paravirt implementation to track the lifetime of
an mm. Paravirtualization requires three hooks, but only two are
needed in common code. They are:
arch_dup_mmap, which is called when a new mmap is created at fork
arch_exit_mmap, which is called when the last process reference to an
mm is dropped, which typically happens on exit and exec.
The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures. It's called when an mm is first used.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Allocating PDA and GDT at boot is a pain. Using simple per-cpu variables adds
happiness (although we need the GDT page-aligned for Xen, which we do in a
followup patch).
[akpm@linux-foundation.org: build fix]
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The page_test_and_clear_dirty primitive really consists of two
operations, page_test_dirty and the page_clear_dirty. The combination
of the two is not an atomic operation, so it makes more sense to have
two separate operations instead of one.
In addition to the improved readability of the s390 version of
SetPageUptodate, it now avoids the page_test_dirty operation which is
an insert-storage-key-extended (iske) instruction which is an expensive
operation.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Here is the current version of the 64 bit divide common code.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since lazy MMU batching mode still allows interrupts to enter, it is
possible for interrupt handlers to try to use kmap_atomic, which fails when
lazy mode is active, since the PTE update to highmem will be delayed. The
best workaround is to issue an explicit flush in kmap_atomic_functions
case; this is the only way nested PTE updates can happen in the interrupt
handler.
Thanks to Jeremy Fitzhardinge for noting the bug and suggestions on a fix.
This patch gets reverted again when we start 2.6.22 and the bug gets fixed
differently.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 39d61db0ed.
The commit was buggy in multiple ways:
- the conversion to ilog2() was incorrect to begin with
- it tested the wrong #defines, so on all architectures but FRV you'd
never see the bug except for constant arguments.
- the new "get_order()" macro used its arguments multiple times, and
didn't even parenthesize them properly
- despite the comments, it was not true that you could use it for
constant initializers, since not all architectures even use the
generic page.h header file.
All of the problems are individually fixable, but it all boils down to:
better just revert it, and re-do it from scratch.
Cc: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VMI ROM has a mode where hypercalls can be queued and batched. This turns
out to be a significant win during context switch, but must be done at a
specific point before side effects to CPU state are visible to subsequent
instructions. This is similar to the MMU batching hooks already provided.
The same hooks could be used by the Xen backend to implement a context switch
multicall.
To explain a bit more about lazy modes in the paravirt patches, basically, the
idea is that only one of lazy CPU or MMU mode can be active at any given time.
Lazy MMU mode is similar to this lazy CPU mode, and allows for batching of
multiple PTE updates (say, inside a remap loop), but to avoid keeping some
kind of state machine about when to flush cpu or mmu updates, we just allow
one or the other to be active. Although there is no real reason a more
comprehensive scheme could not be implemented, there is also no demonstrated
need for this extra complexity.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
This defines a simple and minimalist programming interface for GPIO APIs:
- Documentation/gpio.txt ... describes things (read it)
- include/asm-arm/gpio.h ... defines the ARM hook, which just punts
to <asm/arch/gpio.h> for any implementation
- include/asm-generic/gpio.h ... implement "can sleep" variants as calling
the normal ones, for systems that don't handle i2c expanders.
The immediate need for such a cross-architecture API convention is to support
drivers that work the same on AT91 ARM and AVR32 AP7000 chips, which embed many
of the same controllers but have different CPUs. However, several other users
have been reported, including a driver for a hardware watchdog chip and some
handhelds.org multi-CPU button drivers.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9ac7849e35 causes this on S390:
drivers/built-in.o: In function `dmam_noncoherent_release':
dma-mapping.c:(.text+0x1515c): undefined reference to `dma_free_noncoherent'
drivers/built-in.o: In function `dmam_free_noncoherent':
undefined reference to `dma_free_noncoherent'
drivers/built-in.o: In function `dmam_alloc_noncoherent':
undefined reference to `dma_alloc_noncoherent'
make: *** [.tmp_vmlinux1] Error 1
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_SPARSEMEM=y:
mm/rmap.c:579: warning: format '%lx' expects type 'long unsigned int', but argument 2 has type 'int'
Make __page_to_pfn() return unsigned long.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the Maple board, the AMD8111 IDE is in legacy mode... except that it
appears on IRQ 20 instead of IRQ 15. For drivers/ide this was handled by
the architecture's "pci_get_legacy_ide_irq()" function, but in libata we
just hard-code the numbers 14 and 15.
This patch provides asm-powerpc/libata-portmap.h which maps the IRQ as
appropriate, having added a pci_dev argument to the
ATA_{PRIM,SECOND}ARY_IRQ macros.
There's probably a better way to do this -- especially if we observe
that the _only_ case in which this seemingly-generic
"pci_get_legacy_ide_irq()" function returns anything other than 14 and
15 for primary and secondary respectively is the case of the AMD8111 on
the Maple board -- couldn't we handle that with a special case in the
pata_amd driver, or perhaps with a PCI quirk for Maple to switch it into
native mode during early boot and assign resources properly?
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
WARN_ON() ever triggering is a kernel bug. Do not try to paper over this
fact by suggesting to the user that this is 'only' a warning, as the
following recent commit does:
commit 30e25b71e7
Author: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Fri Dec 8 02:36:24 2006 -0800
[PATCH] Fix generic WARN_ON message
A warning is a warning, not a BUG.
( it might make sense to rename BUG() to CRASH() and BUG_ON() to
CRASH_ON(), but that does not change the fact that WARN_ON()
signals a kernel bug. )
i and others objected to this change during lkml review:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116115160710533&w=2
still the change slipped upstream - grumble :)
Also, use the standard "BUG: " format to make it easier to grep logs and
to make it easier to google for kernel bugs.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is designed to fix:
- Disk eating corruptor on KT7 after resume from RAM
- VIA IRQ handling
- VIA fixups for bus lockups after resume from RAM
The core of this is to add a table of resume fixups run at resume time.
We need to do this for a variety of boards and features, but particularly
we need to do this to get various critical VIA fixups done on resume.
The second part of the problem is to handle VIA IRQ number rules which
are a bit odd and need special handling for PIC interrupts. Various
patches broke various boxes and while this one may not be perfect
(hopefully it is) it ensures the workaround is applied to the right
devices only.
From: Jean Delvare <khali@linux-fr.org>
Now that PCI quirks are replayed on software resume, we can safely
re-enable the Asus SMBus unhiding quirk even when software suspend support
is enabled.
[akpm@osdl.org: fix const warning]
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It has caused more problems than it ever really solved, and is
apparently not getting cleaned up and fixed. We can put it back when
it's stable and isn't likely to make warning or bug events worse.
In the meantime, enable frame pointers for more readable stack traces.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should not initialize rootfs before all the core initializers have
run. So do it as a separate stage just before starting the regular
driver initializers.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the grungy swap all the occurrences in the right places patch that
goes with the updates. At this point we have the same functionality as
before (except that sgttyb() returns speeds not zero) and are ready to
begin turning new stuff on providing nobody reports lots of bugs
If you are a tty driver author converting an out of tree driver the only
impact should be termios->ktermios name changes for the speed/property
setting functions from your upper layers.
If you are implementing your own TCGETS function before then your driver
was broken already and its about to get a whole lot more painful for you so
please fix it 8)
Also fill in c_ispeed/ospeed on init for most devices, although the current
code will do this for you anyway but I'd like eventually to lose that extra
paranoia
[akpm@osdl.org: bluetooth fix]
[mp3@de.ibm.com: sclp fix]
[mp3@de.ibm.com: warning fix for tty3270]
[hugh@veritas.com: fix tty_ioctl powerpc build]
[jdike@addtoit.com: uml: fix ->set_termios declaration]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@de.ibm.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alter get_order() so that it can make use of ilog2() on a constant to produce
a constant value, retaining the ability for an arch to override it in the
non-const case.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A warning is a warning, not a BUG.
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds common handling for kernel BUGs, for use by architectures as
they wish. The code is derived from arch/powerpc.
The advantages of having common BUG handling are:
- consistent BUG reporting across architectures
- shared implementation of out-of-line file/line data
- implement CONFIG_DEBUG_BUGVERBOSE consistently
This means that in inline impact of BUG is just the illegal instruction
itself, which is an improvement for i386 and x86-64.
A BUG is represented in the instruction stream as an illegal instruction,
which has file/line information associated with it. This extra information is
stored in the __bug_table section in the ELF file.
When the kernel gets an illegal instruction, it first confirms it might
possibly be from a BUG (ie, in kernel mode, the right illegal instruction).
It then calls report_bug(). This searches __bug_table for a matching
instruction pointer, and if found, prints the corresponding file/line
information. If report_bug() determines that it wasn't a BUG which caused the
trap, it returns BUG_TRAP_TYPE_NONE.
Some architectures (powerpc) implement WARN using the same mechanism; if the
illegal instruction was the result of a WARN, then report_bug(Q) returns
CONFIG_DEBUG_BUGVERBOSE; otherwise it returns BUG_TRAP_TYPE_BUG.
lib/bug.c keeps a list of loaded modules which can be searched for __bug_table
entries. The architecture must call
module_bug_finalize()/module_bug_cleanup() from its corresponding
module_finalize/cleanup functions.
Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount.
At the very least, filename and line information will not be recorded for each
but, but architectures may decide to store no extra information per BUG at
all.
Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so
architectures will generally have to include an infinite loop (or similar) in
the BUG code, so that gcc knows execution won't continue beyond that point.
gcc does have a __builtin_trap() operator which may be useful to achieve the
same effect, unfortunately it cannot be used to actually implement the BUG
itself, because there's no way to get the instruction's address for use in
generating the __bug_table entry.
[randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors]
[bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h]
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the contents of the userspace asm/setup.h header consistent on all
architectures:
- export setup.h to userspace on all architectures
- export only COMMAND_LINE_SIZE to userspace
- frv: move COMMAND_LINE_SIZE from param.h
- i386: remove duplicate COMMAND_LINE_SIZE from param.h
- arm:
- export ATAGs to userspace
- change u8/u16/u32 to __u8/__u16/__u32
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass struct dev pointer to dma_cache_sync()
dma_cache_sync() is ill-designed in that it does not have a struct device
pointer argument which makes proper support for systems that consist of a
mix of coherent and non-coherent DMA devices hard. Change dma_cache_sync
to take a struct device pointer as first argument and fix all its callers
to pass it.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
dma_is_consistent() is ill-designed in that it does not have a struct
device pointer argument which makes proper support for systems that consist
of a mix of coherent and non-coherent DMA devices hard. Change
dma_is_consistent to take a struct device pointer as first argument and fix
the sole caller to pass it.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce pagefault_{disable,enable}() and use these where previously we did
manual preempt increments/decrements to make the pagefault handler do the
atomic thing.
Currently they still rely on the increased preempt count, but do not rely on
the disabled preemption, this might go away in the future.
(NOTE: the extra barrier() in pagefault_disable might fix some holes on
machines which have too many registers for their own good)
[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The .eh_frame section contents is never written to, so it can as well
benefit from CONFIG_DEBUG_RODATA.
Diff-ed against firstfloor tree.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Ld knows about 2 kinds of symbols, absolute and section
relative. Section relative symbols symbols change value
when a section is moved and absolute symbols do not.
Currently in the linker script we have several labels
marking the beginning and ending of sections that
are outside of sections, making them absolute symbols.
Having a mixture of absolute and section relative
symbols refereing to the same data is currently harmless
but it is confusing.
This must be done carefully as newer revs of ld do not place
symbols that appear in sections without data and instead
ld makes those symbols global :(
My ultimate goal is to build a relocatable kernel. The
safest and least intrusive technique is to generate
relocation entries so the kernel can be relocated at load
time. The only penalty would be an increase in the size
of the kernel binary. The problem is that if absolute and
relocatable symbols are not properly specified absolute symbols
will be relocated or section relative symbols won't be, which
is fatal.
The practical motivation is that when generating kernels that
will run from a reserved area for analyzing what caused
a kernel panic, it is simpler if you don't need to hard code
the physical memory location they will run at, especially
for the distributions.
[AK: and merged:]
o Also put a message so that in future people can be aware of it and
avoid introducing absolute symbols.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Add arch specific dev_archdata to struct device
Adds an arch specific struct dev_arch to struct device. This enables
architecture to add specific fields to every device in the system, like
DMA operation pointers, NUMA node ID, firmware specific data, etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Andi Kleen <ak@suse.de>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is a quick hack to overcome the fact that SRCU currently does not
allow static initializers, and we need to sometimes initialize those
things before any other initializers (even "core" ones) can do so.
Currently we don't allow this at all for modules, and the only user that
needs is right now is cpufreq. As reported by Thomas Gleixner:
"Commit b4dfdbb3c7 ("[PATCH] cpufreq:
make the transition_notifier chain use SRCU breaks cpu frequency
notification users, which register the callback > on core_init
level."
Cc: Thomas Gleixner <tglx@timesys.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>,
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The multithreaded-probing code has a problem: after one initcall level (eg,
core_initcall) has been processed, we will then start processing the next
level (postcore_initcall) while the kernel threads which are handling
core_initcall are still executing. This breaks the guarantees which the
layered initcalls previously gave us.
IOW, we want to be multithreaded _within_ an initcall level, but not between
different levels.
Fix that up by causing the probing code to wait for all outstanding probes at
one level to complete before we start processing the next level.
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a vmlinux.lds.h helper macro for defining the eight-level initcall table,
teach all the architectures to use it.
This is a prerequisite for a patch which performs initcall synchronisation for
multithreaded-probing.
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
[ Added AVR32 as well ]
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This changes the dwarf2 unwinder to do a binary search for CIEs
instead of a linear work. The linker is unfortunately not
able to build a proper lookup table at link time, instead it creates
one at runtime as soon as the bootmem allocator is usable (so you'll continue
using the linear lookup for the first [hopefully] few calls).
The code should be ready to utilize a build-time created table once
a fixed linker becomes available.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
In most cases the return value of WARN_ON() is ignored. If the generic
definition for the !CONFIG_BUG case is used this will result in a warning:
CC kernel/sched.o
In file included from include/linux/bio.h:25,
from include/linux/blkdev.h:14,
from kernel/sched.c:39:
include/linux/ioprio.h: In function âtask_ioprioâ:
include/linux/ioprio.h:50: warning: statement with no effect
kernel/sched.c: In function âcontext_switchâ:
kernel/sched.c:1834: warning: statement with no effect
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This likely profiling is pretty fun. I found a few possible problems
in sched.c.
This patch may be not measurable, but when I did measure long ago,
nooping (un)likely cost a couple of % on scheduler heavy benchmarks, so
it all adds up.
Tweak some branch hints:
- the 2nd 64 bits in the bitmask is likely to be populated, because it
contains the first 28 bits (nearly 3/4) of the normal priorities.
(ratio of 669669:691 ~= 1000:1).
- it isn't unlikely that context switching switches to another process. it
might be very rapidly switching to and from the idle process (ratio of
475815:419004 and 471330:423544). Let the branch predictor decide.
- preempt_enable seems to be very often called in a nested preempt_disable
or with interrupts disabled (ratio of 3567760:87965 ~= 40:1)
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trivial typo fix in the "syntax error if percpu macros are incorrectly
used" patch. I misspelled "identifier" in all places. D'Oh!
Thanks to Dirk Mueller to point this out.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tim and Ananiev report that the recent WARN_ON_ONCE changes cause increased
cache misses with the tbench workload. Apparently due to the access to the
newly-added static variable.
Rearrange the code so that we don't touch that variable unless the warning is
going to trigger.
Also rework the logic so that the static variable starts out at zero, so we
can move it into bss.
It would seem logical to mark the static variable as __read_mostly too. But
it would be wrong, because that would put it back into the vmlinux image, and
the kernel will never read from this variable in normal operation anyway.
Unless the compiler or hardware go and do some prefetching on us?
For some reason this patch shrinks softirq.o text by 40 bytes.
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Ananiev, Leonid I" <leonid.i.ananiev@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.
The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around. On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).
Where appropriate, an arch may override the generic storage facility and do
something different with the variable. On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.
Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions. Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller. A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.
I've build this code with allyesconfig for x86_64 and i386. I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.
This will affect all archs. Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:
struct pt_regs *old_regs = set_irq_regs(regs);
And put the old one back at the end:
set_irq_regs(old_regs);
Don't pass regs through to generic_handle_irq() or __do_IRQ().
In timer_interrupt(), this sort of change will be necessary:
- update_process_times(user_mode(regs));
- profile_tick(CPU_PROFILING, regs);
+ update_process_times(user_mode(get_irq_regs()));
+ profile_tick(CPU_PROFILING);
I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().
Some notes on the interrupt handling in the drivers:
(*) input_dev() is now gone entirely. The regs pointer is no longer stored in
the input_dev struct.
(*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
something different depending on whether it's been supplied with a regs
pointer or not.
(*) Various IRQ handler function pointers have been moved to type
irq_handler_t.
Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)
Many files include the filename at the beginning, serveral used a wrong one.
Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Now that ptep_establish has a definition in PAE i386 3-level paging code, the
only paging model which is insane enough to have multi-word hardware PTEs
which are not efficient to set atomically, we can remove the ghost of
set_pte_atomic from other architectures which falesly duplicated it, and
remove all knowledge of it from the generic pgtable code.
set_pte_atomic is now a private pte operator which is specific to i386
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement lazy MMU update hooks which are SMP safe for both direct and shadow
page tables. The idea is that PTE updates and page invalidations while in
lazy mode can be batched into a single hypercall. We use this in VMI for
shadow page table synchronization, and it is a win. It also can be used by
PPC and for direct page tables on Xen.
For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified. This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of. Doing this under the protection of the locks guarantees the
synchronization is correct, and also means that spurious faults which are
generated during this window by remote CPUs are properly handled, as the page
fault handler must re-check the PTE under protection of the same lock.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change pte_clear_full to a more appropriately named pte_clear_not_present,
allowing optimizations when not-present mapping changes need not be reflected
in the hardware TLB for protected page table modes. There is also another
case that can use it in the fremap code.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Letting WARN_ON/WARN_ON_ONCE return the condition means that you could do
if (WARN_ON(blah)) {
handle_impossible_case
}
Rather than
if (unlikely(blah)) {
WARN_ON(1)
handle_impossible_case
}
I checked all the newly added WARN_ON_ONCE users and none of them test the
return status so we can still change it.
[akpm@osdl.org: warning fix]
[akpm@osdl.org: build fix]
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The param section is an array of "kernel_param" structures, storing only
constant data: pointer to name, permission of the variable pointed to by
(void *)arg and pointers to set/get methods.
Move end_rodata down to include __param section in the read-only range used
by CONFIG_DEBUG_RODATA.
Signed-off-by: Marcelo Tosatti <marcelo@kvack.org>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Parsing generic pgtable.h in assembler is simply crazy. None of this file is
needed in assembler code, and C inline functions and structures routine break
one or more different compiles.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch will pack any .note.* section into a PT_NOTE segment in the output
file.
To do this, we tell ld that we need a PT_NOTE segment. This requires us to
start explicitly mapping sections to segments, so we also need to explicitly
create PT_LOAD segments for text and data, and map the sections to them
appropriately. Fortunately, each section will default to its previous
section's segment, so it doesn't take many changes to vmlinux.lds.S.
This only changes i386 for now, but I presume the corresponding changes for
other architectures will be as simple.
This change also adds <linux/elfnote.h>, which defines C and Assembler macros
for actually creating ELF notes.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of the changes necessary for shared page tables is to standardize the
pxx_page macros. pte_page and pmd_page have always returned the struct
page associated with their entry, while pte_page_kernel and pmd_page_kernel
have returned the kernel virtual address. pud_page and pgd_page, on the
other hand, return the kernel virtual address.
Shared page tables needs pud_page and pgd_page to return the actual page
structures. There are very few actual users of these functions, so it is
simple to standardize their usage.
Since this is basic cleanup, I am submitting these changes as a standalone
patch. Per Hugh Dickins' comments about it, I am also changing the
pxx_page_kernel macros to pxx_page_vaddr to clarify their meaning.
Signed-off-by: Dave McCracken <dmccr@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
get_cpu_var()/per_cpu()/__get_cpu_var() arguments must be simple
identifiers. Otherwise the arch dependent implementations might break.
This patch enforces the correct usage of the macros by producing a syntax
error if the variable is not a simple identifier.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
several targets have no ....at() family and m32r calls its only chown variant
chown32(), with __NR_chown being undefined. creat(2) is also absent in some
targets.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There's useful stuff in <linux/timex.h> but <asm/timex.h> has nothing for
userspace. Stop exporting it, and include it only from within the existing
#ifdef __KERNEL__ part of <linux/timex.h>
This fixes a 'make headers_check' failure on i386 because asm-i386/timex.h
includes both asm-i386/tsc.h and asm-i386/processor.h, neither of which are
exported to userspace. It's not entirely clear _why_ it includes either of
these, but it does.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kill host_set->next
Fix simplex support
Allow per platform setting of IDE legacy bases
Some of this can be tidied further later on, in particular all the
legacy port gunge belongs as a PCI quirk/PCI header decode to understand
the special legacy IDE rules in the PCI spec.
Longer term Jeff also wants to move the request_irq/free_irq out of core
which will make this even cleaner.
tj: folded in three followup patches - ata_piix-fix, broken-arch-fix
and fix-new-legacy-handling, and separated per-dev xfermask into
separate patch preceding this one. Folded in fixes are...
* ata_piix-fix: fix build failure due to host_set->next removal
* broken-arch-fix: add missing include/asm-*/libata-portmap.h
* fix-new-legacy-handling:
* In ata_pci_init_legacy_port(), probe_num was incorrectly
incremented during initialization of the secondary port and
probe_ent->n_ports was incorrectly fixed to 1.
* Both legacy ports ended up having the same hard_port_no.
* When printing port information, both legacy ports printed
the first irq.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Tejun Heo <htejun@gmail.com>
There's no excuse for userspace abusing this kernel header -- the kernel's
headers are not intended to provide a library of helper routines for
userspace. Using <asm/io.h> from userspace is broken on most architectures
anyway. Just say 'no'.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This isn't suitable for userspace to see -- the kernel headers are not a
random library of stuff for userspace; they're only there to define the
kernel<->user ABI for system libraries and tools. Anything which _was_
abusing asm/atomic.h from userspace was probably broken anyway -- as it often
didn't even give atomic operation.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove asm/irq.h from the exported headers -- there was never any good reason
for it to have been listed.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
Move workqueue exports to where the functions are defined.
[CPUFREQ] Misc cleanups in ondemand.
[CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path.
[CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
[CPUFREQ] Remove slowdown from ondemand sampling path.
Generic lock debugging:
- generalized lock debugging framework. For example, a bug in one lock
subsystem turns off debugging in all lock subsystems.
- got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
the mutex/rtmutex debugging code: it caused way too much prototype
hackery, and lockdep will give the same information anyway.
- ability to do silent tests
- check lock freeing in vfree too.
- more finegrained debugging options, to allow distributions to
turn off more expensive debugging features.
There's no separate 'held mutexes' list anymore - but there's a 'held locks'
stack within lockdep, which unifies deadlock detection across all lock
classes. (this is independent of the lockdep validation stuff - lockdep first
checks whether we are holding a lock already)
Here are the current debugging options:
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
which do:
config DEBUG_MUTEXES
bool "Mutex debugging, basic checks"
config DEBUG_LOCK_ALLOC
bool "Detect incorrect freeing of live mutexes"
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the per_cpu_offset() generic method. (used by the lock validator)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add __start_rodata and __end_rodata to sections.h to avoid extern
declarations. Needed by s390 code (see following patch).
[akpm@osdl.org: update architectures]
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined
sets of syscalls. Infrastructure, a couple of classes (with 32bit counterparts
for biarch targets) and actual tie-in on i386, amd64 and ia64.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove slowdown from ondemand sampling path. This reduces the code path length
in dbs_check_cpu() by half. slowdown was not used by ondemand by default.
If there are any user level tools that were using this tunable, they
may report error now.
Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL. These
will be used as a transition measure for symbols that aren't used in the
kernel and are on the way out. When a module uses such a symbol, a warning
is printk'd at modprobe time.
The main reason for removing unused exports is size: eacho export takes
roughly between 100 and 150 bytes of kernel space in the binary. This
patch gives users the option to immediately get this size gain via a config
option.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.
This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Considering that there isn't a lot of hw we can depend on during resume,
this is about as good as it gets.
This is x86-only for now, although the basic concept (and most of the
code) will certainly work on almost any platform.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have architectures where the size of page_to_pfn and pfn_to_page are
significant enough to overall image size that they wish to push them out of
line. However, in the process we have grown a second copy of the
implementation of each of these routines for each memory model. Share the
implmentation exposing it either inline or out-of-line as required.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.infradead.org/hdrcleanup-2.6: (63 commits)
[S390] __FD_foo definitions.
Switch to __s32 types in joystick.h instead of C99 types for consistency.
Add <sys/types.h> to headers included for userspace in <linux/input.h>
Move inclusion of <linux/compat.h> out of user scope in asm-x86_64/mtrr.h
Remove struct fddi_statistics from user view in <linux/if_fddi.h>
Move user-visible parts of drivers/s390/crypto/z90crypt.h to include/asm-s390
Revert include/media changes: Mauro says those ioctls are only used in-kernel(!)
Include <linux/types.h> and use __uXX types in <linux/cramfs_fs.h>
Use __uXX types in <linux/i2o_dev.h>, include <linux/ioctl.h> too
Remove private struct dx_hash_info from public view in <linux/ext3_fs.h>
Include <linux/types.h> and use __uXX types in <linux/affs_hardblocks.h>
Use __uXX types in <linux/divert.h> for struct divert_blk et al.
Use __u32 for elf_addr_t in <asm-powerpc/elf.h>, not u32. It's user-visible.
Remove PPP_FCS from user view in <linux/ppp_defs.h>, remove __P mess entirely
Use __uXX types in user-visible structures in <linux/nbd.h>
Don't use 'u32' in user-visible struct ip_conntrack_old_tuple.
Use __uXX types for S390 DASD volume label definitions which are user-visible
S390 BIODASDREADCMB ioctl should use __u64 not u64 type.
Remove unneeded inclusion of <linux/time.h> from <linux/ufs_fs.h>
Fix private integer types used in V4L2 ioctls.
...
Manually resolve conflict in include/linux/mtd/physmap.h
This adds the Kbuild files listing the files which are to be installed by
the 'headers_install' make target, in generic directories.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
If we move a mapping from one virtual address to another,
and this changes the virtual color of the mapping to those
pages, we can see corrupt data due to D-cache aliasing.
Check for and deal with this by overriding the move_pte()
macro. Set things up so that other platforms can cleanly
override the move_pte() macro too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn some macros into inline functions and add proper type checking as
well as being more readable. Also a minor comment adjustment.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
local_t's were defined to be unsigned. This increases confusion because
atomic_t's are signed. The patch goes through and changes all implementations
to use signed longs throughout.
Also, x86-64 was using 32-bit quantities for the value passed into local_add()
and local_sub(). Fixed.
All (actually, both) existing users have been audited.
(Also s/__inline__/inline/ in x86_64/local.h)
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that Christoph Lameter's atomic_long_t support is merged in mainline,
might as well convert asm-generic/local.h to use it, so the same code can
be used for both sizes of 32 and 64-bit unsigned longs.
akpm sayeth:
Q:
Is there any particular reason why these routines weren't simply
implemented with local_save/restore_flags, if they are only meant to
guarantee atomicity to the local cpu? I'm sure on most platforms this
would be more efficient than using an atomic...
A:
The whole _point_ of local_t is to avoid local_irq_disable(). It's
designed to exploit the fact that many CPUs can do incs and decs in a way
which is atomic wrt local interrupts, but not atomic wrt SMP.
But this patch makes sense, because asm-generic/local.h is just a fallback
implementation for architectures which either cannot perform these
local-irq-atomic operations, or its maintainers haven't yet got around to
implementing them.
We need more work done on local_t in the 2.6.17 timeframe - they're defined as
unsigned long, but some architectures implement them as signed long.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix: initialize the robust list(s) to NULL in copy_process.
- doc update
- cleanup: rename _inuser to _inatomic
- __user cleanups and other small cleanups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patchset provides a new (written from scratch) implementation of robust
futexes, called "lightweight robust futexes". We believe this new
implementation is faster and simpler than the vma-based robust futex solutions
presented before, and we'd like this patchset to be adopted in the upstream
kernel. This is version 1 of the patchset.
Background
----------
What are robust futexes? To answer that, we first need to understand what
futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having to
enter the kernel.
A futex is in essence a user-space address, e.g. a 32-bit lock variable
field. If userspace notices contention (the lock is already owned and someone
else wants to grab it too) then the lock is marked with a value that says
"there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to
wait for the other guy to release it. The kernel creates a 'futex queue'
internally, so that it can later on match up the waiter with the waker -
without them having to know about each other. When the owner thread releases
the futex, it notices (via the variable value) that there were waiter(s)
pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up. Once all
waiters have taken and released the lock, the futex is again back to
'uncontended' state, and there's no in-kernel state associated with it. The
kernel completely forgets that there ever was a futex at that address. This
method makes futexes very lightweight and scalable.
"Robustness" is about dealing with crashes while holding a lock: if a process
exits prematurely while holding a pthread_mutex_t lock that is also shared
with some other process (e.g. yum segfaults while holding a pthread_mutex_t,
or yum is kill -9-ed), then waiters for that lock need to be notified that the
last owner of the lock exited in some irregular way.
To solve such types of problems, "robust mutex" userspace APIs were created:
pthread_mutex_lock() returns an error value if the owner exits prematurely -
and the new owner can decide whether the data protected by the lock can be
recovered safely.
There is a big conceptual problem with futex based mutexes though: it is the
kernel that destroys the owner task (e.g. due to a SEGFAULT), but the kernel
cannot help with the cleanup: if there is no 'futex queue' (and in most cases
there is none, futexes being fast lightweight locks) then the kernel has no
information to clean up after the held lock! Userspace has no chance to clean
up after the lock either - userspace is the one that crashes, so it has no
opportunity to clean up. Catch-22.
In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot is
needed to release that futex based lock. This is one of the leading
bugreports against yum.
To solve this problem, 'Robust Futex' patches were created and presented on
lkml: the one written by Todd Kneisel and David Singleton is the most advanced
at the moment. These patches all tried to extend the futex abstraction by
registering futex-based locks in the kernel - and thus give the kernel a
chance to clean up.
E.g. in David Singleton's robust-futex-6.patch, there are 3 new syscall
variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER.
The kernel attaches such robust futexes to vmas (via
vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are
searched to see whether they have a robust_head set.
Lots of work went into the vma-based robust-futex patch, and recently it has
improved significantly, but unfortunately it still has two fundamental
problems left:
- they have quite complex locking and race scenarios. The vma-based
patches had been pending for years, but they are still not completely
reliable.
- they have to scan _every_ vma at sys_exit() time, per thread!
The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas every
pthread_exit() takes a millisecond or more, also totally destroying the CPU's
L1 and L2 caches!
This is very much noticeable even for normal process sys_exit_group() calls:
the kernel has to do the vma scanning unconditionally! (this is because the
kernel has no knowledge about how many robust futexes there are to be cleaned
up, because a robust futex might have been registered in another task, and the
futex variable might have been simply mmap()-ed into this process's address
space).
This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than
that: the overhead makes robust futexes impractical for any type of generic
Linux distribution.
So it became clear to us, something had to be done. Last week, when Thomas
Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he
found a handful of new races and we were talking about it and were analyzing
the situation. At that point a fundamentally different solution occured to
me. This patchset (written in the past couple of days) implements that new
solution. Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing. Parental advice
recommended ;-)
New approach to robust futexes
------------------------------
At the heart of this new approach there is a per-thread private list of robust
locks that userspace is holding (maintained by glibc) - which userspace list
is registered with the kernel via a new syscall [this registration happens at
most once per thread lifetime]. At do_exit() time, the kernel checks this
user-space list: are there any robust futex locks to be cleaned up?
In the common case, at do_exit() time, there is no list registered, so the
cost of robust futexes is just a simple current->robust_list != NULL
comparison. If the thread has registered a list, then normally the list is
empty. If the thread/process crashed or terminated in some incorrect way then
the list might be non-empty: in this case the kernel carefully walks the list
[not trusting it], and marks all locks that are owned by this thread with the
FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any).
The list is guaranteed to be private and per-thread, so it's lockless. There
is one race possible though: since adding to and removing from the list is
done after the futex is acquired by glibc, there is a few instructions window
for the thread (or process) to die there, leaving the futex hung. To protect
against this possibility, userspace (glibc) also maintains a simple per-thread
'list_op_pending' field, to allow the kernel to clean up if the thread dies
after acquiring the lock, but just before it could have added itself to the
list. Glibc sets this list_op_pending field before it tries to acquire the
futex, and clears it after the list-add (or list-remove) has finished.
That's all that is needed - all the rest of robust-futex cleanup is done in
userspace [just like with the previous patches].
Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes. (Ulrich plans to commit these
changes to glibc-HEAD later today.)
Key differences of this userspace-list based approach, compared to the vma
based method:
- it's much, much faster: at thread exit time, there's no need to loop
over every vma (!), which the VM-based method has to do. Only a very
simple 'is the list empty' op is done.
- no VM changes are needed - 'struct address_space' is left alone.
- no registration of individual locks is needed: robust mutexes dont need
any extra per-lock syscalls. Robust mutexes thus become a very lightweight
primitive - so they dont force the application designer to do a hard choice
between performance and robustness - robust mutexes are just as fast.
- no per-lock kernel allocation happens.
- no resource limits are needed.
- no kernel-space recovery call (FUTEX_RECOVER) is needed.
- the implementation and the locking is "obvious", and there are no
interactions with the VM.
Performance
-----------
I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:
- with FUTEX_WAIT set [contended mutex]: 130 msecs
- without FUTEX_WAIT set [uncontended mutex]: 30 msecs
I have also measured an approach where glibc does the lock notification [which
it currently does for !pshared robust mutexes], and that took 256 msecs -
clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do.
(1 million held locks are unheard of - we expect at most a handful of locks to
be held at a time. Nevertheless it's nice to know that this approach scales
nicely.)
Implementation details
----------------------
The patch adds two new syscalls: one to register the userspace list, and one
to query the registered list pointer:
asmlinkage long
sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
asmlinkage long
sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
size_t __user *len_ptr);
List registration is very fast: the pointer is simply stored in
current->robust_list. [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head for new
threads, without the need of another syscall.]
So there is virtually zero overhead for tasks not using robust futexes, and
even for robust futex users, there is only one extra syscall per thread
lifetime, and the cleanup operation, if it happens, is fast and
straightforward. The kernel doesnt have any internal distinction between
robust and normal futexes.
If a futex is found to be held at exit time, the kernel sets the highest bit
of the futex word:
#define FUTEX_OWNER_DIED 0x40000000
and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.
Otherwise, robust futexes are acquired by glibc by putting the TID into the
futex field atomically. Waiters set the FUTEX_WAITERS bit:
#define FUTEX_WAITERS 0x80000000
and the remaining bits are for the TID.
Testing, architecture support
-----------------------------
I've tested the new syscalls on x86 and x86_64, and have made sure the parsing
of the userspace list is robust [ ;-) ] even if the list is deliberately
corrupted.
i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the
new glibc code (on x86_64 and i386), and it works for his robust-mutex
testcases.
All other architectures should build just fine too - but they wont have the
new syscalls yet.
Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline
function before writing up the syscalls (that function returns -ENOSYS right
now).
This patch:
Add placeholder futex_atomic_cmpxchg_inuser() implementations to every
architecture that supports futexes. It returns -ENOSYS.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes zone_mem_map.
pfn_to_page uses pgdat, page_to_pfn uses zone. page_to_pfn can use pgdat
instead of zone, which is only one user of zone_mem_map. By modifing it,
we can remove zone_mem_map.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM.
Each arch has its own page_to_pfn(), pfn_to_page() for each models.
But most of them can use the same arithmetic.
This patch adds asm-generic/memory_model.h, which includes generic
page_to_pfn(), pfn_to_page() definitions for each memory model.
When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
used instead of macro. This is enabled by some archs and reduces
text size.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently include/asm-generic/bitops.h is not referenced from anywhere. But
it will be the benefit of those who are trying to port Linux to another
architecture.
So update it by same manner
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
int minix_test_and_set_bit(int nr, volatile unsigned long *addr);
int minix_set_bit(int nr, volatile unsigned long *addr);
int minix_test_and_clear_bit(int nr, volatile unsigned long *addr);
int minix_test_bit(int nr, const volatile unsigned long *addr);
unsigned long minix_find_first_zero_bit(const unsigned long *addr,
unsigned long size);
In include/asm-generic/bitops/minix.h
and include/asm-generic/bitops/minix-le.h
This code largely copied from: include/asm-sparc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
int ext2_set_bit_atomic(int nr, volatile unsigned long *addr);
int ext2_clear_bit_atomic(int nr, volatile unsigned long *addr);
In include/asm-generic/bitops/ext2-atomic.h
This code largely copied from: include/asm-sparc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
int ext2_set_bit(int nr, volatile unsigned long *addr);
int ext2_clear_bit(int nr, volatile unsigned long *addr);
int ext2_test_bit(int nr, const volatile unsigned long *addr);
unsigned long ext2_find_first_zero_bit(const unsigned long *addr,
unsigned long size);
unsinged long ext2_find_next_zero_bit(const unsigned long *addr,
unsigned long size);
In include/asm-generic/bitops/ext2-non-atomic.h
This code largely copied from:
include/asm-powerpc/bitops.h
include/asm-parisc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Build fix for s390 declare __u32 and __u64.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
unsigned int hweight32(unsigned int w);
unsigned int hweight16(unsigned int w);
unsigned int hweight8(unsigned int w);
unsigned long hweight64(__u64 w);
In include/asm-generic/bitops/hweight.h
This code largely copied from: include/linux/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: int ffs(int
x);
In include/asm-generic/bitops/ffs.h
This code largely copied from: include/linux/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: int
sched_find_first_bit(const unsigned long *b);
In include/asm-generic/bitops/sched.h
This code largely copied from: include/asm-powerpc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
unsigned logn find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset);
unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
unsigned long offset);
unsigned long find_first_zero_bit(const unsigned long *addr,
unsigned long size);
unsigned long find_first_bit(const unsigned long *addr, unsigned long size);
In include/asm-generic/bitops/find.h
This code largely copied from: arch/powerpc/lib/bitops.c
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: int
fls64(__u64 x);
In include/asm-generic/bitops/fls64.h
This code largely copied from: include/linux/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: int fls(int
x);
In include/asm-generic/bitops/fls.h
This code largely copied from: include/linux/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: unsigned long
ffz(unsigned long word);
In include/asm-generic/bitops/ffz.h
This code largely copied from: include/asm-parisc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalent of the function: unsigned long
__ffs(unsigned long word);
In include/asm-generic/bitops/__ffs.h
This code largely copied from: include/asm-sparc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
void __set_bit(int nr, volatile unsigned long *addr);
void __clear_bit(int nr, volatile unsigned long *addr);
void __change_bit(int nr, volatile unsigned long *addr);
int __test_and_set_bit(int nr, volatile unsigned long *addr);
int __test_and_clear_bit(int nr, volatile unsigned long *addr);
int __test_and_change_bit(int nr, volatile unsigned long *addr);
int test_bit(int nr, const volatile unsigned long *addr);
In include/asm-generic/bitops/non-atomic.h
This code largely copied from: asm-powerpc/bitops.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the C-language equivalents of the functions below:
void set_bit(int nr, volatile unsigned long *addr);
void clear_bit(int nr, volatile unsigned long *addr);
void change_bit(int nr, volatile unsigned long *addr);
int test_and_set_bit(int nr, volatile unsigned long *addr);
int test_and_clear_bit(int nr, volatile unsigned long *addr);
int test_and_change_bit(int nr, volatile unsigned long *addr);
In include/asm-generic/bitops/atomic.h
This code largely copied from:
include/asm-powerpc/bitops.h
include/asm-parisc/bitops.h
include/asm-parisc/atomic.h
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all. The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().
This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very
few instances of this bug, if any. But the patch converts lots of open-coded
test to use the preferred helper macros.
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate all kernel bug printouts to begin with the "BUG: " string.
Makes it easier to find them in large bootup logs.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds the ability to mark symbols that will be changed in the
future, so that kernel modules that don't include MODULE_LICENSE("GPL")
and use the symbols, will be flagged and printed out to the system log.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If the 'ptr' is a const, this code cause "assignment of read-only variable"
error on gcc 4.x.
Use __u64 instead of __typeof__(*(ptr)) for temporary variable to get
rid of errors on gcc 4.x.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make new MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK consistent across all
arches. The idea is to make it possible to use them portably even before
distros include them in libc headers.
Move common flags to asm-generic/mman.h
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For some reason, the BITS_PER_LONG == 64 case of atomic_long_set
was using atomic_set instead of atomic64_set. This does not jive
with architectures which use an inline instead of a #define to
implement their atomic_set() primitives.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Noticed by Arjan originally on x86-64, then Ingo on x86, and finally me
grepping for it in the generic version.
Bad parenthesis nesting.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most arches copied the i386 ioctl.h. Combine them into a generic header.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add three (generic) mutex fastpath implementations.
The mutex-xchg.h implementation is atomic_xchg() based, and should
work fine on every architecture.
The mutex-dec.h implementation is atomic_dec_return() based - this
one too should work on every architecture, but might not perform the
most optimally on architectures that have no atomic-dec/inc instructions.
The mutex-null.h implementation forces all calls into the slowpath. This
is used for mutex debugging, but it can also be used on platforms that do
not want (or need) a fastpath at all.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Most of the architectures have the same asm/futex.h. This consolidates them
into asm-generic, with the arches including it from their own asm/futex.h.
In the case of UML, this reverts the old broken futex.h and goes back to using
the same one as almost everyone else.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kill L1_CACHE_SHIFT from all arches. Since L1_CACHE_SHIFT_MAX is not used
anymore with the introduction of INTERNODE_CACHE, kill L1_CACHE_SHIFT_MAX.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Generic prep-work for marking the .rodata section readonly:
* Align the rodata section at 4Kb boundary
* call the mark_rodata_ro() function when available
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Several counters already have the need to use 64 atomic variables on 64 bit
platforms (see mm_counter_t in sched.h). We have to do ugly ifdefs to fall
back to 32 bit atomic on 32 bit platforms.
The VM statistics patch that I am working on will also make more extensive
use of atomic64.
This patch introduces a new type atomic_long_t by providing definitions in
asm-generic/atomic.h that works similar to the c "long" type. Its 32 bits
on 32 bit platforms and 64 bits on 64 bit platforms.
Also cleans up the determination of the mm_counter_t in sched.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adding __initdata_* to asm-generic/sections.h
Replaces a lot of open coded externs in arch/x86_64/*
I had to change __bss_end to __bss_stop to match the other architectures.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds a RapidIO subsystem to the kernel. RIO is a switched fabric interconnect
used in higher-end embedded applications. The curious can look at the specs
over at http://www.rapidio.org
The core code implements enumeration/discovery, management of
devices/resources, and interfaces for RIO drivers.
There's a lot more to do to take advantages of all the hardware features.
However, this should provide a good base for folks with RIO hardware to start
contributing.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Updated several references to page_table_lock in common code comments.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
add a little to kernel size, change them to macros testing inline, calling
__pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
as fastcalls, leave that to CONFIG_REGPARM or not.
It also seems more natural for the out-of-line functions to leave the offset
calculation and map to the inline, which has to do it anyway for the common
case. At least mremap move wants __pte_alloc without _map.
Macros rather than inline functions, certainly to avoid the header file issues
which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
architectures I haven't built would have other such problems.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
zap_pte_range has been counting the pages it frees in tlb->freed, then
tlb_finish_mmu has used that to update the mm's rss. That got stranger when I
added anon_rss, yet updated it by a different route; and stranger when rss and
anon_rss became mm_counters with special access macros. And it would no
longer be viable if we're relying on page_table_lock to stabilize the
mm_counter, but calling tlb_finish_mmu outside that lock.
Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
business, just decrement the rss mm_counter in zap_pte_range (yes, there was
some point to batching the update, and a subsequent patch restores that). And
forget the anal paranoia of first reading the counter to avoid going negative
- if rss does go negative, just fix that bug.
Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
was being made of them. But arm26 alone was actually using the freed, in the
way some others use need_flush: give it a need_flush. arm26 seems to prefer
spaces to tabs here: respect that.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
mm's last user has gone and the whole mm is being torn down. And it's an
inline function because sparc64 uses a different (slightly better)
"tlb_frozen" name for the flag others call "fullmm".
And now the ptep_get_and_clear_full macro used in zap_pte_range refers
directly to tlb->fullmm, which would be wrong for sparc64. Rather than
correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
sparc64 to just use the same poor name as everyone else - is that okay?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
tlb_gather_mmu dates from before kernel preemption was allowed, and uses
smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works
because it's currently only called after getting page_table_lock, which is not
dropped until after the matching tlb_finish_mmu. But don't rely on that, it
will soon change: now disable preemption internally by proper get_cpu_var in
tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the ZERO_PAGE remapping complexity to the move_pte macro in
asm-generic, have it conditionally depend on
__HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.
For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
a noop.
From: Hugh Dickins <hugh@veritas.com>
Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
The "pte" at that point may be a swap entry or a pte_file entry: we must
check pte_present before perhaps corrupting such an entry.
Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
mm/mremap.c, and more dangerous there since it's affecting all arches: I
think the safest course is to send Nick's patch and Yoichi's build fix and
this fix (build tested) on to Linus - so only MIPS can be affected.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The generic TLB flush functions kept upto 506 pages per
CPU to avoid too frequent IPIs.
This value was done for the L1 cache of older x86 CPUs,
but with modern CPUs it does not make much sense anymore.
TLB flushing is slow enough that using the L2 cache is fine.
This patch increases the flush array on x86-64 to cache
5350 pages. That is roughly 20MB with 4K pages. It speeds
up large munmaps in multithreaded processes on SMP considerably.
The cost is roughly 42k of memory per CPU, which is reasonable.
I only increased it on x86-64 for now, but it would probably
make sense to increase it everywhere. Embedded architectures
with SMP may keep it smaller to save some memory per CPU.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Inside the linker script, insert the code for DWARF debug info sections. This
may help GDB'ing a Uml binary. Actually, it seems that ld is able to guess
what I added correctly, but normal linker scripts include this section so it
should be correct anyway adding it.
On request by Sam Ravnborg <sam@ravnborg.org>, I've added it to
asm-generic/vmlinux.lds.s. I've also moved there the stabs debug section,
used the new macro in i386 linker script and added DWARF debug section to
that.
In the truth, I've not been able to verify the difference in GDB behaviour
after this change (I've seen large improvements with another patch). This
may depend on my binutils version, older one may have worse defaults.
However, this section is present in normal linker script, so add it at
least for the sake of cleanness.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There were three changes necessary in order to allow
sparc64 to use setup-res.c:
1) Sparc64 roots the PCI I/O and MEM address space using
parent resources contained in the PCI controller structure.
I'm actually surprised no other platforms do this, especially
ones like Alpha and PPC{,64}. These resources get linked into the
iomem/ioport tree when PCI controllers are probed.
So the hierarchy looks like this:
iomem --|
PCI controller 1 MEM space --|
device 1
device 2
etc.
PCI controller 2 MEM space --|
...
ioport --|
PCI controller 1 IO space --|
...
PCI controller 2 IO space --|
...
You get the idea. The drivers/pci/setup-res.c code allocates
using plain iomem_space and ioport_space as the root, so that
wouldn't work with the above setup.
So I added a pcibios_select_root() that is used to handle this.
It uses the PCI controller struct's io_space and mem_space on
sparc64, and io{port,mem}_resource on every other platform to
keep current behavior.
2) quirk_io_region() is buggy. It takes in raw BUS view addresses
and tries to use them as a PCI resource.
pci_claim_resource() expects the resource to be fully formed when
it gets called. The sparc64 implementation would do the translation
but that's absolutely wrong, because if the same resource gets
released then re-claimed we'll adjust things twice.
So I fixed up quirk_io_region() to do the proper pcibios_bus_to_resource()
conversion before passing it on to pci_claim_resource().
3) I was mistakedly __init'ing the function methods the PCI controller
drivers provide on sparc64 to implement some parts of these
routines. This was, of course, easy to fix.
So we end up with the following, and that nasty SPARC64 makefile
ifdef in drivers/pci/Makefile is finally zapped.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are possible race conditions if probes are placed on routines within the
kprobes files and routines used by the kprobes. For example if you put probe
on get_kprobe() routines, the system can hang while inserting probes on any
routine such as do_fork(). Because while inserting probes on do_fork(),
register_kprobes() routine grabs the kprobes spin lock and executes
get_kprobe() routine and to handle probe of get_kprobe(), kprobes_handler()
gets executed and tries to grab kprobes spin lock, and spins forever. This
patch avoids such possible race conditions by preventing probes on routines
within the kprobes file and routines used by kprobes.
I have modified the patches as per Andi Kleen's suggestion to move kprobes
routines and other routines used by kprobes to a seperate section
.kprobes.text.
Also moved page fault and exception handlers, general protection fault to
.kprobes.text section.
These patches have been tested on i386, x86_64 and ppc64 architectures, also
compiled on ia64 and sparc64 architectures.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch gathers all the struct flock64 definitions (and the operations),
puts them under !CONFIG_64BIT and cleans up the arch files.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch just gathers together all the struct flock definitions except
xtensa into asm-generic/fcntl.h.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch puts the most popular of each fcntl operation/flag into
asm-generic/fcntl.h and cleans up the arch files.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch puts the most popular of each open flag into asm-generic/fcntl.h
and cleans up the arch files.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This set of patches creates asm-generic/fcntl.h and consolidates as much as
possible from the asm-*/fcntl.h files into it.
This patch just gathers all the identical bits of the asm-*/fcntl.h files into
asm-generic/fcntl.h.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I've rewriten Atushi's fix for the 64-bit put_unaligned on 32-bit systems
bug to generate more efficient code.
This case has buzilla URL http://bugzilla.kernel.org/show_bug.cgi?id=5138.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
unused and useless..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
update for an address space in the process of being destroyed.
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Someone mentioned that almost all the architectures used basically the same
implementation of get_order. This patch consolidates them into
asm-generic/page.h and includes that in the appropriate places. The
exceptions are ia64 and ppc which have their own (presumably optimised)
versions.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In yenta_socket, we default to using the resource setting of the CardBus
bridge. However, this is a PCI-bus-centric view of resources and thus needs
to be converted to generic resources first. Therefore, add a call to
pcibios_bus_to_resource() call in between. This function is a mere wrapper on
x86 and friends, however on some others it already exists, is added in this
patch (alpha, arm, ppc, ppc64) or still needs to be provided (parisc -- where
is its pcibios_resource_to_bus() ?).
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Minor cleanup.
Move things into their include files, remove obsolete includes, fix
indentation, remove obsolete special cases etc.
I also added the per cpu section to asm-generic/sections.h and fixed
init/main.c to use it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the kernel is working well and we want to restart cleanly
kernel_restart is the function to use. But in many instances
the kernel wants to reboot when thing are expected to be working
very badly such as from panic or a software watchdog handler.
This patch adds the function emergency_restart() so that
callers can be clear what semantics they expect when calling
restart. emergency_restart() is expected to be callable
from interrupt context and possibly reliable in even more
trying circumstances.
This is an initial generic implementation for all architectures.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Several reports on inconsistent kallsyms data has been caused by the aliased symbols
__sched_text_start and __down to shift places in the output of nm.
The root cause was that on second pass ld aligned __sched_text_start to a 4 byte boundary
which is the function alignment on i386.
sched.text and spinlock.text is now aligned to an 8 byte boundary to make sure they
are aligned to a function alignemnt on most (all?) archs.
Tested by: Paulo Marques <pmarques@grupopie.com>
Tested by: Alexander Stohr <Alexander.Stohr@gmx.de>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
In vmlinux.lds.h the code is carefull to define every section so vmlinux
properly reports the correct physical load address of code, as well as
it's virtual address.
The new SECURITY_INIT definition fails to follow that convention and
and causes incorrect physical address to appear in the vmlinux if
there are any security initcalls.
This patch updates the SECURITY_INIT to follow the convention in the rest of
the file.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix (in the architectures I'm actually building for) the UP definition of
per_cpu so that the cpu specified may be any expression, not just an
identifier or a suffix expression.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define pcibus_to_node to be able to figure out which NUMA node contains a
given PCI device. This defines pcibus_to_node(bus) in
include/linux/topology.h and adjusts the macros for i386 and x86_64 that
already provided a way to determine the cpumask of a pci device.
x86_64 was changed to not build an array of cpumasks anymore. Instead an
array of nodes is build which can be used to generate the cpumask via
node_to_cpumask.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.
sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.
But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address. So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.
Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PPC32 kernel puts platform-specific functions into separate sections so
that unneeded parts of it can be freed when we've booted and actually
worked out what we're running on today.
This makes kallsyms ignore those functions, because they're not between
_[se]text or _[se]inittext. Rather than teaching kallsyms about the
various pmac/chrp/etc sections, this patch adds '_[se]extratext' markers
for kallsyms.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
New file - asm-generic/signal.h. Contains declarations of
__sighandler_t, __sigrestore_t, SIG_DFL, SIG_IGN, SIG_ERR and default
definitions of SIG_BLOCK, SIG_UNBLOCK and SIG_SETMASK.
asm-*/signal.h switched to including it. The only exception is
asm-parisc/signal.h that wants its own declaration of __sighandler_t;
that one is left as-is.
asm-ppc64/signal.h required one more thing - unlike everybody else it
used __sigrestorer_t instead of usual __sigrestore_t. PPC64 switched to
common spelling.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Discussing with Matthew Wilcox some of his outstanding patches lead me to
this patch (among others).
The preamble in struct sigevent can be expressed independently of the
architecture.
Also use __ARCH_SI_PREAMBLE_SIZE on ia64.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add EOWNERDEAD and ENOTRECOVERABLE to all architectures. This is to
support the upcoming patches for robust mutexes.
We normally don't reserve parts of the name/number space for external
patches, but robust mutexes are sufficiently popular and important to
justify it in this case.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove PAGE_BUG - repalce it with BUG and BUG_ON.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch eliminates all kernel BUGs, trims about 35k off the typical
kernel, and makes the system slightly faster.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a pair of rlimits for allowing non-root tasks to raise nice and rt
priorities. Defaults to traditional behavior. Originally written by
Chris Wright.
The patch implements a simple rlimit ceiling for the RT (and nice) priorities
a task can set. The rlimit defaults to 0, meaning no change in behavior by
default. A value of 50 means RT priority levels 1-50 are allowed. A value of
100 means all 99 privilege levels from 1 to 99 are allowed. CAP_SYS_NICE is
blanket permission.
(akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for
tips on integrating this with PAM).
Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Turns __get_unaligned() and __put_unaligned into macros. That is
definitely safe; leaving them as inlines breaks on e.g. alpha [try to
build ncpfs there and you'll get unresolved symbols since we end up
getting __get_unaligned() not inlined].
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ia64 and sparc64 hurriedly had to introduce their own variants of
pgd_addr_end, to leapfrog over the holes in their virtual address spaces which
the final clear_page_range suddenly presented when converted from pgd_index to
pgd_addr_end. But now that free_pgtables respects the vma list, those holes
are never presented, and the arch variants can go.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the new io infrastructure, all of our operators are expecting the
underlying device to be little endian (because the PCI bus, their main
consumer, is LE).
However, there are a fair few devices and busses in the world that are
actually Big Endian. There's even evidence that some of these BE bus and
chip types are attached to LE systems. Thus, there's a need for a BE
equivalent of our io{read,write}{16,32} operations.
The attached patch adds this as io{read,write}{16,32}be. When it's in,
I'll add the first consume (the 53c700 SCSI chip driver).
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!