Current versions of ipvsadm include "/usr/src/linux/include/net/ip_vs.h"
directly. This file also contains kernel-only definitions. Normally, public
definitions should live in include/linux, so this patch moves the
definitions shared with userspace to a new file, "include/linux/ip_vs.h".
This also removes the unused NFC_IPVS_PROPERTY bitmask, which was once
used to point into skb->nfcache.
To make old ipvsadms still compile with this, the old header file includes
the new one.
Thanks to Dave Miller and Horms for noting/adding the missing Kbuild entry
for the new header file.
Signed-off-by: Julius Volz <juliusv@google.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When support for multiple TX queues were added, the
netif_tx_lock() routines we converted to iterate over
all TX queues and grab each queue's spinlock.
This causes heartburn for lockdep and it's not a healthy
thing to do with lots of TX queues anyways.
So modify this to use a top-level lock and a "frozen"
state for the individual TX queues.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sysfs has the _ATTR() and _ATTR_RO() macros to make defining extended
form attributes easier. configfs should have something similiar.
- _CONFIGFS_ATTR() and _CONFIGFS_ATTR_RO() are the counterparts to the
sysfs macros.
- CONFIGFS_ATTR_STRUCT() creates the extended form attribute structure.
- CONFIGFS_ATTR_OPS() defines the show_attribute()/store_attribute()
operations that call the show()/store() operations of the extended
form configfs_attributes.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We now use PTR_ERR() in the ->make_item() and ->make_group() operations.
Folks including configfs.h need err.h.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When we traverse the graph, either forwards or backwards, we
are interested in whether a certain property exists somewhere
in a node reachable in the graph.
Therefore it is never necessary to traverse through a node more
than once to get a correct answer to the given query.
Take advantage of this property using a global ID counter so that we
need not clear all the markers in all the lock_class entries before
doing a traversal. A new ID is choosen when we start to traverse, and
we continue through a lock_class only if it's ID hasn't been marked
with the new value yet.
This short-circuiting is essential especially for high CPU count
systems. The scheduler has a runqueue per cpu, and needs to take
two runqueue locks at a time, which leads to long chains of
backwards and forwards subgraphs from these runqueue lock nodes.
Without the short-circuit implemented here, a graph traversal on
a runqueue lock can take up to (1 << (N - 1)) checks on a system
with N cpus.
For anything more than 16 cpus or so, lockdep will eventually bring
the machine to a complete standstill.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Found an interactivity problem on a quad core test-system - simple
CPU loops would occasionally delay the system un an unacceptable way.
After much debugging with Peter Zijlstra it turned out that the problem
is caused by the string of sched_clock() changes - they caused the CPU
clock to jump backwards a bit - which confuses the scheduler arithmetics.
(which is unsigned for performance reasons)
So revert:
# c300ba2: sched_clock: and multiplier for TSC to gtod drift
# c0c8773: sched_clock: only update deltas with local reads.
# af52a90: sched_clock: stop maximum check on NO HZ
# f7cce27: sched_clock: widen the max and min time
This solves the interactivity problems.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
In order to time out dead connections quicker, keep track of outstanding data
and cap the timeout.
Suggested by Herbert Xu.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add support for the RDC 1010 variant
- Rework the core library to have a read_id method. This allows the hacky
bits of it821x to go and prepares us for pata_hd
- Switch from WARN to BUG in ata_id_string as it will reboot if you get
it wrong so WARN won't be seen
- Allow the issue of command 0xFC on the 821x. This is needed to query
rebuild status.
- Tidy up printk formatting
- Do more ident rewriting on RAID volumes to handle firmware provided
ident data which is rather wonky
- Report the firmware revision and device layout in RAID mode
- Don't try and disable raid on the 8211 or RDC - they don't have the
relevant bits
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/mm: Lockless get_user_pages_fast() for 64-bit (v3)
powerpc: Don't use the wrong thread_struct for ptrace get/set VSX regs
powerpc: Fix ptrace buffer size for VSX
powerpc: Correctly hookup PTRACE_GET/SETVSRREGS for 32 bit processes
ide/powermac: Fix use of uninitialized pointer on media-bay
powerpc: Allow non-hcall return values for lparcfg writes
ipmi/powerpc: Use linux/of_{device,platform}.h instead of asm
powerpc/fsl: proliferate simple-bus compatibility to soc nodes
Documentation: remove old sbc8260 board specific information
cpm2: Rework baud rate generators configuration to support external clocks.
powerpc: rtc_cmos_setup: assign interrupts only if there is i8259 PIC
cpm_uart: Add generic clock API support to set baudrates
cpm_uart: Modem control lines support
powerpc: implement GPIO LIB API on CPM1 Freescale SoC.
cpm2: Implement GPIO LIB API on CPM2 Freescale SoC.
powerpc: Fix 8xx build failure
powerpc: clean up the Book-E HW watchpoint support
when you take the address of the result. Noticed on a sparc64 compile
using a version 3.4.5 cross compiler.
kernel/time/tick-common.c: In function `tick_check_new_device':
kernel/time/tick-common.c:210: error: invalid lvalue in unary `&'
...
Just make it a regular expression.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
net: Make "networking" one-click deselectable.
ipv6: Fix useless proc net sockstat6 removal
tcp: MD5: Use MIB counter instead of warning for MD5 mismatch.
pkt_sched: Fix OOPS on ingress qdisc add.
niu: Fix error checking in niu_ethflow_to_class.
IPv6: datagram_send_ctl() should exit immediately when an error occured
mac80211: fix mesh beaconing
PS3: gelic: use unsigned long for irqflags
mac80211: fix cfg80211 hooks for master interface
nl80211: fix dump callbacks
mac80211: partially fix skb->cb use
rtl8187: Improve wireless statistics for RTL8187B
rtl8187: Fix for TX sequence number problem
mac80211: append CONFIG_ to MAC80211_VERBOSE_PS_DEBUG in net/mac80211/tx.c.
mac80211: fix sparse integer as NULL pointer warning
drivers/net/wireless/iwlwifi/iwl-led.c: printk fix
mac80211: return correct error return from ieee80211_wep_init
mac80211: tx, use dev_kfree_skb_any for beacon_get
rt2x00: Clear queue entry flags during initialization
rt2x00: Force full register config after start()
...
zap_vma_ptes() is intended to be used by drivers to unmap ptes assigned to the
driver private vmas. This interface is similar to zap_page_range() but is
less general & less likely to be abused.
Needed by the GRU driver.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file kernel.h contains the upper_32_bits macro. This patch adds the
other part, the lower_32_bits macro. Its first use will be in the driver
for AMD IOMMU.
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a BlackBoard user to connector. BlackBoard is part of the TSP GPL
sampling framework (http://savannah.nongnu.org/p/tsp)
[akpm@linux-foundation.org: add comment]
Signed-off-by: Jerome Arbez-Gindre <jeromearbezgindre@gmail.com>
Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It has no user now
Also print out info about adding/removing active regions.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo Molnar provided a fix to not call _PPC at processor driver
initialization time in "[PATCH] ACPI: fix cpufreq regression" (git
commit e4233dec74)
But it can still happen that _PPC is called at processor driver
initialization time.
This patch should make sure that this is not possible anymore.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid one-off errors by introducing a resource_size() function.
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current style for debug messages is to ensure they're always
parsed by the compiler and then subjected to dead code removal.
That way builds won't break only when debug options get enabled,
which is common when they are stripped out early by CPP.
This patch makes CONFIG_MTD_DEBUG adopt that convention.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Current implementation of subpage read feature for NAND has issues with
small page devices. Small page NAND do not support RNDOUT command.
So subpage feature is not applicable for them.
This patch disables support of subpage for small page NAND.
The code is verified on nandsim(SP NAND simulation) and on LP NAND
devices.
Thanks a lot to Artem for finding this issue.
Signed-off-by: Alexey Korolev <akorolev@infradead.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This adds a regulator driver for the TI bq24022 Single-Chip
Li-Ion Charger with its nCE and ISET2 pins connected to GPIOs.
Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com>
Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
This patch adds support for fixed regulators. This class of regulator is
not software controllable but can coexist on machines with software
controlable regulators.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
This interface is for machine specific code and allows the creation of
voltage/current domains (with constraints) for each regulator. It can
provide regulator constraints that will prevent device damage through
overvoltage or over current caused by buggy client drivers. It also
allows the creation of a regulator tree whereby some regulators are
supplied by others (similar to a clock tree).
Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
This allows regulator drivers to register their regulators and provide
operations to the core. It also has a notifier call chain for propagating
regulator events to clients.
Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Add support to allow consumer device drivers to control their regulator
power supply.
This uses a similar API to the kernel clock interface in that consumer
drivers can get and put a regulator (like they can with clocks atm) and
get/set voltage, current limit, mode, enable and disable. This should
allow consumers complete control over their supply voltage and current
limit. This also compiles out if not in use so drivers can be reused in
systems with no regulator based power control.
Signed-off-by: Liam Girdwood <lg@opensource.wolfsonmicro.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Implement lockless get_user_pages_fast for 64-bit powerpc.
Page table existence is guaranteed with RCU, and speculative page references
are used to take a reference to the pages without having a prior existence
guarantee on them.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch fixes mac80211 to not use the skb->cb over the queue step
from virtual interfaces to the master. The patch also, for now,
disables aggregation because that would still require requeuing,
will fix that in a separate patch. There are two other places (software
requeue and powersaving stations) where requeue can happen, but that is
not currently used by any drivers/not possible to use respectively.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Reorder fields in struct rfkill and add comments to make it clear
which fields are protected by rfkill->mutex.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch cleans up the handling of the maple bus queue to remove
the risk of races when adding packets. It also removes references to the
redundant connect and disconnect functions.
Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
This IOMMU helper function doesn't work for some architectures:
http://marc.info/?l=linux-kernel&m=121699304403202&w=2
It also breaks POWER and SPARC builds:
http://marc.info/?l=linux-kernel&m=121730388001890&w=2
Currently, only x86 IOMMUs use this so let's move it to x86 for
now.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Synchronize changes to host virtual addresses which are part of
a KVM memory slot to the KVM shadow mmu. This allows pte operations
like swapping, page migration, and madvise() to transparently work
with KVM.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
* 'for-linus' of git://git.o-hand.com/linux-mfd:
mfd: accept pure device as a parent, not only platform_device
mfd: add platform_data to mfd_cell
mfd: Coding style fixes
mfd: Use to_platform_device instead of container_of
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits)
x86/PCI: use dev_printk when possible
PCI: add D3 power state avoidance quirk
PCI: fix bogus "'device' may be used uninitialized" warning in pci_slot
PCI: add an option to allow ASPM enabled forcibly
PCI: disable ASPM on pre-1.1 PCIe devices
PCI: disable ASPM per ACPI FADT setting
PCI MSI: Don't disable MSIs if the mask bit isn't supported
PCI: handle 64-bit resources better on 32-bit machines
PCI: rewrite PCI BAR reading code
PCI: document pci_target_state
PCI hotplug: fix typo in pcie hotplug output
x86 gart: replace to_pages macro with iommu_num_pages
x86, AMD IOMMU: replace to_pages macro with iommu_num_pages
iommu: add iommu_num_pages helper function
dma-coherent: add documentation to new interfaces
Cris: convert to using generic dma-coherent mem allocator
Sh: use generic per-device coherent dma allocator
ARM: support generic per-device coherent dma mem
Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE
x86: use generic per-device dma coherent allocator
...
When we read some part of a file through pagecache, if there is a
pagecache of corresponding index but this page is not uptodate, read IO
is issued and this page will be uptodate.
I think this is good for pagesize == blocksize environment but there is
room for improvement on pagesize != blocksize environment. Because in
this case a page can have multiple buffers and even if a page is not
uptodate, some buffers can be uptodate.
So I suggest that when all buffers which correspond to a part of a file
that we want to read are uptodate, use this pagecache and copy data from
this pagecache to user buffer even if a page is not uptodate. This can
reduce read IO and improve system throughput.
I wrote a benchmark program and got result number with this program.
This benchmark do:
1: mount and open a test file.
2: create a 512MB file.
3: close a file and umount.
4: mount and again open a test file.
5: pwrite randomly 300000 times on a test file. offset is aligned
by IO size(1024bytes).
6: measure time of preading randomly 100000 times on a test file.
The result was:
2.6.26
330 sec
2.6.26-patched
226 sec
Arch:i386
Filesystem:ext3
Blocksize:1024 bytes
Memory: 1GB
On ext3/4, a file is written through buffer/block. So random read/write
mixed workloads or random read after random write workloads are optimized
with this patch under pagesize != blocksize environment. This test result
showed this.
The benchmark program is as follows:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#define LEN 1024
#define LOOP 1024*512 /* 512MB */
main(void)
{
unsigned long i, offset, filesize;
int fd;
char buf[LEN];
time_t t1, t2;
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
memset(buf, 0, LEN);
fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
for (i = 0; i < LOOP; i++)
write(fd, buf, LEN);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
perror("cannot mount\n");
exit(1);
}
fd = open("/root/test1/testfile", O_RDWR);
if (fd < 0) {
perror("cannot open file\n");
exit(1);
}
filesize = LEN * LOOP;
for (i = 0; i < 300000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pwrite(fd, buf, LEN, offset);
}
printf("start test\n");
time(&t1);
for (i = 0; i < 100000; i++){
offset = (random() % filesize) & (~(LEN - 1));
pread(fd, buf, LEN, offset);
}
time(&t2);
printf("%ld sec\n", t2-t1);
close(fd);
if (umount("/root/test1/") < 0) {
perror("cannot umount\n");
exit(1);
}
}
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_take_all_locks holds off reclaim from an entire mm_struct. This allows
mmu notifiers to register into the mm at any time with the guarantee that
no mmu operation is in progress on the mm.
This operation locks against the VM for all pte/vma/mm related operations
that could ever happen on a certain mm. This includes vmtruncate,
try_to_unmap, and all page faults.
The caller must take the mmap_sem in write mode before calling
mm_take_all_locks(). The caller isn't allowed to release the mmap_sem
until mm_drop_all_locks() returns.
mmap_sem in write mode is required in order to block all operations that
could modify pagetables and free pages without need of altering the vma
layout (for example populate_range() with nonlinear vmas). It's also
needed in write mode to avoid new anon_vmas to be associated with existing
vmas.
A single task can't take more than one mm_take_all_locks() in a row or it
would deadlock.
mm_take_all_locks() and mm_drop_all_locks are expensive operations that
may have to take thousand of locks.
mm_take_all_locks() can fail if it's interrupted by signals.
When mmu_notifier_register returns, we must be sure that the driver is
notified if some task is in the middle of a vmtruncate for the 'mm' where
the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
is run around the vmtruncation but mmu_notifier_register can run after
mmu_notifier_invalidate_range_start and before
mmu_notifier_invalidate_range_end). Same problem for rmap paths. And
we've to remove page pinning to avoid replicating the tlb_gather logic
inside KVM (and GRU doesn't work well with page pinning regardless of
needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
the page, kvm would have no way to notice that it mapped into sptes a page
that is going into the freelist without a chance of any further
mmu_notifier notification.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding platform_data to mfd_cell allows passing of platform data directly
to the platform_device created for each cell and thus reuse of existing
drivers.
On the other side it can be used as a hook to mfd_cell itself
removing the need in mfd_get_cell method.
Signed-off-by: Mike Rapoport <mike@compulab.co.il>
Acked-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@openedhand.com>
Libata has some hacks to deal with certain controllers going silly in D3
state. The right way to handle this is to keep a PCI device flag for
such devices. That can then be generalised for no ATA devices with power
problems.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Disable ASPM on pre-1.1 PCIe devices, as many of them don't implement it
correctly.
Tested-by: Jack Howarth <howarth@bromo.msbb.uc.edu>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The ACPI FADT table includes an ASPM control bit. If the bit is set, do
not enable ASPM since it may indicate that the platform doesn't actually
support the feature.
Tested-by: Jack Howarth <howarth@bromo.msbb.uc.edu>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Clean up and optimize cpumask_of_cpu(), by sharing all the zero words.
Instead of stupidly generating all possible i=0...NR_CPUS 2^i patterns
creating a huge array of constant bitmasks, realize that the zero words
can be shared.
In other words, on a 64-bit architecture, we only ever need 64 of these
arrays - with a different bit set in one single world (with enough zero
words around it so that we can create any bitmask by just offsetting in
that big array). And then we just put enough zeroes around it that we
can point every single cpumask to be one of those things.
So when we have 4k CPU's, instead of having 4k arrays (of 4k bits each,
with one bit set in each array - 2MB memory total), we have exactly 64
arrays instead, each 8k bits in size (64kB total).
And then we just point cpumask(n) to the right position (which we can
calculate dynamically). Once we have the right arrays, getting
"cpumask(n)" ends up being:
static inline const cpumask_t *get_cpu_mask(unsigned int cpu)
{
const unsigned long *p = cpu_bit_bitmap[1 + cpu % BITS_PER_LONG];
p -= cpu / BITS_PER_LONG;
return (const cpumask_t *)p;
}
This brings other advantages and simplifications as well:
- we are not wasting memory that is just filled with a single bit in
various different places
- we don't need all those games to re-create the arrays in some dense
format, because they're already going to be dense enough.
if we compile a kernel for up to 4k CPU's, "wasting" that 64kB of memory
is a non-issue (especially since by doing this "overlapping" trick we
probably get better cache behaviour anyway).
[ mingo@elte.hu:
Converted Linus's mails into a commit. See:
http://lkml.org/lkml/2008/7/27/156http://lkml.org/lkml/2008/7/28/320
Also applied a family filter - which also has the side-effect of leaving
out the bits where Linus calls me an idio... Oh, never mind ;-)
]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
powerpc: Disable 64K hugetlb support when doing 64K SPU mappings
powerpc/powermac: Fixup default serial port device for pmac_zilog
powerpc/powermac: Use sane default baudrate for SCC debugging
powerpc/mm: Implement _PAGE_SPECIAL & pte_special() for 64-bit
powerpc: Show processor cache information in sysfs
powerpc: Make core id information available to userspace
powerpc: Make core sibling information available to userspace
powerpc/vio: More fallout from dma_mapping_error API change
ibmveth: Fix multiple errors with dma_mapping_error conversion
powerpc/pseries: Fix CMO sysdev attribute API change fallout
powerpc: Enable tracehook for the architecture
powerpc: Add TIF_NOTIFY_RESUME support for tracehook
powerpc: Add asm/syscall.h with the tracehook entry points
powerpc: Make syscall tracing use tracehook.h helpers
powerpc: Call tracehook_signal_handler() when setting up signal frames
powerpc: Update cpu_sibling_maps dynamically
powerpc: register_cpu_online should be __cpuinit
powerpc: kill useless SMT code in prom_hold_cpus
powerpc: Fix 8xx build failure
powerpc: Fix vio build warnings
...
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (72 commits)
sh: SuperH Mobile CEU and camera platform data for AP325RXA
sh: Update smc911x platform data for AP325RXA
sh: SuperH Mobile LCDC platform data for AP325RXA
sh: Add SuperH Mobile CEU platform data for Migo-R
sh: Add SuperH Mobile LCDC platform data for Migo-R
sh: Move asid_cache() out of ifdef to fix SH-3/4 nommu build.
sh: Workaround for __put_user_asm() bug with gcc 4.x on big-endian.
sh: Wire up new syscalls.
sh: fix uImage Entry Point
sh_keysc: remove request_mem_region() and release_mem_region()
sh: Don't miss pending signals returning to user mode after signal processing
sh: Use clk_always_enable() on sh7366
sh: Use clk_always_enable() on sh7343 / SE77343
sh: Use clk_always_enable() on sh7722 / Migo-R / SE7722
sh: Use clk_always_enable() on sh7723 / ap325rxa
sh: Introduce clk_always_enable() function
sh: Show all clocks and their state in /proc/clocks
sh: Merge sh7343 and sh7722 clock code
sh: Add SuperH Mobile MSTPCR bits to clock framework
sh: Use arch_flags to simplify sh7722 siu clock code
...
The connect and disconnect functions are unnecessary - everything they do can be
accomplished in the initial probe - so remove them.
Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
This add a dcache entry to the dcache for lookup, but changing the name
that is associated with the entry rather than the one passed in to the
lookup routine.
First, it sees if the case-exact match already exists in the dcache and
uses it if one exists. Otherwise, it allocates a new node with the new
name and splices it into the dcache.
Original code from ntfs_lookup in fs/ntfs/namei.c by Anton Altaparmakov.
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Christoph Hellwig <hch@infradead.org>
Instead of a "cpu" arg with magic values NR_CPUS (any cpu) and ~0 (all
cpus), pass a cpumask_t. Allow NULL for the common case (where we
don't care which CPU the function is run on): temporary cpumask_t's
are usually considered bad for stack space.
This deprecates stop_machine_run, to be removed soon when all the
callers are dead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
stop_machine creates a kthread which creates kernel threads. We can
create those threads directly and simplify things a little. Some care
must be taken with CPU hotunplug, which has special needs, but that code
seems more robust than it was in the past.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
-allow stop_mahcine_run() to call a function on all cpus. Calling
stop_machine_run() with a 'ALL_CPUS' invokes this new behavior.
stop_machine_run() proceeds as normal until the calling cpu has
invoked 'fn'. Then, we tell all the other cpus to call 'fn'.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
Simplify the code of include/linux/task_io_accounting.h.
It is also more reasonable to have all the task i/o-related statistics in a
single struct (task_io_accounting).
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
firewire: state userland requirements in Kconfig help
firewire: avoid memleak after phy config transmit failure
firewire: fw-ohci: TSB43AB22/A dualbuffer workaround
firewire: queue the right number of data
firewire: warn on unfinished transactions during card removal
firewire: small fw_fill_request cleanup
firewire: fully initialize fw_transaction before marking it pending
firewire: fix race of bus reset with request transmission
Put all i/o statistics in struct proc_io_accounting and use inline functions to
initialize and increment statistics, removing a lot of single variable
assignments.
This also reduces the kernel size as following (with CONFIG_TASK_XACCT=y and
CONFIG_TASK_IO_ACCOUNTING=y).
text data bss dec hex filename
11651 0 0 11651 2d83 kernel/exit.o.before
11619 0 0 11619 2d63 kernel/exit.o.after
10886 132 136 11154 2b92 kernel/fork.o.before
10758 132 136 11026 2b12 kernel/fork.o.after
3082029 807968 4818600 8708597 84e1f5 vmlinux.o.before
3081869 807968 4818600 8708437 84e155 vmlinux.o.after
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VID_TYPE defines are V4L1 specific, so copy them back to videodev.h.
In videodev2.h ensure that they are not used in the kernel (you need
to include videodev.h instead) and mark them are deprecated.
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
SPCA505 and SPCA508 added in the pixel formats.
Decode functions and associated resources removed in spca505, 506 and 508.
The decode routines are now found in the V4L library.
Signed-off-by: Jean-Francois Moine <moinejf@free.fr>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
mlx4: Update/add Mellanox Technologies copyright lines to mlx4 driver files
mlx4_core: Add VLAN tag field to WQE control segment struct
RDMA/nes: CM connection setup/teardown rework
IPoIB: Correct help text for INFINIBAND_IPOIB_DEBUG
IPoIB/cm: Connected mode is no longer EXPERIMENTAL
RDMA/ucm: BKL is not needed for ib_ucm_open()
RDMA/ucma: BKL is not needed for ucma_open()
* git://git.infradead.org/mtd-2.6: (57 commits)
[MTD] [NAND] subpage read feature as a way to increase performance.
CPUFREQ: S3C24XX NAND driver frequency scaling support.
[MTD][NAND] au1550nd: remove unused variable
[MTD] jedec_probe: Fix SST 16-bit chip detection
[MTD][MTDPART] Fix a division by zero bug
[MTD][MTDPART] Cleanup and document the erase region handling
[MTD][MTDPART] Handle most checkpatch findings
[MTD][MTDPART] Seperate main loop from per-partition code in add_mtd_partition
[MTD] physmap: resume already suspended chips on failure to suspend
[MTD] physmap: Fix suspend/resume/shutdown bugs.
[MTD] [NOR] Fix -ETIMEO errors in CFI driver
[MTD] [NAND] fsl_elbc_nand: fix section mismatch with CONFIG_MTD_OF_PARTS=y
[JFFS2] Use .unlocked_ioctl
[MTD] Fix const assignment in the MTD command line partitioning driver
[MTD] [NOR] gen_probe: No debug message when debugging is disabled
[MTD] [NAND] remove __PPC__ hardcoded address from DiskOnChip drivers
[MTD] [MAPS] Remove the bast-flash driver.
[MTD] [NAND] fsl_elbc_nand: ecclayout cleanups
[MTD] [NAND] fsl_elbc_nand: implement support for flash-based BBT
[MTD] [NAND] fsl_elbc_nand: fix OOB workability for large page NAND chips
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc:
atmel-mci: debugfs support
mmc: Add per-card debugfs support
mmc: Export internal host state through debugfs
imxmmc: fix crash when no platform data is provided
imxmmc: fix platform resources
imxmmc: remove DEBUG definition
mmc_spi: put signals to low power off fix
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
[PATCH] fix RLIM_NOFILE handling
[PATCH] get rid of corner case in dup3() entirely
[PATCH] remove remaining namei_{32,64}.h crap
[PATCH] get rid of indirect users of namei.h
[PATCH] get rid of __user_path_lookup_open
[PATCH] f_count may wrap around
[PATCH] dup3 fix
[PATCH] don't pass nameidata to __ncp_lookup_validate()
[PATCH] don't pass nameidata to gfs2_lookupi()
[PATCH] new (local) helper: user_path_parent()
[PATCH] sanitize __user_walk_fd() et.al.
[PATCH] preparation to __user_walk_fd cleanup
[PATCH] kill nameidata passing to permission(), rename to inode_permission()
[PATCH] take noexec checks to very few callers that care
Re: [PATCH 3/6] vfs: open_exec cleanup
[patch 4/4] vfs: immutable inode checking cleanup
[patch 3/4] fat: dont call notify_change
[patch 2/4] vfs: utimes cleanup
[patch 1/4] vfs: utimes: move owner check into inode_change_ok()
[PATCH] vfs: use kstrdup() and check failing allocation
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix ip_rt_frag_needed rt_is_expired
netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
netfilter: fix double-free and use-after free
netfilter: arptables in netns for real
netfilter: ip{,6}tables_security: fix future section mismatch
selinux: use nf_register_hooks()
netfilter: ebtables: use nf_register_hooks()
Revert "pkt_sched: sch_sfq: dump a real number of flows"
qeth: use dev->ml_priv instead of dev->priv
syncookies: Make sure ECN is disabled
net: drop unused BUG_TRAP()
net: convert BUG_TRAP to generic WARN_ON
drivers/net: convert BUG_TRAP to generic WARN_ON
Remove the following warning when CONFIG_HUGETLB_PAGE is not set:
ipc/shm.c: In function `shm_get_stat':
ipc/shm.c:565: warning: unused variable `h'
[akpm@linux-foundation.org: use tabs, not spaces]
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs.h needs path.h, not namei.h; nfs_fs.h doesn't need it at all.
Several places in the tree needed direct include.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* do not pass nameidata; struct path is all the callers want.
* switch to new helpers:
user_path_at(dfd, pathname, flags, &path)
user_path(pathname, &path)
user_lpath(pathname, &path)
user_path_dir(pathname, &path) (fail if not a directory)
The last 3 are trivial macro wrappers for the first one.
* remove nameidata in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new ia_valid flag: ATTR_TIMES_SET, to handle the
UTIMES_OMIT/UTIMES_NOW and UTIMES_NOW/UTIMES_OMIT cases. In these
cases neither ATTR_MTIME_SET nor ATTR_ATIME_SET is in the flags, yet
the POSIX draft specifies that permission checking is performed the
same way as if one or both of the times was explicitly set to a
timestamp.
See the path "vfs: utimensat(): fix error checking for
{UTIME_NOW,UTIME_OMIT} case" by Michael Kerrisk for the patch
introducing this behavior.
This is a cleanup, as well as allowing filesystems (NFS/fuse/...) to
perform their own permission checking instead of the default.
CC: Ulrich Drepper <drepper@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- use kstrdup() instead of kmalloc() + memcpy()
- return NULL if allocating ->mnt_devname failed
- mnt_devname should be const
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* MAY_CHDIR is redundant - it's an equivalent of MAY_ACCESS
* MAY_ACCESS on fuse should affect only the last step of pathname resolution
* fchdir() and chroot() should pass MAY_ACCESS, for the same reason why
chdir() needs that.
* now that we pass MAY_ACCESS explicitly in all cases, LOOKUP_ACCESS can be
removed; it has no business being in nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... so we ought to pass MAY_CHDIR to vfs_permission() instead of having
it triggered on every step of preceding pathname resolution. LOOKUP_CHDIR
is killed by that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the unused mode parameter from vfs_symlink and callers.
Thanks to Tetsuo Handa for noticing.
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.
Clean up callers by passing in a file instead of a dentry.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* keep references to ctl_table_head and ctl_table in /proc/sys inodes
* grab the former during operations, use the latter for access to
entry if that succeeds
* have ->d_compare() check if table should be seen for one who does lookup;
that allows us to avoid flipping inodes - if we have the same name resolve
to different things, we'll just keep several dentries and ->d_compare()
will reject the wrong ones.
* have ->lookup() and ->readdir() scan the table of our inode first, then
walk all ctl_table_header and scan ->attached_by for those that are
attached to our directory.
* implement ->getattr().
* get rid of insane amounts of tree-walking
* get rid of the need to know dentry in ->permission() and of the contortions
induced by that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In a sense, that's the heart of the series. It's based on the following
property of the trees we are actually asked to add: they can be split into
stem that is already covered by registered trees and crown that is entirely
new. IOW, if a/b and a/c/d are introduced by our tree, then a/c is also
introduced by it.
That allows to associate tree and table entry with each node in the union;
while directory nodes might be covered by many trees, only one will cover
the node by its crown. And that will allow much saner logics for /proc/sys
in the next patches. This patch introduces the data structures needed to
keep track of that.
When adding a sysctl table, we find a "parent" one. Which is to say,
find the deepest node on its stem that already is present in one of the
tables from our table set or its ancestor sets. That table will be our
parent and that node in it - attachment point. Add our table to list
anchored in parent, have it refer the parent and contents of attachment
point. Also remember where its crown lives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Refcount the sucker; instead of freeing it by the end of unregistration
just drop the refcount and free only when it hits zero. Make sure that
we _always_ make ->unregistering non-NULL in start_unregistering().
That allows anybody to get a reference to such puppy, preventing its
freeing and reuse. It does *not* block unregistration. Anybody who
holds such a reference can
* try to grab a "use" reference (ctl_head_grab()); that will
succeeds if and only if it hadn't entered unregistration yet. If it
succeeds, we can use it in all normal ways until we release the "use"
reference (with ctl_head_finish()). Note that this relies on having
->unregistering become non-NULL in all cases when one starts to unregister
the sucker.
* keep pointers to ctl_table entries; they *can* be freed if
the entire thing is unregistered. However, if ctl_head_grab() succeeds,
we know that unregistration had not happened (and will not happen until
ctl_head_finish()) and such pointers can be used safely.
IOW, now we can have inodes under /proc/sys keep references to ctl_table
entries, protecting them with references to ctl_table_header and
grabbing the latter for the duration of operations that require access
to ctl_table. That won't cause deadlocks, since unregistration will not
be stopped by mere keeping a reference to ctl_table_header.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New object: set of sysctls [currently - root and per-net-ns].
Contains: pointer to parent set, list of tables and "should I see this set?"
method (->is_seen(set)).
Current lists of tables are subsumed by that; net-ns contains such a beast.
->lookup() for ctl_table_root returns pointer to ctl_table_set instead of
that to ->list of that ctl_table_set.
[folded compile fixes by rdd for configs without sysctl]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As suggested by Patrick McHardy, introduce a __krealloc() that doesn't
free the original buffer to fix a double-free and use-after-free bug
introduced by me in netfilter that uses RCU.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Tested-by: Dieter Ries <clip2@gmx.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable support for digital audio processing capability.
This module may be used for special applications that require
cross connecting of bchannels, conferencing, dtmf decoding
echo cancelation, tone generation, and Blowfish encryption and
decryption.
It may use hardware features if available.
Signed-off-by: Karsten Keil <kkeil@suse.de>
For each card successfully added to the bus, create a subdirectory under
the host's debugfs root with information about the card.
At the moment, only a single file is added to the card directory for
all cards: "state". It reflects the "state" field in struct mmc_card,
indicating whether the card is present, readonly, etc.
For MMC and SD cards (not SDIO), another file is added: "status".
Reading this file will ask the card about its current status and
return it. This can be useful if the card just refuses to respond to
any commands, which might indicate that the card state is not what the
MMC core thinks it is (due to a missing stop command, for example.)
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
When CONFIG_DEBUG_FS is set, create a few files under /sys/kernel/debug
containing information about an mmc host's internal state. Currently,
just a single file is created, "ios", which contains information about
the current operating parameters for the bus (clock speed, bus width,
etc.)
Host drivers can add additional files and directories under the host's
root directory by passing the debugfs_root field in struct mmc_host as
the 'parent' parameter to debugfs_create_*.
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, AMD IOMMU: include amd_iommu_last_bdf in device initialization
x86: fix IBM Summit based systems' phys_cpu_present_map on 32-bit kernels
x86, RDC321x: remove gpio.h complications
x86, RDC321x: add to mach-default
crashdump: fix undefined reference to `elfcorehdr_addr'
flag parameters: fix compile error of sys_epoll_create1
The following functions can now become static:
- rtc_interrupt()
- rtc_get_rtc_time()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Bernhard Walle <bwalle@suse.de>
Acked-by: Paul Gortmaker <p_gortmaker@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the following needlessly global code static:
- swap_lock
- nr_swapfiles
- struct swap_list
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the needlessly global print_bad_pte() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the following needlessly global functions static:
- percpu_depopulate()
- __percpu_depopulate_mask()
- percpu_populate()
- __percpu_populate_mask()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the new function task_current_syscall() on machines where the
asm/syscall.h interface is supported (CONFIG_HAVE_ARCH_TRACEHOOK). It's
exported for modules to use in the future. This function safely samples
the state of a blocked thread to collect what system call it is blocked
in, and the six system call argument registers.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This extends wait_task_inactive() with a new argument so it can be used in
a "soft" mode where it will check for the task changing state unexpectedly
and back off. There is no change to existing callers. This lays the
groundwork to allow robust, noninvasive tracing that can try to sample a
blocked thread but back off safely if it wakes up.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds asm-generic/syscall.h, which documents what a real
asm-ARCH/syscall.h file should define. This is not used yet, but will
provide all the machine-dependent details of examining a user system call
about to begin, in progress, or just ended.
Each arch should add an asm-ARCH/syscall.h that defines all the entry
points documented in asm-generic/syscall.h, as short inlines if possible.
This lets us write new tracing code that understands user system call
registers, without any new arch-specific work.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds tracehook.h inlines to enable a new arch feature in support of
user debugging/tracing. This is not used yet, but it lays the groundwork
for a debugger to be able to wrangle a task that's possibly running,
without interrupting its syscalls in progress.
Each arch should define TIF_NOTIFY_RESUME, and in their entry.S code treat
it much like TIF_SIGPENDING. That is, it causes you to take the slow path
when returning to user mode, where you get the full user-mode state
accessible as for signal handling or ptrace. The arch code should check
TIF_NOTIFY_RESUME after handling TIF_SIGPENDING. When it's set, clear it
and then call tracehook_notify_resume().
In future, tracing code will call set_notify_resume() when it wants to get
a callback in tracehook_notify_resume().
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines a new hook tracehook_force_sigpending() that lets tracing
code decide to force TIF_SIGPENDING on in recalc_sigpending().
This is not used yet, so it compiles away to nothing for now. It lays the
groundwork for new tracing code that can interrupt a task synthetically
without actually sending a signal.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the ptrace logic in task death (exit_notify) into tracehook.h
inlines. Some code is rearranged slightly to make things nicer. There is
no change, only cleanup.
There is one hook called with the tasklist_lock write-locked, as ptrace
needs. There is also a new hook called after exit_state changes and
without locks. This is a better place for tracing work to be in the
future, since it doesn't delay the whole system with locking.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines the tracehook_notify_jctl() hook to formalize the ptrace
effects on the job control notifications. There is no change, only
cleanup.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines the tracehook_get_signal() hook to allow tracing code to slip
in before normal signal dequeuing. This lays the groundwork for new
tracing features that can inject synthetic signals outside the normal
queue or control the disposition of delivered signals. The calling
convention lets tracehook_get_signal() decide both exactly what will
happen and what signal number to report in the handler/exit.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds standard tracehook.h inlines for arch code to call when
TIF_SYSCALL_TRACE has been set. This replaces having each arch implement
the ptrace guts for its syscall tracing support.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines tracehook_consider_fatal_signal() has a fine-grained hook for
deciding to skip the special cases for a fatal signal, as ptrace does.
There is no change, only cleanup.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines tracehook_consider_ignored_signal() has a fine-grained hook
for deciding to prevent the normal short-circuit of sending an ignored
signal, as ptrace does. There is no change, only cleanup.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This defines tracehook_signal_handler() as a hook for the arch signal
handling code to call. It gives ptrace the opportunity to stop for a
pseudo-single-step trap immediately after signal handler setup is done.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds tracehook_expect_breakpoints() as a formal hook for the nommu
code to use for its, "Is text-poking likely?" check at mmap time. This
names the actual semantics the code means to test, and documents it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the tracehook_tracer_task() hook to consolidate all forms of
"Who is using ptrace on me?" logic. This is used for "TracerPid:" in
/proc and for permission checks. We also clean up the selinux code the
called an identical accessor.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the ptrace-related logic from release_task into tracehook.h and
ptrace.h inlines. It provides clean hooks both before and after locking
tasklist_lock, for future tracing logic to do more cleanup without the
lock.
This also changes release_task() itself in the rare "zap_leader" case to
set the leader to EXIT_DEAD before iterating. This maintains the
invariant that release_task() only ever handles a task in EXIT_DEAD. This
is a common-sense invariant that is already always true except in this one
arcane case of zombie leader whose parent ignores SIGCHLD.
This change is harmless and only costs one store in this one rare case.
It keeps the expected state more consisently sane, which is nicer when
debugging weirdness in release_task(). It also lets some future code in
the tracehook entry points rely on this invariant for bookkeeping.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the PTRACE_EVENT_VFORK_DONE tracing into a tracehook.h inline,
tracehook_report_vfork_done(). The change has no effect, just clean-up.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves all the ptrace initialization and tracing logic for task
creation into tracehook.h and ptrace.h inlines. It reorganizes the code
slightly, but should not change any behavior.
There are four tracehook entry points, at each important stage of task
creation. This keeps the interface from the core fork.c code fairly
clean, while supporting the complex setup required for ptrace or something
like it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the PTRACE_EVENT_EXIT tracing into a tracehook.h inline,
tracehook_report_exec(). The change has no effect, just clean-up.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves all the ptrace hooks related to exec into tracehook.h inlines.
This also lifts the calls for tracing out of the binfmt load_binary hooks
into search_binary_handler() after it calls into the binfmt module. This
change has no effect, since all the binfmt modules' load_binary functions
did the call at the end on success, and now search_binary_handler() does
it immediately after return if successful. We consolidate the repeated
code, and binfmt modules no longer need to import ptrace_notify().
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series introduces the "tracehook" interface layer of inlines in
<linux/tracehook.h>. There are more details in the log entry for patch
01/23 and in the header file comments inside that patch. Most of these
changes move code around with little or no change, and they should not
break anything or change any behavior.
This sets a new standard for uniform arch support to enable clean
arch-independent implementations of new debugging and tracing stuff,
denoted by CONFIG_HAVE_ARCH_TRACEHOOK. Patch 20/23 adds that symbol to
arch/Kconfig, with comments listing everything an arch has to do before
setting "select HAVE_ARCH_TRACEHOOK". These are elaborted a bit at:
http://sourceware.org/systemtap/wiki/utrace/arch/HowTo
The new inlines that arch code must define or call have detailed kerneldoc
comments in the generic header files that say what is required.
No arch is obligated to do any work, and no arch's build should be broken
by these changes. There are several steps that each arch should take so
it can set HAVE_ARCH_TRACEHOOK. Most of these are simple. Providing this
support will let new things people add for doing debugging and tracing of
user-level threads "just work" for your arch in the future. For an arch
that does not provide HAVE_ARCH_TRACEHOOK, some new options for such
features will not be available for config.
I have done some arch work and will submit this to the arch maintainers
after the generic tracehook series settles in. For now, that work is
available in my GIT repositories, and in patch and mbox-of-patches form at
http://people.redhat.com/roland/utrace/2.6-current/
This paves the way for my "utrace" work, to be submitted later. But it is
not innately tied to that. I hope that the tracehook series can go in
soon regardless of what eventually does or doesn't go on top of it. For
anyone implementing any kind of new tracing/debugging plan, or just
understanding all the context of the existing ptrace implementation,
having tracehook.h makes things much easier to find and understand.
This patch:
This adds the new kernel-internal header file <linux/tracehook.h>. This
is not yet used at all. The comments in the header introduce what the
following series of patches is about.
The aim is to formalize and consolidate all the places that the core
kernel code and the arch code now ties into the ptrace implementation.
These patches mostly don't cause any functional change. They just move
the details of ptrace logic out of core code into tracehook.h inlines,
where they are mostly compiled away to the same as before. All that
changes is that everything is thoroughly documented and any future
reworking of ptrace, or addition of something new, would not have to touch
core code all over, just change the tracehook.h inlines.
The new linux/ptrace.h inlines are used by the following patches in the
new tracehook_*() inlines. Using these helpers for the ptrace event stops
makes it simple to change or disable the old ptrace implementation of
these stops conditionally later.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mapping->tree_lock has no read lockers. convert the lock from an rwlock
to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.
This can be done if get_page/put_page behaves consistently throughout the
whole tree (ie. if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).
Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page. We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.
This patch introduces the core locking protocol to the pagecache (ie.
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).
Thanks to Hugh for pointing out an improvement to the algorithm setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.
[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which
are used by lockless pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new get_user_pages_fast mm API, which is basically a
get_user_pages with a less general API (but still tends to be suited to
the common case):
- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas
This restricted API can be implemented in a much more scalable way on many
architectures when the ptes are present, by walking the page tables
locklessly (no mmap_sem or page table locks). When the ptes are not
populated, get_user_pages_fast() could be slower.
This is implemented locklessly on x86, and used in some key direct IO call
sites, in later patches, which provides nearly 10% performance improvement
on a threaded database workload.
Lots of other code could use this too, depending on use cases (eg. grep
drivers/). And it might inspire some new and clever ways to use it.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allows one to create and use a channel with no associated files. Files
can be initialized later. This is useful in scenarios such as logging in
early code, before VFS is up. Therefore, such channels can be created and
used as soon as kmem_cache_init() completed.
This is needed by kmemtrace to do tracing in early kernel code.
[kosaki.motohiro@jp.fujitsu.com: build fix]
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A previous patch added the early_initcall(), to allow a cleaner hooking of
pre-SMP initcalls. Now we remove the older interface, converting all
existing users to the new one.
[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: build fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added early initcall (pre-SMP) support, using an identical interface to
that of regular initcalls. Functions called from do_pre_smp_initcalls()
could be converted to use this cleaner interface.
This is required by CPU hotplug, because early users have to register
notifiers before going SMP. One such CPU hotplug user is the relay
interface with buffer-only channels, which needs to register such a
notifier, to be usable in early code. This in turn is used by kmemtrace.
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements devices state save/restore before after kexec.
This patch together with features in kexec_jump patch can be used for
following:
- A simple hibernation implementation without ACPI support. You can kexec a
hibernating kernel, save the memory image of original system and shutdown
the system. When resuming, you restore the memory image of original system
via ordinary kexec load then jump back.
- Kernel/system debug through making system snapshot. You can make system
snapshot, jump back, do some thing and make another system snapshot.
- Cooperative multi-kernel/system. With kexec jump, you can switch between
several kernels/systems quickly without boot process except the first time.
This appears like swap a whole kernel/system out/in.
- A general method to call program in physical mode (paging turning
off). This can be used to invoke BIOS code under Linux.
The following user-space tools can be used with kexec jump:
- kexec-tools needs to be patched to support kexec jump. The patches
and the precompiled kexec can be download from the following URL:
source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
- makedumpfile with patches are used as memory image saving tool, it
can exclude free pages from original kernel memory image file. The
patches and the precompiled makedumpfile can be download from the
following URL:
source: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-src_cvs_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-patches_cvs_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile_cvs_kh10
- An initramfs image can be used as the root file system of kexeced
kernel. An initramfs image built with "BuildRoot" can be downloaded
from the following URL:
initramfs image: http://khibernation.sourceforge.net/download/release_v10/initramfs/rootfs_cvs_kh10.gz
All user space tools above are included in the initramfs image.
Usage example of simple hibernation:
1. Compile and install patched kernel with following options selected:
CONFIG_X86_32=y
CONFIG_RELOCATABLE=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PM=y
CONFIG_HIBERNATION=y
CONFIG_KEXEC_JUMP=y
2. Build an initramfs image contains kexec-tool and makedumpfile, or
download the pre-built initramfs image, called rootfs.gz in
following text.
3. Prepare a partition to save memory image of original kernel, called
hibernating partition in following text.
4. Boot kernel compiled in step 1 (kernel A).
5. In the kernel A, load kernel compiled in step 1 (kernel B) with
/sbin/kexec. The shell command line can be as follow:
/sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x100000
--mem-max=0xffffff --initrd=rootfs.gz
6. Boot the kernel B with following shell command line:
/sbin/kexec -e
7. The kernel B will boot as normal kexec. In kernel B the memory
image of kernel A can be saved into hibernating partition as
follow:
jump_back_entry=`cat /proc/cmdline | tr ' ' '\n' | grep kexec_jump_back_entry | cut -d '='`
echo $jump_back_entry > kexec_jump_back_entry
cp /proc/vmcore dump.elf
Then you can shutdown the machine as normal.
8. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as
root file system.
9. In kernel C, load the memory image of kernel A as follow:
/sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf
10. Jump back to the kernel A as follow:
/sbin/kexec -e
Then, kernel A is resumed.
Implementation point:
To support jumping between two kernels, before jumping to (executing)
the new kernel and jumping back to the original kernel, the devices
are put into quiescent state, and the state of devices and CPU is
saved. After jumping back from kexeced kernel and jumping to the new
kernel, the state of devices and CPU are restored accordingly. The
devices/CPU state save/restore code of software suspend is called to
implement corresponding function.
Known issues:
- Because the segment number supported by sys_kexec_load is limited,
hibernation image with many segments may not be load. This is
planned to be eliminated by adding a new flag to sys_kexec_load to
make a image can be loaded with multiple sys_kexec_load invoking.
Now, only the i386 architecture is supported.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides an enhancement to kexec/kdump. It implements the
following features:
- Backup/restore memory used by the original kernel before/after
kexec.
- Save/restore CPU state before/after kexec.
The features of this patch can be used as a general method to call program in
physical mode (paging turning off). This can be used to call BIOS code under
Linux.
kexec-tools needs to be patched to support kexec jump. The patches and
the precompiled kexec can be download from the following URL:
source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
Usage example of calling some physical mode code and return:
1. Compile and install patched kernel with following options selected:
CONFIG_X86_32=y
CONFIG_KEXEC=y
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y
2. Build patched kexec-tool or download the pre-built one.
3. Build some physical mode executable named such as "phy_mode"
4. Boot kernel compiled in step 1.
5. Load physical mode executable with /sbin/kexec. The shell command
line can be as follow:
/sbin/kexec --load-preserve-context --args-none phy_mode
6. Call physical mode executable with following shell command line:
/sbin/kexec -e
Implementation point:
To support jumping without reserving memory. One shadow backup page (source
page) is allocated for each page used by kexeced code image (destination
page). When do kexec_load, the image of kexeced code is loaded into source
pages, and before executing, the destination pages and the source pages are
swapped, so the contents of destination pages are backupped. Before jumping
to the kexeced code image and after jumping back to the original kernel, the
destination pages and the source pages are swapped too.
C ABI (calling convention) is used as communication protocol between
kernel and called code.
A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to
indicate that the loaded kernel image is used for jumping back.
Now, only the i386 architecture is supported.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases it may be desirable to ensure that associated driver is not
going to access the media in some period of time. "start" and "stop"
methods are provided therefore to allow it.
Signed-off-by: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some controllers (Jmicron, for instance) can report temporal failure
condition during power-on. It is desirable to account for this using a
return value of "set_param" device method. The return value can also be
handy to distinguish between supported and unsupported device parameters
in run time.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds proper externs for parport_default_timeslice and
parport_default_spintime in include/linux/parport.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device
argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix this, on avr32:
include/linux/utsname.h:35,
from init/main.c:20:
include/linux/sched.h: In function 'arch_pick_mmap_layout':
include/linux/sched.h:2149: error: implicit declaration of function 'PAGE_ALIGN'
Reported-by: Adrian Bunk <bunk@kernel.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an arch doesn't define cpumask_of_cpu_map, create a generic
statically-initialized one for them. This allows removal of the buggy
cpumask_of_cpu() macro (&cpumask_of_cpu() gives address of
out-of-scope var).
An arch with NR_CPUS of 4096 probably wants to allocate this itself
based on the actual number of CPUs, since otherwise they're using 2MB
of rodata (1024 cpus means 128k). That's what
CONFIG_HAVE_CPUMASK_OF_CPU_MAP is for (only x86/64 does so at the
moment).
In future as we support more CPUs, we'll need to resort to a
get_cpu_map()/put_cpu_map() allocation scheme.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Calculating the number of pages from given address and length numbers is a task
required in multiple IOMMU implementations. So implement this as a generic
function into the IOMMU helper code.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix build bug introduced by 95b68dec0d "calgary iommu: use the first
kernels TCE tables in kdump":
arch/x86/kernel/built-in.o: In function `calgary_iommu_init':
(.init.text+0x8399): undefined reference to `elfcorehdr_addr'
arch/x86/kernel/built-in.o: In function `calgary_iommu_init':
(.init.text+0x856c): undefined reference to `elfcorehdr_addr'
arch/x86/kernel/built-in.o: In function `detect_calgary':
(.init.text+0x8c68): undefined reference to `elfcorehdr_addr'
arch/x86/kernel/built-in.o: In function `detect_calgary':
(.init.text+0x8d0c): undefined reference to `elfcorehdr_addr'
make elfcorehdr_addr a generally available symbol.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds support for populating an SPI bus based on data in the
OF device tree. This is useful for powerpc platforms which use the
device tree instead of discrete code for describing platform layout.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
spi_new_device() allocates and registers an spi device all in one swoop.
If the driver needs to add extra data to the spi_device before it is
registered, then this causes problems. This is needed for OF device
tree support so that the SPI device tree helper can add a pointer to
the device node after the device is allocated, but before the device
is registered. OF aware SPI devices can then retrieve data out of the
device node to populate a platform data structure.
This patch splits the allocation and registration portions of code out
of spi_new_device() and creates two new functions; spi_alloc_device()
and spi_register_device(). spi_new_device() is modified to use the new
functions for allocation and registration. None of the existing users
of spi_new_device() should be affected by this change.
Drivers using the new API can forego the use of spi_board_info
structure to describe the device layout and populate data into the
spi_device structure directly.
This change is in preparation for adding an OF device tree parser to
generate spi_devices based on data in the device tree.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
SPI has a similar problem as I2C in that it needs to determine an
appropriate modalias value for each device node. This patch adapts
the of_i2c of_find_i2c_driver() function to be usable by of_spi also.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Introduced by commit aaca0bdca5 ("flag
parameters: paccept"):
net/socket.c:1515:17: error: symbol 'sys_paccept' redeclared with different type (originally declared at include/linux/syscalls.h:413) - incompatible argument 4 (different address spaces)
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These got unintentionally moved, put them back as x86 provides its own
versions.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the dummy asm/kvm.h files on architectures not (yet)
supporting KVM and uses the same conditional headers installation as
already used for a.out.h .
Also removed are superfluous install rules in the s390 and x86 Kbuild
files (they are already in Kbuild.asm).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
powerpc: Wireup new syscalls
Move update_mmu_cache() declaration from tlbflush.h to pgtable.h
powerpc/pseries: Remove kmalloc call in handling writes to lparcfg
powerpc/pseries: Update arch vector to indicate support for CMO
ibmvfc: Add support for collaborative memory overcommit
ibmvscsi: driver enablement for CMO
ibmveth: enable driver for CMO
ibmveth: Automatically enable larger rx buffer pools for larger mtu
powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O
powerpc/pseries: vio bus support for CMO
powerpc/pseries: iommu enablement for CMO
powerpc/pseries: Add CMO paging statistics
powerpc/pseries: Add collaborative memory manager
powerpc/pseries: Utilities to set firmware page state
powerpc/pseries: Enable CMO feature during platform setup
powerpc/pseries: Split retrieval of processor entitlement data into a helper routine
powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg
powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines
powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg
powerpc: Fix compile error with binutils 2.15
...
Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually.
Preliminary support for the Intel 5100 MCH. CE and UE errors are reported
along with the current DIMM label information and other memory parameters.
Reasons why this is preliminary:
1) This chip has 2 independent memory controllers which, for best
perforance, use interleaved accesses to the DDR2 memory. This
architecture does not map very well to the current edac data structures
which depend on symmetric channel access to the interleaved data.
Without core changes, the best I could do for now is to map both memory
controllers to different csrows (first all ranks of controller 0, then
all ranks of controller 1). Someone much more familiar with the edac
core than I will probably need to come up with a more general data
structure to handle the interleaving and de-interleaving of the two
memory controllers.
2) I have not yet tackled the de-interleaving of the rank/controller
address space into the physical address space of the CPU. There is
nothing fundamentally missing, it is just ending up to be a lot of
code, and I'd rather keep it separate for now, esp since it doesn't
work yet...
3) The code depends on a particular i5100 chip select to DIMM mainboard
chip select mapping. This mapping seems obvious to me in order to
support dual and single ranked memory, but it is not unique and DIMM
labels could be wrong on other mainboards. There is no way to query
this mapping that I know of.
4) The code requires that the i5100 is in 32GB mode. Only 4 ranks per
controller, 2 ranks per DIMM are supported. I do not have hardware
(nor do I expect to have hardware anytime soon) for the 48GB (6 ranks
per controller) mode.
5) The serial presence detect code should be broken out into a "real"
i2c driver so that decode-dimms.pl can work.
Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the get_parent export operation by sending a LOOKUP request with
".." as the name.
Implement looking up an inode by node ID after it has been evicted from
the cache. This is done by seding a LOOKUP request with "." as the name
(for all file types, not just directories).
The filesystem can set the FUSE_EXPORT_SUPPORT flag in the INIT reply, to
indicate that it supports these special lookups.
Thanks to John Muir for the original implementation of this feature.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a special error value FILE_LOCK_DEFERRED to mean that a locking
operation returned asynchronously. This is returned by
posix_lock_file() for sleeping locks to mean that the lock has been
queued on the block list, and will be woken up when it might become
available and needs to be retried (either fl_lmops->fl_notify() is
called or fl_wait is woken up).
f_op->lock() to mean either the above, or that the filesystem will
call back with fl_lmops->fl_grant() when the result of the locking
operation is known. The filesystem can do this for sleeping as well
as non-sleeping locks.
This is to make sure, that return values of -EAGAIN and -EINPROGRESS by
filesystems are not mistaken to mean an asynchronous locking.
This also makes error handling in fs/locks.c and lockd/svclock.c slightly
cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add members for memory reclaim delay to taskstats, and accumulate them in
__delayacct_add_tsk() .
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes, application responses become bad under heavy memory load.
Applications take a bit time to reclaim memory. The statistics, how long
memory reclaim takes, will be useful to measure memory usage.
This patch adds accounting memory reclaim to per-task-delay-accounting for
accounting the time of do_try_to_free_pages().
<i.e>
- When System is under low memory load,
memory reclaim may not occur.
$ free
total used free shared buffers cached
Mem: 8197800 1577300 6620500 0 4808 1516724
-/+ buffers/cache: 55768 8142032
Swap: 16386292 0 16386292
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 5069748 10612 3014060 0 0 0 0 3 26 0 0 100 0
0 0 0 5069748 10612 3014060 0 0 0 0 4 22 0 0 100 0
0 0 0 5069748 10612 3014060 0 0 0 0 3 18 0 0 100 0
Measure the time of tar command.
$ ls -s test.dat
1501472 test.dat
$ time tar cvf test.tar test.dat
real 0m13.388s
user 0m0.116s
sys 0m5.304s
$ ./delayget -d -p <pid>
CPU count real total virtual total delay total
428 5528345500 5477116080 62749891
IO count delay total
338 8078977189
SWAP count delay total
0 0
RECLAIM count delay total
0 0
- When system is under heavy memory load
memory reclaim may occur.
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 7159032 49724 1812 3012 0 0 0 0 3 24 0 0 100 0
0 0 7159032 49724 1812 3012 0 0 0 0 4 24 0 0 100 0
0 0 7159032 49848 1812 3012 0 0 0 0 3 22 0 0 100 0
In this case, one process uses more 8G memory
by execution of malloc() and memset().
$ time tar cvf test.tar test.dat
real 1m38.563s <- increased by 85 sec
user 0m0.140s
sys 0m7.060s
$ ./delayget -d -p <pid>
CPU count real total virtual total delay total
9021 7140446250 7315277975 923201824
IO count delay total
8965 90466349669
SWAP count delay total
3 21036367
RECLAIM count delay total
740 61011951153
In the later case, the value of RECLAIM is increasing.
So, taskstats can show how much memory reclaim influences TAT.
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate
parent I/O statistics in /proc/pid/io. This approach follows the same
model used to account per-process and per-thread CPU times.
As a practial application, this allows for example to quickly find the top
I/O consumer when a process spawns many child threads that perform the
actual I/O work, because the aggregated I/O statistics can always be found
in /proc/pid/io.
[ Oleg Nesterov points out that we should check that the task is still
alive before we iterate over the threads, but also says that we can do
that fixup on top of this later. - Linus ]
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Cc: Matt Heaton <matt@hostmonster.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Acked-by-with-comments: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate the structure on the first call to sys_acct(). After this each
namespace, that ordered the accounting, will live with this structure till
its own death.
Two notes
- routines, that close the accounting on fs umount time use
the init_pid_ns's acct by now;
- accounting routine accounts to dying task's namespace
(also by now).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the bsdacct-related info will be stored in the area, pointer by this
one.
It will be NULL automatically for all new namespaces.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adapt acct_update_integrals() to include user time when calculating the time
difference. The units of acct_rss_mem1 and acct_vm_mem1 are also changed from
pages-jiffies to pages-usecs to avoid calling jiffies_to_usecs() in
xacct_add_tsk() which might overflow.
Signed-off-by: Jonathan Lim <jlim@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It seems to me that it was a mistake marking this function as deprecated
and scheduling it for removal, rather than resolutely removing it after
the last caller's death.
Anyway - better late, then never.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This one had the only users so far - the kill_proc, which is removed, so
drop this (invalid in namespaced world) call too.
And of course - erase all references on it from comments.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function operated on a pid_t to kill a task, which is no longer valid
in a containerized system.
It has finally lost all its users and we can safely remove it from the
tree.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When struct pid is built on a 64 bit platform gcc has to insert padding to
maintain the correct alignment, by simply reordering its members the
memory usage shrinks from 88 bytes to 80.
I've successfully run with this patch on my desktop AMD64 machine.
There are no significant kernel size changes to a default config.X86_64
on the latest git v2.6.26-rc1
text data bss dec hex filename
5404828 976760 734280 7115868 6c945c vmlinux
5404811 976760 734280 7115851 6c944b vmlinux.pid-patch
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds proper prototypes for pid{hash,map}_init() in
include/linux/pid_namespace.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current two-stage scheme of removing PDE emphasizes one bug in proc:
open
rmmod
remove_proc_entry
close
->release won't be called because ->proc_fops were cleared. In simple
cases it's small memory leak.
For every ->open, ->release has to be done. List of openers is introduced
which is traversed at remove_proc_entry() if neeeded.
Discussions with Al long ago (sigh).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the extern of struct proc_kmsg_operations to
fs/proc/internal.h and adds an #include "internal.h" to fs/proc/kmsg.c
so that the latter sees the former.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch proposes an alternative to the "magical
positive-versus-negative number trick" Andrew complained about last week
in http://lkml.org/lkml/2008/6/24/418.
This had been introduced with the patches that scale msgmni to the amount
of lowmem. With these patches, msgmni has a registered notification
routine that recomputes msgmni value upon memory add/remove or ipc
namespace creation/ removal.
When msgmni is changed from user space (i.e. value written to the proc
file), that notification routine is unregistered, and the way to make it
registered back is to write a negative value into the proc file. This is
the "magical positive-versus-negative number trick".
To fix this, a new proc file is introduced: /proc/sys/kernel/auto_msgmni.
This file acts as ON/OFF for msgmni automatic recomputing.
With this patch, the process is the following:
1) kernel boots in "automatic recomputing mode"
/proc/sys/kernel/msgmni contains the value that has been computed (depends
on lowmem)
/proc/sys/kernel/automatic_msgmni contains "1"
2) echo <val> > /proc/sys/kernel/msgmni
. sets msg_ctlmni to <val>
. de-activates automatic recomputing (i.e. if, say, some memory is added
msgmni won't be recomputed anymore)
. /proc/sys/kernel/automatic_msgmni now contains "0"
3) echo "0" > /proc/sys/kernel/automatic_msgmni
. de-activates msgmni automatic recomputing
this has the same effect as 2) except that msg_ctlmni's value stays
blocked at its current value)
3) echo "1" > /proc/sys/kernel/automatic_msgmni
. recomputes msgmni's value based on the current available memory size
and number of ipc namespaces
. re-activates automatic recomputing for msgmni.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Solofo Ramangalahy <Solofo.Ramangalahy@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The attached patch:
- reverses the locking order of ulp->lock and sem_lock:
Previously, it was first ulp->lock, then inside sem_lock.
Now it's the other way around.
- converts the undo structure to rcu.
Benefits:
- With the old locking order, IPC_RMID could not kfree the undo structures.
The stale entries remained in the linked lists and were released later.
- The patch fixes a a race in semtimedop(): if both IPC_RMID and a semget() that
recreates exactly the same id happen between find_alloc_undo() and sem_lock,
then semtimedop() would access already kfree'd memory.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make idr_find rcu-safe: it can now be called inside an rcu_read critical
section.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Reviewed-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Jim Houston <jim.houston@comcast.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do some code factorization in the return code analysis.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Jim Houston <jim.houston@comcast.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After scalability problems have been detected when using the sysV ipcs, I have
proposed to use an RCU based implementation of the IDR api instead (see
threads http://lkml.org/lkml/2008/4/11/212 and
http://lkml.org/lkml/2008/4/29/295).
This resulted in many people asking to convert the idr API and make it rcu
safe (because most of the code was duplicated and thus unmaintanable and
unreviewable).
So here is a first attempt.
The important change wrt to the idr API itself is during idr removes: idr
layers are freed after a grace period, instead of being moved to the free
list.
The important change wrt to ipcs, is that idr_find() can now be called
locklessly inside a rcu read critical section.
Here are the results I've got for the pmsg test sent by Manfred:
2.6.25-rc3-mm1 2.6.25-rc3-mm1+ 2.6.25-mm1 Patched 2.6.25-mm1
1 1168441 1064021 876000 947488
2 1094264 921059 1549592 1730685
3 2082520 1738165 1694370 2324880
4 2079929 1695521 404553 2400408
5 2898758 406566 391283 3246580
6 2921417 261275 263249 3752148
7 3308761 126056 191742 4243142
8 3329456 100129 141722 4275780
1st column: stock 2.6.25-rc3-mm1
2nd column: 2.6.25-rc3-mm1 + ipc patches (store ipcs into idrs)
3nd column: stock 2.6.25-mm1
4th column: 2.6.25-mm1 + this pacth series.
This patch:
Add an rcu_head to the idr_layer structure in order to free it after a grace
period.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Reviewed-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Jim Houston <jim.houston@comcast.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kdump kernel fails to boot with calgary iommu and aacraid driver on a x366
box. The ongoing dma's of aacraid from the first kernel continue to exist
until the driver is loaded in the kdump kernel. Calgary is initialized
prior to aacraid and creation of new tce tables causes wrong dma's to
occur. Here we try to get the tce tables of the first kernel in kdump
kernel and use them. While in the kdump kernel we do not allocate new tce
tables but instead read the base address register contents of calgary
iommu and use the tables that the registers point to. With these changes
the kdump kernel and hence aacraid now boots normally.
Signed-off-by: Chandru Siddalingappa <chandru@in.ibm.com>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
workqueue_cpu_callback(CPU_DEAD) flushes cwq->thread under
cpu_maps_update_begin(). This means that the multithreaded workqueues
can't use get_online_cpus() due to the possible deadlock, very bad and
very old problem.
Introduce the new state, CPU_POST_DEAD, which is called after
cpu_hotplug_done() but before cpu_maps_update_done().
Change workqueue_cpu_callback() to use CPU_POST_DEAD instead of CPU_DEAD.
This means that create/destroy functions can't rely on get_online_cpus()
any longer and should take cpu_add_remove_lock instead.
[akpm@linux-foundation.org: fix CONFIG_SMP=n]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of users of flush_workqueue() can be changed to use cancel_work_sync(),
but sometimes we really need to wait for the completion and cancelling is not
an option. schedule_on_each_cpu() is good example.
Add the new helper, flush_work(work), which waits for the completion of the
specific work_struct. More precisely, it "flushes" the result of of the last
queue_work() which is visible to the caller.
For example, this code
queue_work(wq, work);
/* WINDOW */
queue_work(wq, work);
flush_work(work);
doesn't necessary work "as expected". What can happen in the WINDOW above is
- wq starts the execution of work->func()
- the caller migrates to another CPU
now, after the 2nd queue_work() this work is active on the previous CPU, and
at the same time it is queued on another. In this case flush_work(work) may
return before the first work->func() completes.
It is trivial to add another helper
int flush_work_sync(struct work_struct *work)
{
return flush_work(work) || wait_on_work(work);
}
which works "more correctly", but it has to iterate over all CPUs and thus
it much slower than flush_work().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have core_state->dumper list we can use it to wake up the
sub-threads waiting for the coredump completion.
This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
lessens by sizeof(struct completion). Also, with this change we can
decouple exit_mm() from the coredumping code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
binfmt->core_dump() has to iterate over the all threads in system in order
to find the coredumping threads and construct the list using the
GFP_ATOMIC allocations.
With this patch each thread allocates the list node on exit_mm()'s stack and
adds itself to the list.
This allows us to do further changes:
- simplify ->core_dump()
- change exit_mm() to clear ->mm first, then wait for ->core_done.
this makes the coredumping process visible to oom_kill
- kill mm->core_done
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turn core_state->nr_threads into atomic_t and kill now unneeded
down_write(&mm->mmap_sem) in exit_mm().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move mm->core_waiters into "struct core_state" allocated on stack. This
shrinks mm_struct a little bit and allows further changes.
This patch mostly does s/core_waiters/core_state. The only essential
change is that coredump_wait() must clear mm->core_state before return.
The coredump_wait()'s path is uglified and .text grows by 30 bytes, this
is fixed by the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm->core_startup_done points to "struct completion startup_done" allocated
on the coredump_wait()'s stack. Introduce the new structure, core_state,
which holds this "struct completion". This way we can add more info
visible to the threads participating in coredump without enlarging
mm_struct.
No changes in affected .o files.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
No functional changes yet. But this allows us to do further
fixes/cleanups.
oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm(). The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races. With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:
/* The result must not be dereferenced !!! */
struct mm_struct *__get_task_mm(struct task_struct *tsk)
{
if (tsk->flags & PF_KTHREAD)
return NULL;
return tsk->mm;
}
Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
by INIT_TASK() and copied to the forked childs (we could set it in
kthreadd() along with PF_NOFREEZE instead).
daemonize() was changed as well. In that case testing of PF_KTHREAD is
racy, but daemonize() is hopeless anyway.
This flag is cleared in do_execve(), before search_binary_handler().
Probably not the best place, we can do this in exec_mmap() or in
start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
doesn't matter in practice, and if do_execve() fails kthread should die
soon.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace_stop() has some complicated checks to prevent the scheduling in the
TASK_TRACED state with the pending SIGKILL, but these checks are racy, and
they depend on arch_ptrace_stop_needed().
This patch assumes that the traced task should die asap if it was killed by
SIGKILL, in that case schedule()->signal_pending_state() has no reason to
ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty
special case.
Note: do_exit()->ptrace_notify() is special, the killed task can already
dequeue SIGKILL at this point. Another indication that fatal_signal_pending()
is not exactly right.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an interface to set limit. This is necessary to memory resource
controller because it shrinks usage at set limit.
Other controllers may not need this interface to shrink usage because
shrinking is not necessary or impossible.
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new call, mem_cgroup_shrink_usage() is added for shmem handling and
relacing non-standard usage of mem_cgroup_charge/uncharge.
Now, shmem calls mem_cgroup_charge() just for reclaim some pages from
mem_cgroup. In general, shmem is used by some process group and not for
global resource (like file caches). So, it's reasonable to reclaim pages
from mem_cgroup where shmem is mainly used.
[hugh@veritas.com: shmem_getpage release page sooner]
[hugh@veritas.com: mem_cgroup_shrink_usage css_put]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes page migration under memory controller to use a
different algorithm. (thanks to Christoph for new idea.)
Before:
- page_cgroup is migrated from an old page to a new page.
After:
- a new page is accounted , no reuse of page_cgroup.
Pros:
- We can avoid compliated lock depndencies and races in migration.
Cons:
- new param to mem_cgroup_charge_common().
- mem_cgroup_getref() is added for handling ref_cnt ping-pong.
This version simplifies complicated lock dependency in page migraiton
under memory resource controller.
new refcnt sequence is following.
a mapped page:
prepage_migration() ..... +1 to NEW page
try_to_unmap() ..... all refs to OLD page is gone.
move_pages() ..... +1 to NEW page if page cache.
remap... ..... all refs from *map* is added to NEW one.
end_migration() ..... -1 to New page.
page's mapcount + (page_is_cache) refs are added to NEW one.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup_clone creates a new cgroup with the pid of the task. This works
correctly for unshare, but for clone cgroup_clone is called from
copy_namespaces inside copy_process, which happens before the new pid is
created. As a result, the new cgroup was created with current's pid.
This patch:
1. Moves the call inside copy_process to after the new pid
is created
2. Passes the struct pid into ns_cgroup_clone (as it is not
yet attached to the task)
3. Passes a name from ns_cgroup_clone() into cgroup_clone()
so as to keep cgroup_clone() itself simpler
4. Uses pid_vnr() to get the process id value, so that the
pid used to name the new cgroup is always the pid as it
would be known to the task which did the cloning or
unsharing. I think that is the most intuitive thing to
do. This way, task t1 does clone(CLONE_NEWPID) to get
t2, which does clone(CLONE_NEWPID) to get t3, then the
cgroup for t3 will be named for the pid by which t2 knows
t3.
(Thanks to Dan Smith for finding the main bug)
Changelog:
June 11: Incorporate Paul Menage's feedback: don't pass
NULL to ns_cgroup_clone from unshare, and reduce
patch size by using 'nodename' in cgroup_clone.
June 10: Original version
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge Hallyn <serge@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Tested-by: Dan Smith <danms@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently res_counter_write() is a raw file handler even though it's
ultimately taking a number, since in some cases it wants to
pre-process the string when converting it to a number.
This patch converts res_counter_write() from a raw file handler to a
write_string() handler; this allows some of the boilerplate
copying/locking/checking to be removed, and simplies the cleanup path,
since these functions are now performed by the cgroups framework.
[lizf@cn.fujitsu.com: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch contains cleanups suggested by reviewers for the recent
write_string() patchset:
- pair cgroup_lock_live_group() with cgroup_unlock() in cgroup.c for
clarity, rather than directly unlocking cgroup_mutex.
- make the return type of cgroup_lock_live_group() a bool
- use a #define'd constant for the local buffer size in read/write functions
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds cgroup_release_agent_write() and cgroup_release_agent_show()
methods to handle writing/reading the path to a cgroup hierarchy's
release agent. As a result, cgroup_common_file_read() is now unnecessary.
As part of the change, a previously-tolerated race in
cgroup_release_agent() is avoided by copying the current
release_agent_path prior to calling call_usermode_helper().
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a write_string() method for cgroups control files. The
semantics are that a buffer is copied from userspace to kernelspace
and the handler function invoked on that buffer. The buffer is
guaranteed to be nul-terminated, and no longer than max_write_len
(defaulting to 64 bytes if unspecified). Later patches will convert
existing raw file write handlers in control group subsystems to use
this method.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes some extraneous spaces from method declarations in
struct cftype, to fit in with conventional kernel style.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ignoring their return values may result in counter underflow in the future -
when the value charged will be uncharged (or in "leaks" - when the value is
not uncharged).
This also prevents from using charging routines to decrement the
counter value (i.e. uncharge it) ;)
(Current code works OK with res_counter, however :) )
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes it may be useful for userspace to know (e.g. for some hosting
guys) that some user stopped exceeding his hardlimit or softlimit in
quotas. Implement sending of such events to userspace via quota netlink
protocol so that they don't have to poll for such events. Based on idea
and initial implementation by Vladislav Bogdanov.
Cc: Vladislav Bogdanov <slava@nsys.by>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move declarations of some macros, which should be in fact functions to
quotaops.h. This way they can be later converted to inline functions
because we can now use declarations from quota.h. Also add necessary
includes of quotaops.h to a few files.
[akpm@linux-foundation.org: fix JFS build]
[akpm@linux-foundation.org: fix UFS build]
[vegard.nossum@gmail.com: fix QUOTA=n build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjen Pool <arjenpool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make loop in sync_dquots() checking whether there's something to write
more readable, remove useless variable and macro info_any_dirty() which
is used only in this place.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup quotaops.h: Rename functions from uppercase to lowercase (and
define backward compatibility macros), move larger functions to dquot.c
and make them non-inline.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a new mount option ("tz=UTC") for DOS (vfat/msdos) filesystems,
allowing timestamps to be in coordinated universal time (UTC) rather than
local time in applications where doing this is advantageous.
In particular, portable devices that use fat/vfat (such as digital
cameras) can benefit from using UTC in their internal clocks, thus
avoiding daylight saving time errors and general time ambiguity issues.
The user of the device does not have to worry about changing the time when
moving from place or when daylight saving changes.
The new mount option, when set, disables the counter-adjustment that Linux
currently makes to FAT timestamp info in anticipation of the normal
userspace time zone correction. When used in this new mode, all daylight
saving time and time zone handling is done in userspace as is normal for
many other filesystems (like ext3). The default mode, which remains
unchanged, is still appropriate when mounting volumes written in Windows
(because of its use of local time).
I originally based this patch on one submitted last year by Paul Collins,
but I updated it to work with current source and changed variable/option
naming. Ogawa Hirofumi (who maintains these filesystems) and I discussed
this patch at length on lkml, and he suggested using the option name in
the attached version of the patch. Barry Bouwsma pointed out a good
addition to the patch as well.
Signed-off-by: Joe Peterson <joe@skyrush.com>
Signed-off-by: Paul Collins <paul@ondioline.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Barry Bouwsma <free_beer_for_all@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel struct dirent{,64} were different from the ones in
userspace.
Even worse, we exported the kernel ones to userspace.
But after the fat usages are fixed we can remove the conflicting
kernel versions.
Reviewed-by: H. Peter Anvin <hpa@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It has been impossible to set the option 'atari' of the MSDOS filesystem
for several years. Since nobody seems to have missed it, let's remove its
remains.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"struct dirent" is a kernel type here, but is a **different type** in
userspace! This means both the structure and the IOCTL number is wrong!
So, this adds new "struct __fat_dirent" to generate correct IOCTL number.
And kernel stuff moves to under __KERNEL__.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_commit_lock is a semaphore but uses it as if it were a mutex. This patch
converts it to a mutex.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_flush_sem is a semaphore but uses it as if it were a mutex. This patch
converts it to a mutex.
[akpm@linux-foundation.org: fix mutex_trylock retval treatment]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_lock is a semaphore but uses it as if it were a mutex. This patch converts
it to a mutex.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While fixing CONFIG_ leakages to the userspace kernel headers I ran into
CODA_FS_OLD_API.
After five years, are there still people using the old API left?
Especially considering that you have to choose at compile time which API
to support in the kernel (and distributions tend to offer the new API for
some time).
Jan: "The old API can definitely go. Around the time the new
interface went in there were some non-Coda userspace file system
implementations that took a while longer to convert to the new API,
but by now they all switched to the new interface or in some cases
to a FUSE-based solution."
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext3_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext3_orphan_get function.
This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: printk warning fix]
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typo in Hurd part of include/linux/ext2_fs.h
The ';' here is redundant or can even pose problem. This is actually not
used by the Linux kernel, but it is exposed in GNU/Hurd.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a driver supporting a family of I2C port expanders from Maxim,
which includes the MAX7319 and MAX7320-7327 chips.
[dbrownell@users.sourceforge.net: minor fixes]
Signed-off-by: Jack Ren <jack.ren@marvell.com>
Signed-off-by: Eric Miao <eric.miao@marvell.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Teach the mcp23s08 driver about a curious feature of these chips: up to
four of them can share the same chipselect, with the SPI signals wired in
parallel, by matching two bits in the first protocol byte against two
address lines on the chip.
This is handled by three software changes:
* Platform data now holds an array of per-chip structs, not
just one chip's address and pullup configuration.
* Probe() and remove() now use another level of structure,
wrapping an instance of the original structure for each
mcp23s08 chip sharing that chipselect.
* The HAEN bit is set, so that the hardware address bits can no
longer be ignored (boot firmware may not have enabled them).
The "one struct per chip" preserves the guts of the current code,
but platform_data will need minor changes.
OLD:
/* incorrect "slave" ID may not have mattered */
.slave = 3,
.pullups = BIT(3) | BIT(1) | BIT(0),
NEW:
/* slave address _must_ match chip's wiring */
.chip[3] = {
.is_present = true,
.pullups = BIT(3) | BIT(1) | BIT(0),
},
There's no change in how things _behave_ for spi_device nodes with a
single mcp23s08 chip. New multi-chip configurations assign GPIOs in
sequence, without holes. The spi_device just resembles a bigger
controller, but internally it has multiple gpio_chip instances.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a simple sysfs interface for GPIOs.
/sys/class/gpio
/export ... asks the kernel to export a GPIO to userspace
/unexport ... to return a GPIO to the kernel
/gpioN ... for each exported GPIO #N
/value ... always readable, writes fail for input GPIOs
/direction ... r/w as: in, out (default low); write high, low
/gpiochipN ... for each gpiochip; #N is its first GPIO
/base ... (r/o) same as N
/label ... (r/o) descriptive, not necessarily unique
/ngpio ... (r/o) number of GPIOs; numbered N .. N+(ngpio - 1)
GPIOs claimed by kernel code may be exported by its owner using a new
gpio_export() call, which should be most useful for driver debugging.
Such exports may optionally be done without a "direction" attribute.
Userspace may ask to take over a GPIO by writing to a sysfs control file,
helping to cope with incomplete board support or other "one-off"
requirements that don't merit full kernel support:
echo 23 > /sys/class/gpio/export
... will gpio_request(23, "sysfs") and gpio_export(23);
use /sys/class/gpio/gpio-23/direction to (re)configure it,
when that GPIO can be used as both input and output.
echo 23 > /sys/class/gpio/unexport
... will gpio_free(23), when it was exported as above
The extra D-space footprint is a few hundred bytes, except for the sysfs
resources associated with each exported GPIO. The additional I-space
footprint is about two thirds of the current size of gpiolib (!). Since
no /dev node creation is involved, no "udev" support is needed.
Related changes:
* This adds a device pointer to "struct gpio_chip". When GPIO
providers initialize that, sysfs gpio class devices become children of
that device instead of being "virtual" devices.
* The (few) gpio_chip providers which have such a device node have
been updated.
* Some gpio_chip drivers also needed to update their module "owner"
field ... for which missing kerneldoc was added.
* Some gpio_chips don't support input GPIOs. Those GPIOs are now
flagged appropriately when the chip is registered.
Based on previous patches, and discussion both on and off LKML.
A Documentation/ABI/testing/sysfs-gpio update is ready to submit once this
merges to mainline.
[akpm@linux-foundation.org: a few maintenance build fixes]
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Guennadi Liakhovetski <g.liakhovetski@pengutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently list of kretprobe instances are stored in kretprobe object (as
used_instances,free_instances) and in kretprobe hash table. We have one
global kretprobe lock to serialise the access to these lists. This causes
only one kretprobe handler to execute at a time. Hence affects system
performance, particularly on SMP systems and when return probe is set on
lot of functions (like on all systemcalls).
Solution proposed here gives fine-grain locks that performs better on SMP
system compared to present kretprobe implementation.
Solution:
1) Instead of having one global lock to protect kretprobe instances
present in kretprobe object and kretprobe hash table. We will have
two locks, one lock for protecting kretprobe hash table and another
lock for kretporbe object.
2) We hold lock present in kretprobe object while we modify kretprobe
instance in kretprobe object and we hold per-hash-list lock while
modifying kretprobe instances present in that hash list. To prevent
deadlock, we never grab a per-hash-list lock while holding a kretprobe
lock.
3) We can remove used_instances from struct kretprobe, as we can
track used instances of kretprobe instances using kretprobe hash
table.
Time duration for kernel compilation ("make -j 8") on a 8-way ppc64 system
with return probes set on all systemcalls looks like this.
cacheline non-cacheline Un-patched kernel
aligned patch aligned patch
===============================================================================
real 9m46.784s 9m54.412s 10m2.450s
user 40m5.715s 40m7.142s 40m4.273s
sys 2m57.754s 2m58.583s 3m17.430s
===========================================================
Time duration for kernel compilation ("make -j 8) on the same system, when
kernel is not probed.
=========================
real 9m26.389s
user 40m8.775s
sys 2m7.283s
=========================
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for adding the GPIO based I2C resources.
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Arnaud Patard <apatard@mandriva.com>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SM501 PCI card requires a dyanmic gpio allocation as the number of
cards is not known at compile time. Fixup the platform data and
registration to deal with this.
Acked-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Arnaud Patard <apatard@mandriva.com>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for exporting the GPIOs on the SM501 via gpiolib.
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Arnaud Patard <apatard@mandriva.com>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add callback to get or set the power control if the device has the sleep
connected to some form of GPIO.
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Arnaud Patard <apatard@mandriva.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All ratelimit user use same jiffies and burst params, so some messages
(callbacks) will be lost.
For example:
a call printk_ratelimit(5 * HZ, 1)
b call printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b will
will be supressed.
- rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks for
hints from andrew.
- Add WARN_ON_RATELIMIT, update rcupreempt.h
- remove __printk_ratelimit
- use __ratelimit in net_ratelimit
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the %p format string which already accounts for the padding you need
with a pointer type on a particular architecture.
Also replace the macro with a static inline function to match the rest of
the file.
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want to use WARN() as a variant of WARN_ON(), however a few drivers are
using WARN() internally. This patch renames these to WARNING() to avoid the
namespace clash. A few cases were defining but not using the thing, for those
cases I just deleted the definition.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Greg KH <greg@kroah.com>
Cc: Karsten Keil <kkeil@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove apparently obsolete content from init.h referring to gcc 2.9x
and to "no_module_init".
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently call_usermodehelper_setup() uses GFP_ATOMIC. but it can return
NULL _very_ easily.
GFP_ATOMIC is needed only when we can't sleep. and, GFP_KERNEL is robust
and better.
thus, I add gfp_mask argument to call_usermodehelper_setup().
So, its callers pass the gfp_t as below:
call_usermodehelper() and call_usermodehelper_keys():
depend on 'wait' argument.
call_usermodehelper_pipe():
always GFP_KERNEL because always run under process context.
orderly_poweroff():
pass to GFP_ATOMIC because may run under interrupt context.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Paul Menage" <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Build kernel/profile.o only if CONFIG_PROFILING is enabled.
This makes CONFIG_PROFILING=n kernels smaller.
As a bonus, some profile_tick() calls and one branch from schedule() are
now eliminated with CONFIG_PROFILING=n (but I doubt these are
measurable effects).
This patch changes the effects of CONFIG_PROFILING=n, but I don't think
having more than two choices would be the better choice.
This patch also adds the name of the first parameter to the prototypes
of profile_{hits,tick}() since I anyway had to add them for the dummy
functions.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the conditional surrounding the definition of list_add() from list.h
since, if you define CONFIG_DEBUG_LIST, the definition you will subsequently
pick up from lib/list_debug.c will be absolutely identical, at which point you
can remove that redundant definition from list_debug.c as well.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This header file has been unused for quite some time, and the
corresponding source files appear to have been removed back in commit
99eb8a550d ("Remove the arm26 port")
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Ian Molton <spyro@f2s.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There haave been several areas in the kernel where an int has been used for
flags in local_irq_save() and friends instead of a long. This can cause some
hard to debug problems on some architectures.
This patch adds a typecheck inside the irqsave and restore functions to flag
these cases.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Needed to fix up a recursive include snafu in
locking-add-typecheck-on-irqsave-and-friends-for-correct-flags.patch
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add fields for VLAN tag and insert VLAN tag flag to the control
section struct. These fields will be used for sending ethernet
packets.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Changeset 7fa897b91a ("ide: trivial sparse
annotations") created an IDE bootup regression on big-endian systems.
In drivers/ide/ide-iops.c, function ide_fixstring() we now have the
loop:
for (p = end ; p != s;)
be16_to_cpus((u16 *)(p -= 2));
which will never terminate on big-endian because in such
a configuration be16_to_cpus() evaluates to "do { } while (0)"
Therefore, always evaluate the arguments to nop endian transformation
operations.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables NAND subpage read functionality.
If upper layer drivers are requesting to read non page aligned data NAND
subpage-read functionality reads the only whose ECC regions which include
requested data when original code reads whole page.
This significantly improves performance in many cases.
Here are some digits :
UBI volume mount time
No subpage reads: 5.75 seconds
Subpage read patch: 2.42 seconds
Open/stat time for files on JFFS2 volume:
No subpage read 0m 5.36s
Subpage read 0m 2.88s
Signed-off-by Alexey Korolev <akorolev@infradead.org>
Acked-by: Artem Bityutskiy <dedekind@infradead.org>
Acked-by: Jörn Engel <joern@logfs.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>