Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c11ae035>] find_next_bit+0x55/0xb0
Call Trace:
[<c11addda>] cpumask_any_but+0x2a/0x70
[<c102396b>] flush_tlb_mm+0x2b/0x80
[<c1022705>] pud_populate+0x35/0x50
[<c10227ba>] pgd_alloc+0x9a/0xf0
[<c103a3fc>] mm_init+0xec/0x120
[<c103a7a3>] mm_alloc+0x53/0xd0
which was introduced by commit de03c72cfc ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask
Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd: make local functions static
NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
NFSD: Check status from nfsd4_map_bcts_dir()
NFSD: Remove setting unused variable in nfsd_vfs_read()
nfsd41: error out on repeated RECLAIM_COMPLETE
nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
nfsd v4.1 lOCKT clientid field must be ignored
nfsd41: add flag checking for create_session
nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
nfsd4: fix wrongsec handling for PUTFH + op cases
nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
nfsd4: introduce OPDESC helper
nfsd4: allow fh_verify caller to skip pseudoflavor checks
nfsd: distinguish functions of NFSD_MAY_* flags
svcrpc: complete svsk processing on cb receive failure
svcrpc: take advantage of tcp autotuning
SUNRPC: Don't wait for full record to receive tcp data
svcrpc: copy cb reply instead of pages
svcrpc: close connection if client sends short packet
svcrpc: note network-order types in svc_process_calldir
...
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
dm kcopyd: return client directly and not through a pointer
dm kcopyd: reserve fewer pages
dm io: use fixed initial mempool size
dm kcopyd: alloc pages from the main page allocator
dm kcopyd: add gfp parm to alloc_pl
dm kcopyd: remove superfluous page allocation spinlock
dm kcopyd: preallocate sub jobs to avoid deadlock
dm kcopyd: avoid pointless job splitting
dm mpath: do not fail paths after integrity errors
dm table: reject devices without request fns
dm table: allow targets to support discards internally
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Support for RPC over AF_LOCAL transports
SUNRPC: Remove obsolete comment
SUNRPC: Use AF_LOCAL for rpcbind upcalls
SUNRPC: Clean up use of curly braces in switch cases
NFS: Revert NFSROOT default mount options
SUNRPC: Rename xs_encode_tcp_fragment_header()
nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
nfs41: Correct offset for LAYOUTCOMMIT
NFS: nfs_update_inode: print current and new inode size in debug output
NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
NFSv4: Handle expired stateids when the lease is still valid
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
ACPI EC: remove redundant code
ACPI: Add D3 cold state
ACPI: processor: fix processor_physically_present in UP kernel
ACPI: Split out custom_method functionality into an own driver
ACPI: Cleanup custom_method debug stuff
ACPI EC: enable MSI workaround for Quanta laptops
ACPICA: Update to version 20110413
ACPICA: Execute an orphan _REG method under the EC device
ACPICA: Move ACPI_NUM_PREDEFINED_REGIONS to a more appropriate place
ACPICA: Update internal address SpaceID for DataTable regions
ACPICA: Add more methods eligible for NULL package element removal
ACPICA: Split all internal Global Lock functions to new file - evglock
ACPI: EC: add another DMI check for ASUS hardware
ACPI EC: remove dead code
ACPICA: Fix code divergence of global lock handling
ACPICA: Use acpi_os_create_lock interface
ACPI: osl, add acpi_os_create_lock interface
ACPI:Fix goto flows in thermal-sys
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
x86 idle: deprecate "no-hlt" cmdline param
x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
x86 idle floppy: deprecate disable_hlt()
x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
x86 idle: clarify AMD erratum 400 workaround
idle governor: Avoid lock acquisition to read pm_qos before entering idle
cpuidle: menu: fixed wrapping timers at 4.294 seconds
NFSv4.1 LAYOUTRETURN implementation
Currently, does not support layout-type payload encoding.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed recaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Non-rpc layout driver such as for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.
[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Add the pnfs_osd_xdr.h header
* defintions the pnfs_osd_layout structure including all it's
sub-types and constants.
* Declare the pnfs_osd_xdr_decode_layout API + all needed
inline helpers.
* Define the pnfs_osd_deviceaddr structure and all its subtypes and
constants.
* Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.
* Define the pnfs_osd_ioerr structure, its substructures and constants.
* Declare API for encoding of a pnfs_osd_ioerr into XDR stream.
* Define the pnfs_osd_layoutupdate structure and its substructures.
* Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.
[Remove server definitions]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reserve just the minimum of pages needed to process one job.
Because we allocate pages from page allocator, we don't need to reserve
a large number of pages. The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.
The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4. This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.
Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures. One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Permit a target to support discards regardless of whether or not all its
underlying devices do.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
page_get_storage_key() and page_set_storage_key() expect a page address
and not its page frame number. This got inconsistent with 2d42552d
"[S390] merge page_test_dirty and page_clear_dirty".
Result is that we read/write storage keys from random pages and do not
have a working dirty bit tracking at all.
E.g. SetPageUpdate() doesn't clear the dirty bit of requested pages, which
for example ext4 doesn't like very much and panics after a while.
Unable to handle kernel paging request at virtual user address (null)
Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 Not tainted 2.6.39-07551-g139f37f-dirty #152
Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
Krnl Code: 0000000000316b04: f0a0000407f4 srp 4(11,%r0),2036,0
0000000000316b0a: b9020022 ltgr %r2,%r2
0000000000316b0e: a7740015 brc 7,316b38
>0000000000316b12: e3d0c0000024 stg %r13,0(%r12)
0000000000316b18: 4120c010 la %r2,16(%r12)
0000000000316b1c: 4130d060 la %r3,96(%r13)
0000000000316b20: e340d0600004 lg %r4,96(%r13)
0000000000316b26: c0e50002b567 brasl %r14,36d5f4
Call Trace:
([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
[<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
[<00000000002daac2>] ext4_da_writepages+0x2da/0x504
[<00000000002597e8>] writeback_single_inode+0xf8/0x268
[<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
[<000000000025a700>] writeback_inodes_wb+0x80/0x168
[<000000000025aa92>] wb_writeback+0x2aa/0x324
[<000000000025abde>] wb_do_writeback+0xd2/0x274
[<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
[<00000000001737be>] kthread+0xa6/0xb0
[<000000000056c1da>] kernel_thread_starter+0x6/0xc
[<000000000056c1d4>] kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Last Breaking-Event-Address:
[<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
_SxW returns an Integer containing the lowest D-state supported in state
Sx. If OSPM has not indicated that it supports _PR3, then the value “3”
corresponds to D3. If it has indicated _PR3 support, the value “3”
represents D3hot and the value “4” represents D3cold.
Linux does set _OSC._PR3, so we should fix it to expect that _SxW can
return 4.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Usually, there are multiple processors defined in ACPI table, for
example
Scope (_PR)
{
Processor (CPU0, 0x00, 0x00000410, 0x06) {}
Processor (CPU1, 0x01, 0x00000410, 0x06) {}
Processor (CPU2, 0x02, 0x00000410, 0x06) {}
Processor (CPU3, 0x03, 0x00000410, 0x06) {}
}
processor_physically_present(...) will be called to check whether those
processors are physically present.
Currently we have below codes in processor_physically_present,
cpuid = acpi_get_cpuid(...);
if ((cpuid == -1) && (num_possible_cpus() > 1))
return false;
return true;
In UP kernel, acpi_get_cpuid(...) always return -1 and
num_possible_cpus() always return 1, so
processor_physically_present(...) always returns true for all passed in
processor handles.
This is wrong for UP processor or SMP processor running UP kernel.
This patch removes the !SMP version of acpi_get_cpuid(), so both UP and
SMP kernel use the same acpi_get_cpuid function.
And for UP kernel, only processor 0 is valid.
https://bugzilla.kernel.org/show_bug.cgi?id=16548https://bugzilla.kernel.org/show_bug.cgi?id=16357
Tested-by: Anton Kochkov <anton.kochkov@gmail.com>
Tested-by: Ambroz Bizjak <ambrop7@gmail.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Thanks to the reviews and comments by Rafael, James, Mark and Andi.
Here's version 2 of the patch incorporating your comments and also some
update to my previous patch comments.
I noticed that before entering idle state, the menu idle governor will
look up the current pm_qos target value according to the list of qos
requests received. This look up currently needs the acquisition of a
lock to access the list of qos requests to find the qos target value,
slowing down the entrance into idle state due to contention by multiple
cpus to access this list. The contention is severe when there are a lot
of cpus waking and going into idle. For example, for a simple workload
that has 32 pair of processes ping ponging messages to each other, where
64 cpu cores are active in test system, I see the following profile with
37.82% of cpu cycles spent in contention of pm_qos_lock:
- 37.82% swapper [kernel.kallsyms] [k]
_raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
- 95.65% pm_qos_request
menu_select
cpuidle_idle_call
- cpu_idle
99.98% start_secondary
A better approach will be to cache the updated pm_qos target value so
reading it does not require lock acquisition as in the patch below.
With this patch the contention for pm_qos_lock is removed and I saw a
2.2X increase in throughput for my message passing workload.
cc: stable@kernel.org
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Acked-by: mark gross <markgross@thegnar.org>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, asm: Clean up desc.h a bit
x86, amd: Do not enable ARAT feature on AMD processors below family 0x12
x86: Move do_page_fault()'s error path under unlikely()
x86, efi: Retain boot service code until after switching to virtual mode
x86: Remove unnecessary check in detect_ht()
x86: Reorder mm_context_t to remove x86_64 alignment padding and thus shrink mm_struct
x86, UV: Clean up uv_tlb.c
x86, UV: Add support for SGI UV2 hub chip
x86, cpufeature: Update CPU feature RDRND to RDRAND
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
rcu: Remove waitqueue usage for cpu, node, and boost kthreads
rcu: Avoid acquiring rcu_node locks in timer functions
atomic: Add atomic_or()
Documentation: Add statistics about nested locks
rcu: Decrease memory-barrier usage based on semi-formal proof
rcu: Make rcu_enter_nohz() pay attention to nesting
rcu: Don't do reschedule unless in irq
rcu: Remove old memory barriers from rcu_process_callbacks()
rcu: Add memory barriers
rcu: Fix unpaired rcu_irq_enter() from locking selftests
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
perf: Fix SIGIO handling
perf top: Don't stop if no kernel symtab is found
perf top: Handle kptr_restrict
perf top: Remove unused macro
perf events: initialize fd array to -1 instead of 0
perf tools: Make sure kptr_restrict warnings fit 80 col terms
perf tools: Fix build on older systems
perf symbols: Handle /proc/sys/kernel/kptr_restrict
perf: Remove duplicate headers
ftrace: Add internal recursive checks
tracing: Update btrfs's tracepoints to use u64 interface
tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
ftrace: Set ops->flag to enabled even on static function tracing
tracing: Have event with function tracer check error return
ftrace: Have ftrace_startup() return failure code
jump_label: Check entries limit in __jump_label_update
ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
scripts/tags.sh: Add magic for trace-events for etags too
scripts/tags.sh: Fix ctags for DEFINE_EVENT()
x86/ftrace: Fix compiler warning in ftrace.c
...
* 'for-usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci:
Intel xhci: Limit number of active endpoints to 64.
Intel xhci: Ignore spurious successful event.
Intel xhci: Support EHCI/xHCI port switching.
Intel xhci: Add PCI id for Panther Point xHCI host.
xhci: STFU: Be quieter during URB submission and completion.
xhci: STFU: Don't print event ring dequeue pointer.
xhci: STFU: Remove function tracing.
xhci: Don't submit commands when the host is dead.
xhci: Clear stopped_td when Stop Endpoint command completes.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (33 commits)
x86: poll waiting for I/OAT DMA channel status
maintainers: add dma engine tree details
dmaengine: add TODO items for future work on dma drivers
dmaengine: Add API documentation for slave dma usage
dmaengine/dw_dmac: Update maintainer-ship
dmaengine: move link order
dmaengine/dw_dmac: implement pause and resume in dwc_control
dmaengine/dw_dmac: Replace spin_lock* with irqsave variants and enable submission from callback
dmaengine/dw_dmac: Divide one sg to many desc, if sg len is greater than DWC_MAX_COUNT
dmaengine/dw_dmac: set residue as total len in dwc_tx_status if status is !DMA_SUCCESS
dmaengine/dw_dmac: don't call callback routine in case dmaengine_terminate_all() is called
dmaengine: at_hdmac: pause: no need to wait for FIFO empty
pch_dma: modify pci device table definition
pch_dma: Support new device ML7223 IOH
pch_dma: Support I2S for ML7213 IOH
pch_dma: Fix DMA setting issue
pch_dma: modify for checkpatch
pch_dma: fix dma direction issue for ML7213 IOH video-in
dmaengine: at_hdmac: use descriptor chaining help function
dmaengine: at_hdmac: implement pause and resume in atc_control
...
Fix up trivial conflict in drivers/dma/dw_dmac.c
* 'gpio/next' of git://git.secretlab.ca/git/linux-2.6:
gpio/pch_gpio: Support new device ML7223
gpio: make gpio_{request,free}_array gpio array parameter const
GPIO: OMAP: move to drivers/gpio
GPIO: OMAP: move register offset defines into <plat/gpio.h>
gpio: Convert gpio_is_valid to return bool
gpio: Move the s5pc100 GPIO to drivers/gpio
gpio: Move the s5pv210 GPIO to drivers/gpio
gpio: Move the exynos4 GPIO to drivers/gpio
gpio: Move to Samsung common GPIO library to drivers/gpio
gpio/nomadik: add function to read GPIO pull down status
gpio/nomadik: show all pins in debug
gpio: move Nomadik GPIO driver to drivers/gpio
gpio: move U300 GPIO driver to drivers/gpio
langwell_gpio: add runtime pm support
gpio/pca953x: Add support for pca9574 and pca9575 devices
gpio/cs5535: Show explicit dependency between gpio_cs5535 and mfd_cs5535
32bit and 64bit on x86 are tested and working. The rest I have looked
at closely and I can't find any problems.
setns is an easy system call to wire up. It just takes two ints so I
don't expect any weird architecture porting problems.
While doing this I have noticed that we have some architectures that are
very slow to get new system calls. cris seems to be the slowest where
the last system calls wired up were preadv and pwritev. avr32 is weird
in that recvmmsg was wired up but never declared in unistd.h. frv is
behind with perf_event_open being the last syscall wired up. On h8300
the last system call wired up was epoll_wait. On m32r the last system
call wired up was fallocate. mn10300 has recvmmsg as the last system
call wired up. The rest seem to at least have syncfs wired up which was
new in the 2.6.39.
v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall conflicts.
v7: ported to Linus's latest post 2.6.39 tree.
> arch/blackfin/include/asm/unistd.h | 3 ++-
> arch/blackfin/mach-common/entry.S | 1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>
Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.
Why xattr lookup on every write I hear you ask?
write wants to drop suid and security related xattrs that could set o
capabilities for executables. To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.
In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.
Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.
This patch adds a simple per inode to avoid this problem. We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.
I also used the same flag to avoid the suid check, although
that one is pretty cheap.
A file system can also set this flag when it creates the inode,
if it has a cheap way to do so. This is done for some common file systems
in followon patches.
With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.
v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
add a generic version.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise RT scheduler may confuse.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.infradead.org/mtd-2.6: (97 commits)
mtd: kill CONFIG_MTD_PARTITIONS
mtd: remove add_mtd_partitions, add_mtd_device and friends
mtd: convert remaining users to mtd_device_register()
mtd: samsung onenand: convert to mtd_device_register()
mtd: omap2 onenand: convert to mtd_device_register()
mtd: txx9ndfmc: convert to mtd_device_register()
mtd: tmio_nand: convert to mtd_device_register()
mtd: socrates_nand: convert to mtd_device_register()
mtd: sharpsl: convert to mtd_device_register()
mtd: s3c2410 nand: convert to mtd_device_register()
mtd: ppchameleonevb: convert to mtd_device_register()
mtd: orion_nand: convert to mtd_device_register()
mtd: omap2: convert to mtd_device_register()
mtd: nomadik_nand: convert to mtd_device_register()
mtd: ndfc: convert to mtd_device_register()
mtd: mxc_nand: convert to mtd_device_register()
mtd: mpc5121_nfc: convert to mtd_device_register()
mtd: jz4740_nand: convert to mtd_device_register()
mtd: h1910: convert to mtd_device_register()
mtd: fsmc_nand: convert to mtd_device_register()
...
Fixed up trivial conflicts in
- drivers/mtd/maps/integrator-flash.c: removed in ARM tree
- drivers/mtd/maps/physmap.c: addition of afs partition probe type
clashing with removal of CONFIG_MTD_PARTITIONS
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (45 commits)
ARM: 6945/1: Add unwinding support for division functions
ARM: kill pmd_off()
ARM: 6944/1: mm: allow ASID 0 to be allocated to tasks
ARM: 6943/1: mm: use TTBR1 instead of reserved context ID
ARM: 6942/1: mm: make TTBR1 always point to swapper_pg_dir on ARMv6/7
ARM: 6941/1: cache: ensure MVA is cacheline aligned in flush_kern_dcache_area
ARM: add sendmmsg syscall
ARM: 6863/1: allow hotplug on msm
ARM: 6832/1: mmci: support for ST-Ericsson db8500v2
ARM: 6830/1: mach-ux500: force PrimeCell revisions
ARM: 6829/1: amba: make hardcoded periphid override hardware
ARM: 6828/1: mach-ux500: delete SSP PrimeCell ID
ARM: 6827/1: mach-netx: delete hardcoded periphid
ARM: 6940/1: fiq: Briefly document driver responsibilities for suspend/resume
ARM: 6938/1: fiq: Refactor {get,set}_fiq_regs() for Thumb-2
ARM: 6914/1: sparsemem: fix highmem detection when using SPARSEMEM
ARM: 6913/1: sparsemem: allow pfn_valid to be overridden when using SPARSEMEM
at91: drop at572d940hf support
at91rm9200: introduce at91rm9200_set_type to specficy cpu package
at91: drop boot_params and PLAT_PHYS_OFFSET
...
gpio_{request,free}_array should not (and do not) modify the passed gpio
array, so make the parameter const.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Eric Miao <eric.y.miao@gmail.com>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
TI-RPC introduces the capability of performing RPC over AF_LOCAL
sockets. It uses this mainly for registering and unregistering
local RPC services securely with the local rpcbind, but we could
also conceivably use it as a generic upcall mechanism.
This patch provides a client-side only implementation for the moment.
We might also consider a server-side implementation to provide
AF_LOCAL access to NLM (for statd downcalls, and such like).
Autobinding is not supported on kernel AF_LOCAL transports at this
time. Kernel ULPs must specify the pathname of the remote endpoint
when an AF_LOCAL transport is created. rpcbind supports registering
services available via AF_LOCAL, so the kernel could handle it with
some adjustment to ->rpcbind and ->set_port. But we don't need this
feature for doing upcalls via well-known named sockets.
This has not been tested with ULPs that move a substantial amount of
data. Thus, I can't attest to how robust the write_space and
congestion management logic is.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net: Kill ratelimit.h dependency in linux/net.h
net: Add linux/sysctl.h includes where needed.
net: Kill ether_table[] declaration.
inetpeer: fix race in unused_list manipulations
atm: expose ATM device index in sysfs
IPVS: bug in ip_vs_ftp, same list heaad used in all netns.
bug.h: Move ratelimit warn interfaces to ratelimit.h
bonding: cleanup module option descriptions
net:8021q:vlan.c Fix pr_info to just give the vlan fullname and version.
net: davinci_emac: fix dev_err use at probe
can: convert to %pK for kptr_restrict support
net: fix ETHTOOL_SFEATURES compatibility with old ethtool_ops.set_flags
netfilter: Fix several warnings in compat_mtw_from_user().
netfilter: ipset: fix ip_set_flush return code
netfilter: ipset: remove unused variable from type_pf_tdel()
netfilter: ipset: Use proper timeout value to jiffies conversion
Ingo Molnar noticed that we have this unnecessary ratelimit.h
dependency in linux/net.h, which hid compilation problems from
people doing builds only with CONFIG_NET enabled.
Move this stuff out to a seperate net/net_ratelimit.h file and
include that in the only two places where this thing is needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Several networking headers were depending upon the implicit
linux/sysctl.h include they get when including linux/net.h
Add explicit includes.
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs:
Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file> to Documentation/security/<moved_file>.
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6:
[media] v1.88 DM04/QQBOX Move remote to use rc_core dvb-usb-remote
[media] Add missing include guard to header file
[media] Inlined functions should be static
[media] Remove invalid parameter description
[media] cpia2: fix warning about invalid trigraph sequence
[media] s5p-csis: Add missing dependency on PLAT_S5P
[media] gspca/kinect: wrap gspca_debug with GSPCA_DEBUG
[media] fintek-cir: new driver for Fintek LPC SuperIO CIR function
[media] uvcvideo: Connect video devices to media entities
[media] uvcvideo: Register subdevices for each entity
[media] uvcvideo: Register a v4l2_device
[media] add V4L2-PIX-FMT-SRGGB12 & friends to docbook
[media] Documentation/DocBook: Rename media fops xml files
[media] Media DocBook: fix validation errors
[media] wl12xx: g_volatile_ctrl fix: wrong field set
[media] fix kconfig dependency warning for VIDEO_TIMBERDALE
[media] dm1105: GPIO handling added, I2C on GPIO added, LNB control through GPIO reworked
[media] Add support for M-5MOLS 8 Mega Pixel camera ISP
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: (42 commits)
regulator: Fix _regulator_get_voltage if get_voltage callback is NULL
USB: TWL6025 allow different regulator name
REGULATOR: TWL6025: add support to twl-regulator
regulator: twl6030: do not write to _GRP for regulator disable
regulator: twl6030: do not write to _GRP for regulator enable
TPS65911: Comparator: Add comparator driver
TPS65911: Add support for added GPIO lines
GPIO: TPS65910: Move driver to drivers/gpio/
TPS65911: Add new irq definitions
regulator: tps65911: Add new chip version
MFD: TPS65910: Add support for TPS65911 device
regulator: Fix off-by-one value range checking for mc13xxx_regulator_get_voltage
regulator: mc13892: Fix voltage unit in test case.
regulator: Remove MAX8997_REG_BUCK1DVS/MAX8997_REG_BUCK2DVS/MAX8997_REG_BUCK5DVS macros
mfd: Fix off-by-one value range checking for tps65910_i2c_write
regulator: Only apply voltage constraints from consumers that set them
regulator: If we can't configure optimum mode we're always in the best one
regulator: max8997: remove useless code
regulator: Fix memory leak in max8998_pmic_probe failure path
regulator: Fix desc_id for tps65023/6507x/65910
...
* git://git.infradead.org/battery-2.6:
PXA: Use dev_pm_ops in z2_battery
ds2760_battery: Fix rated capacity of the hx4700 1800mAh battery
ds2760_battery: Fix indexing of the 4 active full EEPROM registers
power: Make test_power driver more dynamic.
bq27x00_battery: Name of cycle count property
max8903_charger: Add GENERIC_HARDIRQS as a dependency (fixes S390 build)
ARM: RX-51: Enable isp1704 power on/off
isp1704_charger: Allow board specific powering routine
gpio-charger: Add gpio_charger_resume
power_supply: Add driver for MAX8903 charger