There seems to be a regression in direct write path due to following
commit in for-2.6.33 branch of block tree.
commit 1af60fbd75
Author: Jeff Moyer <jmoyer@redhat.com>
Date: Fri Oct 2 18:56:53 2009 -0400
block: get rid of the WRITE_ODIRECT flag
Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets
the NOIDLE flag in bio and hence in request. This tells CFQ to not expect
more request from the queue and not idle on it (despite the fact that
queue's think time is less and it is not seeky).
So direct writers lose big time when competing with sequential readers.
Using fio, I have run one direct writer and two sequential readers and
following are the results with 2.6.32-rc7 kernel and with for-2.6.33
branch.
Test
====
1 direct writer and 2 sequential reader running simultaneously.
[global]
directory=/mnt/sdc/fio/
runtime=10
[seqwrite]
rw=write
size=4G
direct=1
[seqread]
rw=read
size=2G
numjobs=2
2.6.32-rc7
==========
direct writes: aggrb=2,968KB/s
readers : aggrb=101MB/s
for-2.6.33 branch
=================
direct write: aggrb=19KB/s
readers aggrb=137MB/s
This patch brings back the WRITE_ODIRECT flag, with the difference that we
don't set the BIO_RW_UNPLUG flag so that device is not unplugged after
submission of request and an explicit unplug from submitter is required.
That way we fix the jeff's issue of not enough merging taking place in aio
path as well as make sure direct writes get their fair share.
After the fix
=============
for-2.6.33 + fix
----------------
direct writes: aggrb=2,728KB/s
reads: aggrb=103MB/s
Thanks
Vivek
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Mtdblock driver doesn't call flush_dcache_page for pages in request. So,
this causes problems on architectures where the icache doesn't fill from
the dcache or with dcache aliases. The patch fixes this.
The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
pointless empty cache-thrashing loops on architectures for which
flush_dcache_page() is a no-op. Every architecture was provided with this
flush pages on architectires where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is
equal 1 or do nothing otherwise.
See "fix mtd_blkdevs problem with caches on some architectures" discussion
on LKML for more information.
Signed-off-by: Ilya Loginov <isloginov@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Peter Horton <phorton@bitbox.co.uk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The size of EFI GPT header is not static, but whole sector is
allocated for the header. The HeaderSize field must be greater
than 92 (= sizeof(struct gpt_header) and must be less than or
equal to the logical block size.
It means we have to read whole sector with the header, because the
header crc32 checksum is calculated according to HeaderSize.
For more details see UEFI standard (version 2.3, May 2009):
- 5.3.1 GUID Format overview, page 93
- Table 13. GUID Partition Table Header, page 96
Signed-off-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
While SSDs track block usage on a per-sector basis, RAID arrays often
have allocation blocks that are bigger. Allow the discard granularity
and alignment to be set and teach the topology stacking logic how to
handle them.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
sendfile(2) was reworked with the splice infrastructure, but it still
checks f_op.sendpage() instead of f_op.splice_write() wrongly. Although
if f_op.sendpage() exists, f_op.splice_write() always exists at the same
time currently, the assumption will be broken in future silently. This
patch also brings a side effect: sendfile(2) can work with any output
file. Some security checks related to f_op are added too.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit 451a9ebf accidentally broke bio_alloc() and bio_kmalloc() comments by
(almost) swapping them.
This patch fixes that, by placing the comments in the right place.
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In bio_put()'s comment, add bio_clone() to the list of functions that can
give you a bio reference.
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The xfs_quota returns ENOSYS when remove command is executed.
Reproducable with following steps.
# mount -t xfs -o uquota /dev/sda7 /mnt/mp1
# xfs_quota -x -c off -c remove
XFS_QUOTARM: Function not implemented.
The remove command is allowed during quotaoff, but xfs_fs_set_xstate()
checks whether quota is running, and it leads to ENOSYS.
To solve this problem, add a check for X_QUOTARM.
Signed-off-by: Ryota Yamauchi <r-yamauchi@vf.jp.nec.com>
Signed-off-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Commit bd16956599 seems
to have a slight regression where this code path:
if (!--searchdistance) {
/*
* Not in range - save last search
* location and allocate a new inode
*/
...
goto newino;
}
doesn't free the temporary cursor (tcur) that got dup'd in
this function.
This leaks an item in the xfs_btree_cur zone, and it's caught
on module unload:
===========================================================
BUG xfs_btree_cur: Objects remaining on kmem_cache_close()
-----------------------------------------------------------
It seems like maybe a single free at the end of the function might
be cleaner, but for now put a del_cursor right in this code block
similar to the handling in the rest of the function.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
backing-dev: ensure that a removed bdi no longer has super_block referencing it
block: use after free bug in __blkdev_get
block: silently error unsupported empty barriers too
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFSv4: The link() operation should return any delegation on the file
NFSv4: Fix two unbalanced put_rpccred() issues.
NFSv4: Fix a bug when the server returns NFS4ERR_RESOURCE
nfs: Panic when commit fails
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/ppc64: Use preempt_schedule_irq instead of preempt_schedule
powerpc: Minor cleanup to lib/Kconfig.debug
powerpc: Minor cleanup to sound/ppc/Kconfig
powerpc: Minor cleanup to init/Kconfig
powerpc: Limit memory hotplug support to PPC64 Book-3S machines
powerpc: Limit hugetlbfs support to PPC64 Book-3S machines
powerpc: Fix compile errors found by new ppc64e_defconfig
powerpc: Add a Book-3E 64-bit defconfig
powerpc/booke: Fix xmon single step on PowerPC Book-E
powerpc: Align vDSO base address
powerpc: Fix segment mapping in vdso32
powerpc/iseries: Remove compiler version dependent hack
powerpc/perf_events: Fix priority of MSR HV vs PR bits
powerpc/5200: Update defconfigs
drivers/serial/mpc52xx_uart.c: Use UPIO_MEM rather than SERIAL_IO_MEM
powerpc/boot/dts: drop obsolete 'fsl5200-clocking'
of: Remove nested function
mpc5200: support for the MAN mpc5200 based board mucmc52
mpc5200: support for the MAN mpc5200 based board uc101
A particular fsfuzzer run caused an hfs file system to crash on mount.
This is due to a corrupted MDB extent record causing a miscalculation of
HFS_I(inode)->first_blocks for the extent tree. If the extent records are
zereod out, it won't trigger the first_blocks special case. Instead it
falls through to the extent code which we're still in the middle of
initializing.
This patch catches the 0 size extent records, reports the corruption, and
fails the mount.
Reported-by: Ramon de Carvalho Valle <rcvalle@linux.vnet.ibm.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As found in <http://bugs.debian.org/550010>, hfsplus is using type u32
rather than sector_t for some sector number calculations.
In particular, hfsplus_get_block() does:
u32 ablock, dblock, mask;
...
map_bh(bh_result, sb, (dblock << HFSPLUS_SB(sb).fs_shift) + HFSPLUS_SB(sb).blockoffset + (iblock & mask));
I am not confident that I can find and fix all cases where a sector number
may be truncated. For now, avoid data loss by refusing to mount HFS+
volumes with more than 2^32 sectors (2TB).
[akpm@linux-foundation.org: fix 32 and 64-bit issues]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Eric Sesterhenn <snakebyte@gmx.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given such a long name, the kB count in /proc/meminfo's HardwareCorrupted
line is being shown too far right (it does align with x86_64's VmallocChunk
above, but I hope nobody will ever have that much corrupted!). Align it.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is no barrier support in the block device code. That
means we cannot guarantee any sort of data integerity when using the
block device node with dis kwrite caches enabled. Using the raw block
device node is a typical use case for virtualization (and I assume
databases, too). This patch changes block_fsync to issue a cache flush
and thus make fsync on block device nodes actually useful.
Note that in mainline we would also need to add such code to the
->aio_write method for O_SYNC handling, but assuming that Jan's patch
series for the O_SYNC rewrite goes in it will also call into ->fsync
for 2.6.32.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There's nothing block related about them, the backing device
is used by things like NFS etc as well. This gets rid of the
need to protect such calls by CONFIG_BLOCK.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Hi,
Some workloads issue batches of small I/O, and the performance is poor
due to the call to blk_run_address_space for every single iocb. Nathan
Roberts pointed this out, and suggested that by deferring this call
until all I/Os in the iocb array are submitted to the block layer, we
can realize some impressive performance gains (up to 30% for sequential
4k reads in batches of 16).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Hi,
The WRITE_ODIRECT flag is only used in one place, and that code path
happens to also call blk_run_address_space. The introduction of this
flag, then, could result in the device being unplugged twice for every
I/O.
Further, with the batching changes in the next patch, we don't want an
O_DIRECT write to imply a queue unplug.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The hugetlb dependencies presently depend on SUPERH && MMU while the
hugetlb page size definitions depend on CPU_SH4 or CPU_SH5. This
unfortunately allows SH-3 + MMU configurations to enable hugetlbfs
without a corresponding HPAGE_SHIFT definition, resulting in the build
blowing up.
As SH-3 doesn't support variable page sizes, we tighten up the
dependenies a bit to prevent hugetlbfs from being enabled. These days
we also have a shiny new SYS_SUPPORTS_HUGETLBFS, so switch to using
that rather than adding to the list of corner cases in fs/Kconfig.
Reported-by: Kristoffer Ericson <kristoffer.ericson@gmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
commit 0762b8bde9
(from 14 months ago) introduced a use-after-free bug which has just
recently started manifesting in my md testing.
I tried git bisect to find out what caused the bug to start
manifesting, and it could have been the recent change to
blk_unregister_queue (48c0d4d4c0) but the results were inconclusive.
This patch certainly fixes my symptoms and looks correct as the two
calls are now in the same order as elsewhere in that function.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commits 29fba38b (nfs41: lease renewal) and fc01cea9 (nfs41: sequence
operation) introduce a couple of put_rpccred() calls on credentials for
which there is no corresponding get_rpccred().
See http://bugzilla.kernel.org/show_bug.cgi?id=14249
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC 3530 states that when we recieve the error NFS4ERR_RESOURCE, we are not
supposed to bump the sequence number on OPEN, LOCK, LOCKU, CLOSE, etc
operations. The problem is that we map that error into EREMOTEIO in the XDR
layer, and so the NFSv4 middle-layer routines like seqid_mutating_err(),
and nfs_increment_seqid() don't recognise it.
The fix is to defer the mapping until after the middle layers have
processed the error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Actually pass the NFS_FILE_SYNC option to the server to avoid a
Panic in nfs_direct_write_complete() when a commit fails.
At the end of an nfs write, if the nfs commit fails, all the writes
will be rescheduled. They are supposed to be rescheduled as NFS_FILE_SYNC
writes, but the rpc_task structure is not completely intialized and so
the option is not passed. When the rescheduled writes complete, the
return indicates that they are NFS_UNSTABLE and we try to do another
commit. This leads to a Panic because the commit data structure pointer
was set to null in the initial (failed) commit attempt.
Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
dnotify: ignore FS_EVENT_ON_CHILD
inotify: fix coalesce duplicate events into a single event in special case
inotify: deprecate the inotify kernel interface
fsnotify: do not set group for a mark before it is on the i_list
Fix a (small) memory leak in one of the error paths of the NFS mount
options parsing code.
Regression introduced in 2.6.30 by commit a67d18f (NFS: load the
rpc/rdma transport module automatically).
Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a null pointer exception in pipe_rdwr_open() which
generates the stack trace:
> Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
> [<ffffffff802899a5>] pipe_rdwr_open+0x35/0x70
> [<ffffffff8028125c>] __dentry_open+0x13c/0x230
> [<ffffffff8028143d>] do_filp_open+0x2d/0x40
> [<ffffffff802814aa>] do_sys_open+0x5a/0x100
> [<ffffffff8021faf3>] sysenter_do_call+0x1b/0x67
The failure mode is triggered by an attempt to open an anonymous
pipe via /proc/pid/fd/* as exemplified by this script:
=============================================================
while : ; do
{ echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
PID=$!
OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
{ read PID REST ; echo $PID; } )
OUT="${OUT%% *}"
DELAY=$((RANDOM * 1000 / 32768))
usleep $((DELAY * 1000 + RANDOM % 1000 ))
echo n > /proc/$OUT/fd/1 # Trigger defect
done
=============================================================
Note that the failure window is quite small and I could only
reliably reproduce the defect by inserting a small delay
in pipe_rdwr_open(). For example:
static int
pipe_rdwr_open(struct inode *inode, struct file *filp)
{
msleep(100);
mutex_lock(&inode->i_mutex);
Although the defect was observed in pipe_rdwr_open(), I think it
makes sense to replicate the change through all the pipe_*_open()
functions.
The core of the change is to verify that inode->i_pipe has not
been released before attempting to manipulate it. If inode->i_pipe
is no longer present, return ENOENT to indicate so.
The comment about potentially using atomic_t for i_pipe->readers
and i_pipe->writers has also been removed because it is no longer
relevant in this context. The inode->i_mutex lock must be used so
that inode->i_pipe can be dealt with correctly.
Signed-off-by: Earl Chew <earl_chew@agilent.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we do rename a dir entry, like this:
rename("/tmp/ino7UrgoJ.rename1", "/tmp/ino7UrgoJ.rename2")
rename("/tmp/ino7UrgoJ.rename2", "/tmp/ino7UrgoJ")
The duplicate events should be coalesced into a single event. But those two
events do not be coalesced into a single event, due to some bad check in
event_compare(). It can not match the two NULL inodes as the same event.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify_add_mark is supposed to add a mark to the g_list and i_list and to
set the group and inode for the mark. fsnotify_destroy_mark_by_entry uses
the fact that ->group != NULL to know if this group should be destroyed or
if it's already been done.
But fsnotify_add_mark sets the group and inode before it actually adds the
mark to the i_list and g_list. This can result in a race in inotify, it
requires 3 threads.
sys_inotify_add_watch("file") sys_inotify_add_watch("file") sys_inotify_rm_watch([a])
inotify_update_watch()
inotify_new_watch()
inotify_add_to_idr()
^--- returns wd = [a]
inotfiy_update_watch()
inotify_new_watch()
inotify_add_to_idr()
fsnotify_add_mark()
^--- returns wd = [b]
returns to userspace;
inotify_idr_find([a])
^--- gives us the pointer from task 1
fsnotify_add_mark()
^--- this is going to set the mark->group and mark->inode fields, but will
return -EEXIST because of the race with [b].
fsnotify_destroy_mark()
^--- since ->group != NULL we call back
into inotify_freeing_mark() which calls
inotify_remove_from_idr([a])
since fsnotify_add_mark() failed we call:
inotify_remove_from_idr([a]) <------WHOOPS it's not in the idr, this could
have been any entry added later!
The fix is to make sure we don't set mark->group until we are sure the mark is
on the inode and fsnotify_add_mark will return success.
Signed-off-by: Eric Paris <eparis@redhat.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: always pin metadata in discard mode
Btrfs: enable discard support
Btrfs: add -o discard option
Btrfs: properly wait log writers during log sync
Btrfs: fix possible ENOSPC problems with truncate
Btrfs: fix btrfs acl #ifdef checks
Btrfs: streamline tree-log btree block writeout
Btrfs: avoid tree log commit when there are no changes
Btrfs: only write one super copy during fsync
sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.
Unfortunately it cannot currently be called from non-process context
because of its use of spin_lock which is sometimes taken with
interrupts enabled.
So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).
sysfs_get_open_dirent is (documented as being) only called from
process context, so it uses spin_lock_irq. Other places
use spin_lock_irqsave.
The usage for sysfs_notify_dirent in md_safemode_timeout was
introduced in 2.6.28, so this patch is suitable for that and more
recent kernels.
Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As device_move() and kobject_move() both handle a NULL destination,
sysfs_move_dir() should do this as well (again) and fall back to
sysfs_root in that case.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We have an optimization in btrfs to allow blocks to be
immediately freed if they were allocated in this transaction and never
written. Otherwise they are pinned and freed when the transaction
commits.
This isn't optimal for discard mode because immediately freeing
them means immediately discarding them. It is better to give the
block to the pinning code and letting the (slow) discard happen later.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The discard support code in btrfs currently is guarded by ifdefs for
BIO_RW_DISCARD, which is never defines as it's the name of an enum
memeber. Just remove the useless ifdefs to actually enable the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Enable discard by default is not a good idea given the the trim speed
of SSD prototypes we've seen, and the carecteristics for many high-end
arrays. Turn of discards by default and require the -o discard option
to enable them on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A recently fsync optimization make btrfs_sync_log skip calling
wait_for_writer in the single log writer case. This is incorrect
since the writer count can also be increased by btrfs_pin_log.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's a problem where we don't do any space reservation for truncates, which
can cause you to OOPs because you will be allowed to go off in the weeds a bit
since we don't account for the delalloc bytes that are created as a result of
the truncate.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
xfs_dqrele_inode calls xfs_iput to release the ilock and a reference
and then also calls IRELE which does a second decrement of the reference
count. This leads to a premature freeing of inodes when quotas were turned
off while the filesystem was mounted.
Thanks to Utako Kusaka for reporting the bug and provinding a good testcase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Utako Kusaka <u-kusaka@wm.jp.nec.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The btrfs acl code was #ifdefing for a define
that didn't exist. This correctly matches it
to the values used by the Kconfig file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>