I've bisected the deadlock when many small appends are done on jffs2 down to
this commit:
commit 6fe6900e1e
Author: Nick Piggin <npiggin@suse.de>
Date: Sun May 6 14:49:04 2007 -0700
mm: make read_cache_page synchronous
Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.
I didn't have a close look down the call chains, but this appears to fix 7
possible use-before-uptodate bugs in hfs, 2 in hfsplus, 1 in jfs, a few in
ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in
block2mtd. All depending on whether the filler is async and/or can return
with a !uptodate page.
It introduced a wait to read_cache_page, as well as a new
read_cache_page_async function, equivalent to the old read_cache_page,
which currently has no callers.
Switching jffs2_gc_fetch_page to read_cache_page_async for the old
behavior makes the deadlocks go away, but maybe reintroduces the
use-before-uptodate problem? I don't understand the mm/fs interaction
well enough to say.
[It's fine. dwmw2.]
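A minimal sketch of the switch described (mapping, index, filler and data
stand in for jffs2_gc_fetch_page's actual arguments):
	struct page *pg;
	/* read_cache_page_async() behaves like the pre-6fe6900e1e
	 * read_cache_page(): it does not wait for the page to become
	 * uptodate, which is what avoids the append deadlock. */
	pg = read_cache_page_async(mapping, index, filler, data);
	if (IS_ERR(pg))
		return pg;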
Signed-off-by: Jason Lunz <lunz@falooley.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Ryusuke Konishi says:
The recent truncate_complete_page() clears the dirty flag from a page
before calling a_ops->invalidatepage(),
static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
	...
	cancel_dirty_page(page, PAGE_CACHE_SIZE);	/* <--- inserted here at kernel 2.6.20 */
	if (PagePrivate(page))
		do_invalidatepage(page, 0);		/* ---> will call a_ops->invalidatepage() */
	...
}
and this is disturbing nfs_wb_page_priority() from calling
nfs_writepage_locked() that is expected to handle the pending
request (=nfs_page) associated with the page.
int nfs_wb_page_priority(struct inode *inode, struct page *page, int how)
{
	...
	if (clear_page_dirty_for_io(page)) {
		ret = nfs_writepage_locked(page, &wbc);
		if (ret < 0)
			goto out;
	}
	...
}
Since truncate_complete_page() will get rid of the page after
a_ops->invalidatepage() returns, the request (=nfs_page) associated
with the page becomes garbage in nfs_inode->nfs_page_tree.
------------------------
Fix this by ensuring that nfs_wb_page_priority() recognises that it may
also need to clear out non-dirty pages that have an nfs_page associated
with them.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
According to the mount(2) man page, the proper error return code for the
mount(2) system call when the special device name or the mounted-on
directory name is too long is ENAMETOOLONG.
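A hedged illustration of the kind of check implied (the length limit and
variable names are illustrative, not the actual NFS mount parsing code):
	/* Return the error the mount(2) man page documents for over-long
	 * device or mount-point names. */
	if (strlen(dev_name) > maxlen || strlen(dir_name) > maxlen)
		return -ENAMETOOLONG;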
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The hostname was getting truncated in the new text-based NFS mount API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't filter the return code from the in-kernel rpcbind or NFS mount
clients. Return the real error code so that callers of the new NFS
text-based mount API can apply a useful retry strategy.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new text-based NFS mount option parsing logic doesn't recognize any
valid transport protocols due to a silly mistake in the protocol token
matching logic. This prevents basic mount requests such as:
mount.nfs server:/export /mnt -o proto=tcp
from working with the new text-based NFS mount API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Doh! We can't use cancel_delayed_work_sync because we may have been called
from an unmount that was being performed by nfs_automount_task.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This avoids the recent NFS mount regression (returning EBUSY when
mounting the same filesystem twice with different parameters).
The best I can do given the constraints appears to be to have the kernel
first look for a superblock that matches both the fsid and the
user-specified mount options, and then spawn off a new superblock if
that search fails.
Note that this is not the same as specifying nosharecache everywhere
since nosharecache will never attempt to match an existing superblock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For hugepage mappings, the file offset, like the address and size, needs to
be aligned to the size of a hugepage.
In commit 68589bc353, the check for this was
moved into prepare_hugepage_range() along with the address and size checks.
But since BenH's rework of the get_unmapped_area() paths leading up to
commit 4b1d89290b, prepare_hugepage_range()
is only called for MAP_FIXED mappings, not for other mappings. This means
we're no longer ever checking for an aligned offset - I've confirmed that
mmap() will (apparently) succeed with a misaligned offset on both powerpc
and i386 at least.
This patch restores the check, removing it from prepare_hugepage_range()
and putting it back into hugetlbfs_file_mmap(). I'm putting it there,
rather than in the get_unmapped_area() path, so it only needs to go in one
place instead of separately in the half-dozen or so arch-specific
implementations of hugetlb_get_unmapped_area().
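A hedged sketch of the restored check in hugetlbfs_file_mmap() (the exact
surrounding code is not shown; HPAGE_SIZE and PAGE_SHIFT are the usual
constants):
	/* vm_pgoff is in PAGE_SIZE units; require a whole number of
	 * hugepages, just like the address and length. */
	if (vma->vm_pgoff & ((HPAGE_SIZE >> PAGE_SHIFT) - 1))
		return -EINVAL;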
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will avoid a possible fault in ecryptfs_sync_page().
In that function, eCryptfs calls the sync_page() method of a lower
filesystem without checking that it exists. However, many filesystems,
including network filesystems such as NFS and AFS, don't provide this
method, so the call can fault while an eCryptfs page is waiting for a
page lock.
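A minimal sketch of the missing-method check (the lower_page variable and
the surrounding wait loop are not shown):
	/* Lower filesystems such as NFS or AFS may not implement
	 * sync_page at all, so check before calling through. */
	if (lower_page->mapping->a_ops->sync_page)
		lower_page->mapping->a_ops->sync_page(lower_page);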
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix possible NULL pointer dereference when freeing blocks in case table of
free space is used. Also fix handling of the case when we need to move
extent from one block to another one to make space for indirect extent.
BTW: Nobody seems to have ever used this code.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If UDF superblock is incorrect, we can fail to find a table of free /
allocated space and consequently Oops. Handle this situation more
gracefully by ignoring the broken UDF partition.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: fix bad error path in conversion routines
9p: remove deprecated v9fs_fid_lookup_remove()
9p: update maintainers and documentation
9p: fix use after free
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
sysfs: don't warn on removal of a nonexistent binary file
HOWTO: latest lxr url address changed
HOWTO: korean translation of Documentation/HOWTO
Fix Off-by-one in /sys/module/*/refcnt
sysfs: fix locking in sysfs_lookup() and sysfs_rename_dir()
This patch removes the v9fs_fid_lookup_remove which is no longer used.
Based on original patch from Adrian Bunk <bunk@stusta.de> which
used #if 0 to isolate the code.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Fix the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It
reverts parts of commit b27f03d4bd by
converting fs/proc/array.c back to cputime_t. The new functions
task_utime and task_stime now return cputime_t instead of clock_t. If
CONFIG_VIRT_CPU_ACCOUNTING is set, task->utime and task->stime are
returned directly instead of using sum_exec_runtime.
Patch is tested on s390x with and without VIRT_CPU_ACCOUNTING as well as
on i386.
[ mingo@elte.hu: cleanups, comments. ]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
de_thread:
	if (atomic_read(&oldsighand->count) <= 1)
		BUG_ON(atomic_read(&sig->count) != 1);
This is not safe without the rmb() in between. The results of two
correctly ordered __exit_signal()->atomic_dec_and_test()'s could be seen
out of order on our CPU.
The same is true for the "thread_group_empty()" case, __unhash_process()'s
changes could be seen before atomic_dec_and_test(&sig->count).
On some platforms (including i386) atomic_read() doesn't provide even the
compiler barrier, in that case these checks are simply racy.
Remove these BUG_ON()'s. Alternatively, we can do something like
BUG_ON( ({ smp_rmb(); atomic_read(&sig->count) != 1; }) );
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to inconsistent locking in the VFS between calls to lookup and
revalidate deadlock can occur in the automounter.
The inconsistency is that the directory inode mutex is held for both lookup
and revalidate calls when called via lookup_hash whereas it is held only
for lookup during a path walk. Consequently, if the mutex is held during a
call to revalidate autofs4 can't release the mutex to callback the daemon
as it can't know whether it owns the mutex.
This situation happens when a process tries to create a directory within an
automount and a second process also tries to create the same directory
between the lookup and the mkdir. Since the first process has dropped the
mutex for the daemon callback, the second process takes it during
revalidate leading to deadlock between the autofs daemon and the second
process when the daemon tries to create the mount point directory.
After spending quite a bit of time trying to resolve this on more than one
occasion, using rather complex and ugly approaches, it turns out that simply
delaying the hashing of the dentry until the create operation is enough to
fix it.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patch any thread can dequeue its own private signals via signalfd,
even if it was created by another sub-thread.
To do so, we pass "current" to dequeue_signal() if the caller is from the same
thread group. This also fixes the scheduling of posix timers broken by the
previous patch.
If the caller doesn't belong to this thread group, we can't handle __SI_TIMER
case properly anyway. Perhaps we should forbid the cross-process signalfd usage
and convert ctx->tsk to ctx->sighand.
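A hedged sketch of the rule (ctx->tsk follows the wording above; the
ctx->sigmask field and the locking around dequeue_signal() are assumed):
	struct task_struct *tsk;
	/* Dequeue the caller's own private signals when the caller shares
	 * the thread group with the signalfd creator. */
	if (current->tgid == ctx->tsk->tgid)
		tsk = current;
	else
		tsk = ctx->tsk;
	ret = dequeue_signal(tsk, &ctx->sigmask, &info);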
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When ecryptfs_lookup() is called against special files, eCryptfs generates
the following errors because it tries to treat them like regular eCryptfs
files.
Error opening lower file for lower_dentry [0xffff810233a6f150], lower_mnt [0xffff810235bb4c80], and flags [0x8000]
Error opening lower_file to read header region
Error attempting to read the [user.ecryptfs] xattr from the lower file; return value = [-95]
Valid metadata not found in header region or xattr region; treating file as unencrypted
For instance, the problem can be reproduced by the steps below.
# mkdir /root/crypt /mnt/crypt
# mount -t ecryptfs /root/crypt /mnt/crypt
# mknod /mnt/crypt/c0 c 0 0
# umount /mnt/crypt
# mount -t ecryptfs /root/crypt /mnt/crypt
# ls -l /mnt/crypt
This patch fixes it by adding a check similar to directories and
symlinks.
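A minimal sketch of the added check (placement inside ecryptfs_lookup()
and the lower_inode variable are illustrative):
	/* Like directories and symlinks, special files carry no eCryptfs
	 * header, so don't try to open them and read metadata. */
	if (special_file(lower_inode->i_mode))
		goto out;	/* treat as an ordinary lower object */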
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (as960) removes the error message and stack dump logged by
sysfs_remove_bin_file() when someone tries to remove a nonexistent
file. The warning doesn't seem to be needed, since none of the other
file-, symlink-, or directory-removal routines in sysfs complain in a
comparable way.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sd children list walking in sysfs_lookup() and sd renaming in
sysfs_rename_dir() were left out during i_mutex -> sysfs_mutex
conversion. Fix them.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch uses kzalloc to zero all of struct dio rather than manually
trying to track which fields we rely on being zero. It passed aio+dio
stress testing and some bug regression testing on ext3.
This patch was introduced by Linus in the conversation that led up to
Badari's minimal fix to manually zero .map_bh.b_state in commit:
6a648fa721
It makes the code a bit smaller. Maybe a couple fewer cachelines to
load, if we're lucky:
   text    data     bss     dec    hex filename
3285925  568506 1304616 5159047 4eb887 vmlinux
3285797  568506 1304616 5158919 4eb807 vmlinux.patched
I was unable to measure a stable difference in the number of cpu cycles
spent in blockdev_direct_IO() when pushing aio+dio 256K reads at
~340MB/s.
So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.
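A minimal sketch of the allocation change described (the surrounding
fs/direct-io.c context is omitted):
	struct dio *dio;
	/* Zero the whole structure up front instead of relying on each
	 * individual field being cleared by hand later. */
	dio = kzalloc(sizeof(*dio), GFP_KERNEL);
	if (!dio)
		return -ENOMEM;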
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a491486a20 introduced a locking
problem in JFFS2 -- we up() the alloc_sem when we weren't previously
holding it. This leads to all kinds of fun behaviour later.
There was a _reason_ for the
if (1 /* alternative path needs testing */ ||
which the above-mentioned commit removed :)
Discovered and debugged by Giulio Fedel <giulio.fedel@andorsystems.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Check return code on failed alloc
[CIFS] Update CIFS project web site
[CIFS] Fix hang in find_writable_file
This fixes a vulnerability in the "parent process death signal"
implementation discovered by Wojciech Purczynski of COSEINC PTE Ltd.
and iSEC Security Research.
http://marc.info/?l=bugtraq&m=118711306802632&w=2
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 569a7b6c2e. The
code was correct originally. The default setting for ACLs after a
remount should be to be the same as before the remount.
Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Due to a mix up between the jdata attribute and inherit jdata attribute
it has not been possible to set the inherit jdata attribute on
directories. This is now fixed and the ioctl will report the inherit
jdata attribute for directories rather than the jdata attribute as it
did previously. This stems from our need to have the one bit in the
ioctl attr flags mean two different things according to whether the
underlying inode is a directory or not.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The error path in prepare_write() was incorrect in the (very rare) event
that the transaction fails to start. The following prevents a NULL
pointer dereference.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch fixes a bug where 0 was being used as a return code
to indicate "nothing to do" when in fact 0 was a valid block location
which might be returned by the function.
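A generic sketch of the issue (the sentinel and helper names are
illustrative, not the actual GFS2 code): block 0 is a valid location, so
"nothing to do" needs a sentinel of its own.
	#define NO_BLOCK ((u64)~0)	/* illustrative "not found" value */
	u64 block;
	block = search_for_unlinked_block(rgd);	/* illustrative helper */
	if (block == NO_BLOCK)
		return 0;	/* nothing to do; 0 is no longer overloaded */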
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch seems to fix the problem described in bugzilla bug 246114.
It was written by Steve Whitehouse with some tweaking by me.
The code was looping in the relatively new section of code designed to
search for and reuse unlinked inodes. In cases where it was finding an
appropriate inode to reuse, it was looping around and finding the same
block over and over because a "<=" check should have been a "<" when
comparing the goal block to the last unlinked block found.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is part 2 of the patch for bug #245832, part 1 of which is already
in the git tree.
The problem was that sdp->sd_log_num_databuf was not always being
protected by the gfs2_log_lock spinlock, but the sd_log_le_databuf
(which it is supposed to reflect) was protected. That meant there
was a timing window during which gfs2_log_flush called
databuf_lo_before_commit and the count didn't match what was
really on the linked list in that window. So when it ran out of
items on the linked list, it decremented total_dbuf from 0 to -1 and
thus never left the "while(total_dbuf)" loop.
The solution is to protect the variable sdp->sd_log_num_databuf so
that the value will always match the contents of the linked list,
and therefore the number will never go negative, and therefore, the
loop will be exited properly.
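A minimal sketch of the rule the fix enforces (sd_log_num_databuf and
sd_log_le_databuf are the fields named above; the list element is
illustrative):
	gfs2_log_lock(sdp);
	list_add(&le->le_list, &sdp->sd_log_le_databuf);
	sdp->sd_log_num_databuf++;	/* count changes only under the lock */
	gfs2_log_unlock(sdp);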
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix a long standing bug where a blocking callback would be missed
when there's a granted lock in PR mode and waiting locks in both
PR and CW modes (and the PR lock was added to the waiting queue
before the CW lock). The logic simply compared the numerical values
of the modes to determine if a blocking callback was required, but in
the one case of PR and CW, the lower valued CW mode blocks the higher
valued PR mode. We just need to add a special check for this PR/CW
case in the tests that decide when a blocking callback is needed.
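A hedged sketch of the extra test (DLM_LOCK_PR and DLM_LOCK_CW are the
standard DLM mode constants; the helper and its callers are illustrative):
	static int blocking_callback_needed(int granted, int requested)
	{
		/* PR and CW conflict even though CW is the numerically
		 * lower mode, so handle that pair explicitly. */
		if ((granted == DLM_LOCK_PR && requested == DLM_LOCK_CW) ||
		    (granted == DLM_LOCK_CW && requested == DLM_LOCK_PR))
			return 1;
		return requested > granted;
	}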
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The last patch to clean out 'othercon' structures only fixed half the problem.
The attached addresses the other situations too, and fixes bz#238490
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There's a memory leak in fs/dlm/member.c::dlm_add_member().
If "dlm_node_weight(ls->ls_name, nodeid)" returns < 0, then
we'll return without freeing the memory allocated to the (at
that point yet unused) 'memb'.
This patch frees the allocated memory in that case and thus
avoids the leak.
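A minimal sketch of the fix described (the error-path shape is assumed
from the text):
	w = dlm_node_weight(ls->ls_name, nodeid);
	if (w < 0) {
		kfree(memb);	/* free the not-yet-used member */
		return w;
	}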
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When we build a sockaddr_storage for an IP address, clear the unused parts as
they could be used for node comparisons.
I have seen this occasionally make sctp connections fail.
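A minimal sketch of the point being made (IPv4 case shown; addr and port
are the node's already-known address and port):
	struct sockaddr_storage ss;
	struct sockaddr_in *in4 = (struct sockaddr_in *)&ss;
	/* Zero everything first so padding and unused fields compare equal
	 * when the address is later used for node comparisons. */
	memset(&ss, 0, sizeof(ss));
	in4->sin_family = AF_INET;
	in4->sin_addr.s_addr = addr;
	in4->sin_port = port;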
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix regression in recent patch "[DLM] variable allocation" which
attempts to dereference an "ls" struct when it's NULL.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch clears the othercon pointer and frees the memory when a connection
is closed. This could cause a small memory leak when nodes leave the cluster.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: set non-default s_time_gran during mount
ocfs2: Retry sendpage() if it returns EAGAIN
ocfs2: Fix rename/extend race
[2.6 patch] ocfs2_insert_extent(): remove dead code
ocfs2: Fix max offset calculations
ocfs2: check ia_size limits in setattr
ocfs2: Fix some casting errors related to file writes
ocfs2: use s_maxbytes directly in ocfs2_change_file_space()
ocfs2: Restrict inode changes in ocfs2_update_inode_atime()
ecryptfs_init() exits without doing any cleanup jobs if
ecryptfs_init_messaging() fails. In that case, eCryptfs leaves
sysfs entries, leaks memory, and causes an invalid page fault.
This patch fixes the problem.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Need to initialize map_bh.b_state to zero. Otherwise, in case of a faulty
user-buffer it's possible to go into dio_zero_block() and submit a page by
mistake - since it checks for buffer_new().
http://marc.info/?l=linux-kernel&m=118551339032528&w=2
akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got
lost. Probably this version is better for -stable anyway.
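The fix itself is a one-line initialization; a sketch (dio is the struct
dio instance set up in fs/direct-io.c):
	/* Start with a clean buffer_head state so a stale BH_New bit from
	 * a faulty user buffer can't trick dio_zero_block() into
	 * submitting a page. */
	dio->map_bh.b_state = 0;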
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: gurudas pai <gurudas.pai@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to manually set this to '1' during mount, otherwise inode_setattr()
will chop off the nanosecond portion of our timestamps.
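A one-line sketch of the assignment described (done at mount time on the
ocfs2 super block):
	sb->s_time_gran = 1;	/* keep nanosecond-resolution timestamps */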
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Instead of treating EAGAIN, returned from sendpage(), as an error, this
patch retries the operation.
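A hedged sketch of the retry (the o2net send helper and the flag value are
not taken from the patch; the cond_resched() call is an assumption):
	do {
		ret = sock->ops->sendpage(sock, page, offset, len,
					  MSG_DONTWAIT);
		if (ret == -EAGAIN)
			cond_resched();	/* let the socket buffer drain */
	} while (ret == -EAGAIN);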
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
If one process is extending a file while another is renaming it, there
exists a window when rename could flush the old inode's stale i_size to
disk. This patch recognizes the fact that rename is only updating the old
inode's ctime, so it ensures only that value is flushed to disk.
Signed-off-by: Sunil Mushran <sunil.musran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch removes some now dead code.
Spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_max_file_offset() was over-estimating the largest file size for
several cases. This wasn't really a problem before, but now that we support
sparse files, it needs to be more accurate.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We have to manually check the requested truncate size as the check in
vmtruncate() comes too late for Ocfs2.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_align_clusters_to_page_index() needs to cast the clusters shift to
pgoff_t and ocfs2_file_buffered_write() needs loff_t when calculating
destination start for memcpy.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
There's no need to recalculate things via ocfs2_max_file_offset() as we've
already done that to fill s_maxbytes, so use that instead. We can also
un-export ocfs2_max_file_offset() then.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_update_inode_atime() calls ocfs2_mark_inode_dirty() to push changes
from the struct inode into the ocfs2 disk inode. The problem is,
ocfs2_mark_inode_dirty() might change other fields, depending on what
happened to the struct inode. Since we don't always have locking to
serialize changes to other fields (like i_size, etc), just fix things up to
only touch the atime field.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
SUNRPC: Replace flush_workqueue() with cancel_work_sync() and friends
NFS: Replace flush_scheduled_work with cancel_work_sync() and friends
SUNRPC: Don't call gss_delete_sec_context() from an rcu context
NFSv4: Don't call put_rpccred() from an rcu callback
NFS: Fix NFSv4 open stateid regressions
NFSv4: Fix a locking regression in nfs4_set_mode_locked()
NFS: Fix put_nfs_open_context
SUNRPC: Fix a race in rpciod_down()
Do not allow cached open for O_RDONLY or O_WRONLY unless the file has been
previously opened in these modes.
Also fix the calculation of the mode in nfs4_close_prepare. We should only
issue an OPEN_DOWNGRADE if we're sure that we will still be holding the
correct open modes. This may not be the case if we've been doing delegated
opens.
Finally, there is no need to adjust the open mode bit flags in
nfs4_close_done(): that has already been done in nfs4_close_prepare().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't really need to clear &state->inode_states inside
nfs4_set_mode_locked, and doing so without holding the inode->i_lock would
in any case be a bug...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to grab the inode->i_lock atomically with the last reference put in
order to remove the open context that is being freed from the
nfsi->open_files list.
Fix by converting the kref to a standard atomic counter and then using
atomic_dec_and_lock()...
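A minimal sketch of the pattern described (context field names are
illustrative):
	/* Drop the last reference and take inode->i_lock in one atomic
	 * step, so removal from nfsi->open_files can't race a lookup. */
	if (atomic_dec_and_lock(&ctx->count, &inode->i_lock)) {
		list_del(&ctx->list);
		spin_unlock(&inode->i_lock);
		/* ...free the context... */
	}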
Thanks to Arnd Bergmann for pointing out the problem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes some duplicated wireless ioctl entries in the array
'struct ioctl_trans ioctl_start[]' of fs/compat_ioctl.c
These entries are registered twice like:
COMPATIBLE_IOCTL(SIOCGIWPRIV)
and
HANDLE_IOCTL(SIOCGIWPRIV, do_wireless_ioctl)
Signed-off-by: Masakazu Mokuno <mokuno@sm.sony.co.jp>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Debugging the hardware problems in OLPC trac #1905 would be a whole lot
easier if the correct node offsets were printed for the offending nodes.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
The try_to_freeze() call was in the wrong place; we need it in the
signal-pending loop now that a pending freeze also makes
signal_pending() return true.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
jffs2_add_physical_node_ref() should never really return an error -- it's
an internal debugging check which triggered. We really need to work out
why and stop it happening. But in the meantime, let's make the failure
mode a little less nasty.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch fixes weird behaviour of the UDF mounting procedure. To get the
UID changed we currently have to type
mount -t udf -o uid=some_user,uid=ignore /dev/device /mnt/mount_point
and specifying two uid options at once is a bit strange. With this patch we
are able to mount without the additional 'uid=ignore' option. The same is
done for the GID option.
This patch will not break the current mount scheme (with two options).
Btw this does fix (I hope) the following
[BUG 6124] mount of UDF fs ignores UID and GID options
http://bugzilla.kernel.org/show_bug.cgi?id=6124
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael <auslands-kv@gmx.de>
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it a little more clear that this is the default implementation for
the setlease operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible that another process could acquire a new file lease right
after break_lease() is called during a truncate, but before lease-granting
is disabled by the subsequent get_write_access(). Merely switching the
order of the break_lease() and get_write_access() calls prevents this race.
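A hedged sketch of the reordering (the FMODE_WRITE argument is the usual
one for a truncate but is not taken from the patch text; only the relative
order of the two calls matters here):
	error = get_write_access(inode);	/* disables new lease grants */
	if (error)
		goto out;
	error = break_lease(inode, FMODE_WRITE);
	if (error) {
		put_write_access(inode);
		goto out;
	}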
Signed-off-by: David M. Richter <richterd@citi.umich.edu>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turned out that mounting a corrupted ISO image to a regular file may
succeed, e.g. if an image was prepared as follows:
$ dd if=correct.iso of=bad.iso bs=4k count=8
We then can mount it to a regular file:
# mount -o loop -t iso9660 bad.iso /tmp/file
But mounting it to a directory fails with -ENOTDIR, simply because
the root directory inode doesn't have S_IFDIR set and the condition
in graft_tree() is met:
	if (S_ISDIR(nd->dentry->d_inode->i_mode) !=
	    S_ISDIR(mnt->mnt_root->d_inode->i_mode))
		return -ENOTDIR;
This is because the root directory inode was read from an incorrect
block. It's supposed to be read from sbi->s_firstdatazone, which is
an absolute value and gets messed up in the case of an incorrect image.
In order to somehow circumvent this we have to check that the root
directory inode is actually a directory after all.
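A minimal sketch of the extra sanity check (placement and the error label
are illustrative):
	/* A corrupted image can hand us a "root directory" inode read from
	 * the wrong block; refuse the mount if it isn't a directory. */
	if (!S_ISDIR(inode->i_mode)) {
		printk(KERN_WARNING "isofs: root inode is not a directory\n");
		goto out_bad_root;
	}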
Signed-off-by: Kirill Kuvaldin <kuvkir@epsmu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix file locking for AFS:
(*) Start the lock manager thread under a mutex to avoid a race.
(*) Make the locking non-fair: new readlocks will jump pending writelocks if
there's a readlock currently granted on a file. This makes the behaviour
similar to Linux's VFS locking.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A successful downcall with a negative result (which indicates that the given
filesystem is not exported to the given user) should not return an error.
Currently mountd is depending on stdio to write these downcalls. With some
versions of libc this appears to cause subsequent writes to attempt to write
all accumulated data (for which writes previously failed) along with any new
data. This can prevent the kernel from seeing responses to later downcalls.
Symptoms will be that nfsd fails to respond to certain requests.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We shouldn't be using negative uid's and gid's in the idmap upcalls.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RFC 3530 says:
If the server uses an attribute to store the exclusive create verifier, it
will signify which attribute by setting the appropriate bit in the attribute
mask that is returned in the results.
Linux uses the atime and mtime to store the verifier, but sends a zeroed out
bitmask back to the client. This patch makes sure that we set the correct
bits in the bitmask in this situation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yan Zheng wrote:
> I think I found a bug in ext4/extents.c, "ext4_ext_put_in_cache" uses
> "__u32" to receive physical block number. "ext4_ext_put_in_cache" is
> used in "ext4_ext_get_blocks", it sets ext4 inode's extent cache
> according most recently tree lookup (higher 16 bits of saved physical
> block number are always zero). when serving a mapping request,
> "ext4_ext_get_blocks" first check whether the logical block is in
> inode's extent cache. if the logical block is in the cache and the
> cached region isn't a gap, "ext4_ext_get_blocks" gets physical block
> number by using cached region's physical block number and offset in
> the cached region. as described above, "ext4_ext_get_blocks" may
> return wrong result when there are physical block numbers bigger than
> 0xffffffff.
>
You are right. Thanks for reporting this!
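A sketch of the type change the report implies (argument order is assumed;
the point is that the cached physical start block must be an ext4_fsblk_t
rather than __u32):
	static void ext4_ext_put_in_cache(struct inode *inode, __u32 block,
					  __u32 len, ext4_fsblk_t start,
					  int type);	/* start was __u32 */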
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Cc: Yan Zheng <yanzheng@21cn.com>
Cc: <stable@kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the SYSV IPC SHM to work with the changes applied by the new fault handler
patches when CONFIG_MMU=n.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are handled in a ->compat_ioctl() handler, so it's just noise
when compat_ioctl.c warns which occurs when they are used on non-SBUS
framebuffer devices.
Signed-off-by: David S. Miller <davem@davemloft.net>
Start doing VTOC validation before using its contents.
The validation is adjusted so as not to break existing setups
that do not set the VTOC version, sanity and partition count entries.
VTOC tables with more than 8 partitions will NOT be used.
Signed-off-by: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct the Solaris x86 number of partitions (slices) in a way that is
backward compatible with the earlier size.
This works without a new VTOC structure definition as the timestamp
and v_asciilabel fields in the VTOC are not used by the kernel yet.
Signed-off-by: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is important to only provide the compat_ioctl method
if the downstream de->proc_fops does too, otherwise this
utterly confuses the logic in fs/compat_ioctl.c and we
end up doing the wrong thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
docbook: add pipes, other fixes
blktrace: use cpu_clock() instead of sched_clock()
bsg: Fix build for CONFIG_BLOCK=n
[patch] QUEUE_FLAG_READFULL QUEUE_FLAG_WRITEFULL comment fix
b716395e2b added code to handle
a compatibility issue with 32bit quota tools, but the new compat
routines are only needed when CONFIG_COMPAT=y (and with this set
to 'n' there are compilation problems since some new typedefs are
not visible).
Reported by Doug Chapman. Fix tuned by a cast of thousands (Andi,
Andreas, Arthur, HPA, Willy)
Signed-off-by: Tony Luck <tony.luck@intel.com>
Fix some typos in pipe.c and splice.c.
Add pipes API to kernel-api.tmpl.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ext[234]_check_descriptors sanity checks block group descriptor geometry at
mount time, testing whether the block bitmap, inode bitmap, and inode table
reside wholly within the blockgroup. However, the inode table test is off
by one so that if the last block in the inode table resides on the last
block of the block group, the test incorrectly fails. This is because it
tests the last block as (start + length) rather than (start + length - 1).
This can be seen by trying to mount a filesystem made such as:
mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024
which yields:
EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 not in group (block 101)!
EXT2-fs: group descriptors corrupted!
There is a similar bug in e2fsprogs, patch already sent for that.
(I wonder if inside(), outside(), and/or in_range() should someday be
used in this and other tests throughout the ext filesystems...)
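A hedged sketch of the corrected boundary test (variable names are
illustrative, not the actual ext2 code):
	/* The last inode table block is start + length - 1, so a table
	 * ending exactly on the group's last block must still pass. */
	last_block = first_block + blocks_per_group - 1;
	itable_last = inode_table + itable_blocks - 1;	/* was: + itable_blocks */
	if (inode_table < first_block || itable_last > last_block)
		return 0;	/* descriptors corrupted */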
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davi fixed a missing cast in the __put_user() that was making timerfd
return a single byte instead of the full value.
Talking with Michael about the timerfd man page, we think it'd be better to
use a u64 for the returned value, to align it with the eventfd
implementation.
This is an ABI change. The timerfd code is new in 2.6.22 and if we merge this
into 2.6.23 then we should also merge it into 2.6.22.x. That will leave a few
early 2.6.22 kernels out in the wild which might misbehave when a future
timerfd-enabled glibc is run on them.
mtk says: The difference would be that read() will only return 4 bytes, while
the application will expect 8. If the application is checking the size of
returned value, as it should, then it will be able to detect the problem (it
could even be sophisticated enough to know that if this is a 4-byte return,
then it is running on an old 2.6.22 kernel). If the application is not
checking the return from read(), then its 8-byte buffer will not be filled --
the contents of the last 4 bytes will be undefined, so the u64 value as a
whole will be junk.
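A hedged userspace-side sketch of the point mtk makes (timer_fd setup is
not shown):
	uint64_t ticks;
	ssize_t n = read(timer_fd, &ticks, sizeof(ticks));
	if (n == sizeof(uint64_t)) {
		/* fixed ABI: full 8-byte expiration count */
	} else if (n == 4) {
		/* unfixed 2.6.22: only the first 4 bytes are valid */
	}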
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Davi Arnaut <davi@haxent.com.br>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is probably a leftover from a time when the return wasn't there yet.
Now the extra assignment is just irritating.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Caused by unneeded reopen during reconnect while spinlock held.
Fixes kernel bugzilla bug #7903
Thanks to Lin Feng Shen for testing this, and Amit Arora for
some nice problem determination to narrow this down.
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
kunmap_atomic() takes the virtual address, not the mapped page as
argument.
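A usage sketch of the rule (2.6-era API with the KM_USER0 slot argument):
	char *kaddr = kmap_atomic(page, KM_USER0);
	memcpy(kaddr + offset, src, len);
	kunmap_atomic(kaddr, KM_USER0);	/* the mapped address, not "page" */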
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'request-queue-t' of git://git.kernel.dk/linux-2.6-block:
[BLOCK] Add request_queue_t and mark it deprecated
[BLOCK] Get rid of request_queue_t typedef
The fallocate syscall returns ENOSYS in case the filesystem does not support
the operation and expects the userlevel code to fill in. This is good in
concept.
The problem is that the libc code for old kernels should be able to
distinguish the case where the syscall is not at all available vs not
functioning for a specific mount point. As is this is not possible and we
always have to invoke the syscall even if the kernel doesn't support it.
I suggest the following patch. Using EOPNOTSUPP is IMO the right thing to do.
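A minimal sketch of the suggested change (placement in the generic
sys_fallocate() path is assumed):
	/* "Not supported on this filesystem" rather than "no such system
	 * call", so userlevel code can tell the two cases apart. */
	if (!inode->i_op || !inode->i_op->fallocate)
		return -EOPNOTSUPP;	/* was -ENOSYS */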
Cc: Amit Arora <aarora@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweep of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Obviously broken on little-endian; fortunately, the option is not
frequently used...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ Hey, sparse is wonderful, but even better than sparse is having people
like Al that actually _run_ it and fix bugs using it. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Too many remote cpu references due to /proc/stat.
On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS,
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
/proc/stat is parsed by common commands like top, who etc., causing lots
of cacheline transfers.
This statistic seems useless. Other 'big iron' arches disable this.
AK: changed to remove for all SMP setups
AK: add comment
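A hedged sketch of the change (placement in show_stat() and the exact
config test are assumed, not taken from the patch):
	/* Summing kstat_irqs() touches per-cpu data on every online cpu,
	 * so skip the per-IRQ breakdown on SMP builds. */
#ifndef CONFIG_SMP
	for (i = 0; i < NR_IRQS; i++)
		seq_printf(p, " %u", kstat_irqs(i));
#endif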
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>