linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Trond Myklebust	78ea323be6	NFSv4: Don't use cred->cr_ops->cr_name in nfs4_proc_setclientid() With the recent change to generic creds, we can no longer use cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name instead... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:54:53 -04:00
Fred Isaman	4410924157	nfs: fix printout of multiword bitfields Benny points out that zero-padding of multiword bitfields is necessary, and that delimiting each word is nice to avoid endianess confusion. bhalevy: without zero padding output can be ambiguous. Also, since the printed array of two 32-bit unsigned integers is not a 64-bit number, delimiting the output with a semicolon makes more sense. Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:54:50 -04:00
Benny Halevy	856dff3d38	nfs: return negative error value from nfs{,4}_stat_to_errno All use sites for nfs{,4}_stat_to_errno negate their return value. It's more efficient to return a negative error from the stat_to_errno convertors rather than negating its return value everywhere. This also produces slightly smaller code. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:54:47 -04:00
Trond Myklebust	d11d10cc05	NLM/lockd: Ensure client locking calls use correct credentials Now that we've added the 'generic' credentials (that are independent of the rpc_client) to the nfs_open_context, we can use those in the NLM client to ensure that the lock/unlock requests are authenticated to whoever originally opened the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:54:43 -04:00
Trond Myklebust	c4d7c402b7	NFS: Remove the buggy lock-if-signalled case from do_setlk() Both NLM and NFSv4 should be able to clean up adequately in the case where the user interrupts the RPC call... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:52 -04:00
Trond Myklebust	5f50c0c6d6	NLM/lockd: Fix a race when cancelling a blocking lock We shouldn't remove the lock from the list of blocked locks until the CANCEL call has completed since we may be racing with a GRANTED callback. Also ensure that we send an UNLOCK if the CANCEL request failed. Normally that should only happen if the process gets hit with a fatal signal. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:49 -04:00
Trond Myklebust	6b4b3a752b	NLM/lockd: Ensure that nlmclnt_cancel() returns results of the CANCEL call Currently, it returns success as long as the RPC call was sent. We'd like to know if the CANCEL operation succeeded on the server. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:45 -04:00
Trond Myklebust	8ec7ff7444	NLM: Remove the signal masking in nlmclnt_proc/nlmclnt_cancel The signal masks have been rendered obsolete by the preceding patch. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:42 -04:00
Trond Myklebust	dc9d8d0481	NLM/lockd: convert __nlm_async_call to use rpc_run_task() Peter Staubach comments: > In the course of investigating testing failures in the locking phase of > the Connectathon testsuite, I discovered a couple of things. One was > that one of the tests in the locking tests was racy when it didn't seem > to need to be and two, that the NFS client asynchronously releases locks > when a process is exiting. ... > The Single UNIX Specification Version 3 specifies that: "All locks > associated with a file for a given process shall be removed when a file > descriptor for that file is closed by that process or the process holding > that file descriptor terminates.". > > This does not specify whether those locks must be released prior to the > completion of the exit processing for the process or not. However, > general assumptions seem to be that those locks will be released. This > leads to more deterministic behavior under normal circumstances. The following patch converts the NFSv2/v3 locking code to use the same mechanism as NFSv4 for sending asynchronous RPC calls and then waiting for them to complete. This ensures that the UNLOCK and CANCEL RPC calls will complete even if the user interrupts the call, yet satisfies the above request for synchronous behaviour on process exit. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:39 -04:00
Trond Myklebust	5e7f37a76f	NLM/lockd: Add a reference counter to struct nlm_rqst When we replace the existing synchronous RPC calls with asynchronous calls, the reference count will be needed in order to allow us to examine the result of the RPC call. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:36 -04:00
Trond Myklebust	536ff0f809	NFSv4: Ensure we don't corrupt fl->fl_flags in nfs4_proc_unlck Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:33 -04:00
Trond Myklebust	4a9af59fee	NLM/lockd: Ensure we don't corrupt fl->fl_flags in nlmclnt_unlock() Also fix up nlmclnt_lock() so that it doesn't pass modified versions of fl->fl_flags to nlmclnt_cancel() and other helpers. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:30 -04:00
Trond Myklebust	c1d519312d	NFSv4: Only increment the sequence id if the server saw it It is quite possible that the OPEN, CLOSE, LOCK, LOCKU,... compounds fail before the actual stateful operation has been executed (for instance in the PUTFH call). There is no way to tell from the overall status result which operations were executed from the COMPOUND. The fix is to move incrementing of the sequence id into the XDR layer, so that we do it as we process the results from the stateful operation. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:15 -04:00
Trond Myklebust	35d05778e2	NFSv4: Remove bogus call to nfs4_drop_state_owner() in _nfs4_open_expired() There should be no need to invalidate a perfectly good state owner just because of a stale filehandle. Doing so can cause the state recovery code to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely on finding active state owners. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:12 -04:00
Trond Myklebust	dbae4c73f0	NFS: Ensure that rpc_run_task() errors are propagated back to the caller Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:08 -04:00
Trond Myklebust	c9d8f89d98	NFS: Ensure that the write code cleans up properly when rpc_run_task() fails Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:05 -04:00
Trond Myklebust	fdd1e74c89	NFS: Ensure that the read code cleans up properly when rpc_run_task() fails In the case of readpage() we need to ensure that the pages get unlocked, and that the error is flagged. In the case of O_DIRECT, we need to ensure that the pages are all released. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:53:01 -04:00
Trond Myklebust	73e3302f60	NFS: Fix nfs_wb_page() to always exit with an error or a clean page It is possible for nfs_wb_page() to sometimes exit with 0 return value, yet the page is left in a dirty state. For instance in the case where the server rebooted, and the COMMIT request failed, then all the previously "clean" pages which were cached by the server, but were not guaranteed to have been writted out to disk, have to be redirtied and resent to the server. The fix is to have nfs_wb_page_priority() check that the page is clean before it exits... This fixes a condition that triggers the BUG_ON(PagePrivate(page)) in nfs_create_request() when we're in the nfs_readpage() path. Also eliminate a redundant BUG_ON(!PageLocked(page)) while we're at it. It turns out that clear_page_dirty_for_io() has the exact same test. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-04-19 16:52:58 -04:00
Dave Hansen	ad775f5a8f	[PATCH] r/o bind mounts: debugging for missed calls There have been a few oopses caused by 'struct file's with NULL f_vfsmnts. There was also a set of potentially missed mnt_want_write()s from dentry_open() calls. This patch provides a very simple debugging framework to catch these kinds of bugs. It will WARN_ON() them, but should stop us from having any oopses or mnt_writer count imbalances. I'm quite convinced that this is a good thing because it found bugs in the stuff I was working on as soon as I wrote it. [hch: made it conditional on a debug option. But it's still a little bit too ugly] [hch: merged forced remount r/o fix from Dave and akpm's fix for the fix] Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:28 -04:00
Dave Hansen	2e4b7fcd92	[PATCH] r/o bind mounts: honor mount writer counts at remount Originally from: Herbert Poetzl <herbert@13thfloor.at> This is the core of the read-only bind mount patch set. Note that this does _not_ add a "ro" option directly to the bind mount operation. If you require such a mount, you must first do the bind, then follow it up with a 'mount -o remount,ro' operation: If you wish to have a r/o bind mount of /foo on bar: mount --bind /foo /bar mount -o remount,ro /bar Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:27 -04:00
Dave Hansen	3d733633a6	[PATCH] r/o bind mounts: track numbers of writers to mounts This is the real meat of the entire series. It actually implements the tracking of the number of writers to a mount. However, it causes scalability problems because there can be hundreds of cpus doing open()/close() on files on the same mnt at the same time. Even an atomic_t in the mnt has massive scalaing problems because the cacheline gets so terribly contended. This uses a statically-allocated percpu variable. All want/drop operations are local to a cpu as long that cpu operates on the same mount, and there are no writer count imbalances. Writer count imbalances happen when a write is taken on one cpu, and released on another, like when an open/close pair is performed on two Upon a remount,ro request, all of the data from the percpu variables is collected (expensive, but very rare) and we determine if there are any outstanding writers to the mount. I've written a little benchmark to sit in a loop for a couple of seconds in several cpus in parallel doing open/write/close loops. http://sr71.net/~dave/linux/openbench.c The code in here is a a worst-possible case for this patch. It does opens on a _pair_ of files in two different mounts in parallel. This should cause my code to lose its "operate on the same mount" optimization completely. This worst-case scenario causes a 3% degredation in the benchmark. I could probably get rid of even this 3%, but it would be more complex than what I have here, and I think this is getting into acceptable territory. In practice, I expect writing more than 3 bytes to a file, as well as disk I/O to mask any effects that this has. (To get rid of that 3%, we could have an #defined number of mounts in the percpu variable. So, instead of a CPU getting operate only on percpu data when it accesses only one mount, it could stay on percpu data when it only accesses N or fewer mounts.) [AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:27 -04:00
Dave Hansen	2c463e9548	[PATCH] r/o bind mounts: check mnt instead of superblock directly If we depend on the inodes for writeability, we will not catch the r/o mounts when implemented. This patches uses __mnt_want_write(). It does not guarantee that the mount will stay writeable after the check. But, this is OK for one of the checks because it is just for a printk(). The other two are probably unnecessary and duplicate existing checks in the VFS. This won't make them better checks than before, but it will make them detect r/o mounts. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:27 -04:00
Dave Hansen	ec82687f29	[PATCH] r/o bind mounts: elevate count for xfs timestamp updates Elevate the write count during the xfs m/ctime updates. XFS has to do it's own timestamp updates due to an unfortunate VFS design limitation, so it will have to track writers by itself aswell. [hch: split out from the touch_atime patch as it's not related to it at all] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:26 -04:00
Dave Hansen	2f676cbc0d	[PATCH] r/o bind mounts: make access() use new r/o helper It is OK to let access() go without using a mnt_want/drop_write() pair because it doesn't actually do writes to the filesystem, and it is inherently racy anyway. This is a rare case when it is OK to use __mnt_is_readonly() directly. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:26 -04:00
Dave Hansen	9ac9b8474c	[PATCH] r/o bind mounts: write counts for truncate() Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:25 -04:00
Dave Hansen	2af482a7ed	[PATCH] r/o bind mounts: elevate write count for chmod/chown callers chown/chmod,etc... don't call permission in the same way that the normal "open for write" calls do. They still write to the filesystem, so bump the write count during these operations. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:25 -04:00
Dave Hansen	4a3fd211cc	[PATCH] r/o bind mounts: elevate write count for open()s This is the first really tricky patch in the series. It elevates the writer count on a mount each time a non-special file is opened for write. We used to do this in may_open(), but Miklos pointed out that __dentry_open() is used as well to create filps. This will cover even those cases, while a call in may_open() would not have. There is also an elevated count around the vfs_create() call in open_namei(). See the comments for more details, but we need this to fix a 'create, remount, fail r/w open()' race. Some filesystems forego the use of normal vfs calls to create struct files. Make sure that these users elevate the mnt writer count because they will get __fput(), and we need to make sure they're balanced. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:25 -04:00
Dave Hansen	42a74f206b	[PATCH] r/o bind mounts: elevate write count for ioctls() Some ioctl()s can cause writes to the filesystem. Take these, and make them use mnt_want/drop_write() instead. [AV: updated] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:24 -04:00
Dave Hansen	20ddee2c75	[PATCH] r/o bind mounts: write count for file_update_time() Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:24 -04:00
Dave Hansen	74f9fdfa1f	[PATCH] r/o bind mounts: elevate write count for do_utimes() Now includes fix for oops seen by akpm. "never let a libc developer write your kernel code" - hch "nor, apparently, a kernel developer" - akpm Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:24 -04:00
Dave Hansen	cdb70f3f74	[PATCH] r/o bind mounts: write counts for touch_atime() Remove handling of NULL mnt while we are at it - that can't happen these days. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:23 -04:00
Dave Hansen	a761a1c03a	[PATCH] r/o bind mounts: elevate write count for ncp_ioctl() Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:23 -04:00
Dave Hansen	18f335aff8	[PATCH] r/o bind mounts: elevate write count for xattr_permission() callers This basically audits the callers of xattr_permission(), which calls permission() and can perform writes to the filesystem. [AV: add missing parts - removexattr() and nfsd posix acls, plug for a leak spotted by Miklos] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:29:15 -04:00
Dave Hansen	9079b1eb17	[PATCH] r/o bind mounts: get write access for vfs_rename() callers This also uses the little helper in the NFS code to make an if() a little bit less ugly. We introduced the helper at the beginning of the series. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:34 -04:00
Dave Hansen	75c3f29de7	[PATCH] r/o bind mounts: write counts for link/symlink [AV: add missing nfsd pieces] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:34 -04:00
Dave Hansen	463c319726	[PATCH] r/o bind mounts: get callers of vfs_mknod/create/mkdir() This takes care of all of the direct callers of vfs_mknod(). Since a few of these cases also handle normal file creation as well, this also covers some calls to vfs_create(). So that we don't have to make three mnt_want/drop_write() calls inside of the switch statement, we move some of its logic outside of the switch and into a helper function suggested by Christoph. This also encapsulates a fix for mknod(S_IFREG) that Miklos found. [AV: merged mkdir handling, added missing nfsd pieces] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:34 -04:00
Dave Hansen	0622753b80	[PATCH] r/o bind mounts: elevate write count for rmdir and unlink. Elevate the write count during the vfs_rmdir() and vfs_unlink(). [AV: merged rmdir and unlink parts, added missing pieces in nfsd] Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:33 -04:00
Dave Hansen	49e0d02cf0	[PATCH] r/o bind mounts: drop write during emergency remount The emergency remount code forcibly removes FMODE_WRITE from filps. The r/o bind mount code notices that this was done without a proper mnt_drop_write() and properly gives a warning. This patch does a mnt_drop_write() to keep everything balanced. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:33 -04:00
Dave Hansen	aceaf78da9	[PATCH] r/o bind mounts: create helper to drop file write access If someone decides to demote a file from r/w to just r/o, they can use this same code as __fput(). NFS does just that, and will use this in the next patch. AV: drop write access in __fput() only after we evict from file list. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Erez Zadok <ezk@cs.sunysb.edu> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J Bruce Fields" <bfields@fieldses.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:32 -04:00
Dave Hansen	8366025eb8	[PATCH] r/o bind mounts: stub functions This patch adds two function mnt_want_write() and mnt_drop_write(). These are used like a lock pair around and fs operations that might cause a write to the filesystem. Before these can become useful, we must first cover each place in the VFS where writes are performed with a want/drop pair. When that is complete, we can actually introduce code that will safely check the counts before allowing r/w<->r/o transitions to occur. Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:32 -04:00
Christoph Hellwig	a70e65df88	[PATCH] merge open_namei() and do_filp_open() open_namei() will, in the future, need to take mount write counts over its creation and truncation (via may_open()) operations. It needs to keep these write counts until any potential filp that is created gets __fput()'d. This gets complicated in the error handling and becomes very murky as to how far open_namei() actually got, and whether or not that mount write count was taken. That makes it a bad interface. All that the current do_filp_open() really does is allocate the nameidata on the stack, then call open_namei(). So, this merges those two functions and moves filp_open() over to namei.c so it can be close to its buddy: do_filp_open(). It also gets a kerneldoc comment in the process. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:32 -04:00
Dave Hansen	d57999e152	[PATCH] do namei_flags calculation inside open_namei() My end goal here is to make sure all users of may_open() return filps. This will ensure that we properly release mount write counts which were taken for the filp in may_open(). This patch moves the sys_open flags to namei flags calculation into fs/namei.c. We'll shortly be moving the nameidata_to_filp() calls into namei.c, and this gets the sys_open flags to a place where we can get at them when we need them. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-19 00:25:31 -04:00
Matthew Wilcox	6188e10d38	Convert asm/semaphore.h users to linux/semaphore.h Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-18 22:22:54 -04:00
Matthew Wilcox	cb688371e2	fs: Remove unnecessary inclusions of asm/semaphore.h None of these files use any of the functionality promised by asm/semaphore.h. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-18 22:16:44 -04:00
Linus Torvalds	334d094504	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26: (1090 commits) [NET]: Fix and allocate less memory for ->priv'less netdevices [IPV6]: Fix dangling references on error in fib6_add(). [NETLABEL]: Fix NULL deref in netlbl_unlabel_staticlist_gen() if ifindex not found [PKT_SCHED]: Fix datalen check in tcf_simp_init(). [INET]: Uninline the __inet_inherit_port call. [INET]: Drop the inet_inherit_port() call. SCTP: Initialize partial_bytes_acked to 0, when all of the data is acked. [netdrvr] forcedeth: internal simplifications; changelog removal phylib: factor out get_phy_id from within get_phy_device PHY: add BCM5464 support to broadcom PHY driver cxgb3: Fix __must_check warning with dev_dbg. tc35815: Statistics cleanup natsemi: fix MMIO for PPC 44x platforms [TIPC]: Cleanup of TIPC reference table code [TIPC]: Optimized initialization of TIPC reference table [TIPC]: Remove inlining of reference table locking routines e1000: convert uint16_t style integers to u16 ixgb: convert uint16_t style integers to u16 sb1000.c: make const arrays static sb1000.c: stop inlining largish static functions ...	2008-04-18 18:02:35 -07:00
Steve French	076d8423a9	[CIFS] Fix UNC path prefix on QueryUnixPathInfo to have correct slash When a share was in DFS and the server was Unix/Linux, we were sending paths of the form \\server\share/dir/file rather than //server/share/dir/file There was some discussion between me and jra over whether we should use /server/share/dir/file as MS sometimes says - but the documentation for this claims it should be doubleslash for this type of UNC-like path format and that works, so leaving it as doubleslash but converting the \ to / in the the //server/share portion. This gets Samba to now correctly return STATUS_PATH_NOT_COVERED when it is supposed to (Windows already did since the direction of the slash was not an issue for them). Still need another minor change to fully enable DFS (need to finish some chages to SMBGetDFSRefer Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-04-18 23:26:26 +00:00
Linus Torvalds	e675349e2b	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (64 commits) ocfs2/net: Add debug interface to o2net ocfs2: Only build ocfs2/dlm with the o2cb stack module ocfs2/cluster: Get rid of arguments to the timeout routines ocfs2: Put tree in MAINTAINERS ocfs2: Use BUG_ON ocfs2: Convert ocfs2 over to unlocked_ioctl ocfs2: Improve rename locking fs/ocfs2/aops.c: test for IS_ERR rather than 0 ocfs2: Add inode stealing for ocfs2_reserve_new_inode ocfs2: Add ac_alloc_slot in ocfs2_alloc_context ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits ocfs2: Enable cross extent block merge. ocfs2: Add support for cross extent block ocfs2: Move /sys/o2cb to /sys/fs/o2cb sysfs: Allow removal of symlinks in the sysfs root ocfs2: Reconnect after idle time out. ocfs2/dlm: Cleanup lockres print ocfs2/dlm: Fix lockname in lockres print function ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c ocfs2/dlm: Dumps the purgelist into a debugfs file ...	2008-04-18 10:15:22 -07:00
Linus Torvalds	ef38ff9d37	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (49 commits) [GFS2] fix assertion in log_refund() [GFS2] fix GFP_KERNEL misuses [GFS2] test for IS_ERR rather than 0 [GFS2] Invalidate cache at correct point [GFS2] fs/gfs2/recovery.c: suppress warnings [GFS2] Faster gfs2_bitfit algorithm [GFS2] Streamline quota lock/check for no-quota case [GFS2] Remove drop of module ref where not needed [GFS2] gfs2_adjust_quota has broken unstuffing code [GFS2] possible null pointer dereference fixup [GFS2] Need to ensure that sector_t is 64bits for GFS2 [GFS2] re-support special inode [GFS2] remove gfs2_dev_iops [GFS2] fix file_system_type leak on gfs2meta mount [GFS2] Allow bmap to allocate extents [GFS2] Fix a page lock / glock deadlock [GFS2] proper extern for gfs2/locking/dlm/mount.c:gdlm_ops [GFS2] gfs2/ops_file.c should #include "ops_inode.h" [GFS2] be*_add_cpu conversion [GFS2] Fix bug where we called drop_bh incorrectly ...	2008-04-18 10:02:46 -07:00
Steve French	2302aca850	[CIFS] Reserve new proxy cap for WAFS New WAFS filer uses ioctls which are shown to be available on a share by querying this info level Acked-by: Sam Liddicott <sam@liddicott.com> Signed-off-by: Stevef French <sfrench@us.ibm.com>	2008-04-18 16:40:32 +00:00
Sunil Mushran	2309e9e040	ocfs2/net: Add debug interface to o2net This patch exposes o2net information via debugfs. The information includes the list of sockets (sock_containers) as well as the list of outstanding messages (send_tracking). Useful for o2dlm debugging. (This patch is derived from an earlier one written by Zach Brown that exposed the same information via /proc.) [Mark: checkpatch fixes] Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Reviewed-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:20 -07:00
Mark Fasheh	93b06edb51	ocfs2: Only build ocfs2/dlm with the o2cb stack module fs/ocfs2/dlm/ocfs2_dlm.ko and fs/ocfs2/dlm/ocfs2_dlmfs.ko get built if CONFIG_FS_OCFS2 is specified. This isn't quite how it should happen any more - the "o2cb" dlm modules should only be built if CONFIG_FS_OCFS2_O2CB is set, so update the dlm Makefile accordingly. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com>	2008-04-18 08:56:12 -07:00
Jeff Mahoney	409753bf6d	ocfs2/cluster: Get rid of arguments to the timeout routines We keep seeing bug reports related to NULL pointer derefs in o2net_set_nn_state(). When I originally wrote up the configurable timeout patch, I had tried to plan for multiple clusters. This was silly. The timeout routines all use o2nm_single_cluster so there's no point in passing an argument at all. This patch removes the arguments and kills those bugs dead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:12 -07:00
Julia Lawall	b1f3550fa1	ocfs2: Use BUG_ON if (...) BUG(); should be replaced with BUG_ON(...) when the test has no side-effects to allow a definition of BUG_ON that drops the code completely. The semantic patch that makes this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @ disable unlikely @ expression E,f; @@ ( if (<... f(...) ...>) { BUG(); } \| - if (unlikely(E)) { BUG(); } + BUG_ON(E); ) @@ expression E,f; @@ ( if (<... f(...) ...>) { BUG(); } \| - if (E) { BUG(); } + BUG_ON(E); ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:11 -07:00
Andi Kleen	c9ec14884d	ocfs2: Convert ocfs2 over to unlocked_ioctl As far as I can see there is nothing in ocfs2_ioctl that requires the BKL, so use unlocked_ioctl Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:11 -07:00
Jan Kara	5dabd69515	ocfs2: Improve rename locking ocfs2_rename() was being too aggressive with the rename lock - we only need it for certain forms of directory rename. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:11 -07:00
Julia Lawall	58dadcdbc2	fs/ocfs2/aops.c: test for IS_ERR rather than 0 The function ocfs2_start_trans always returns either a valid pointer or a value made with ERR_PTR, so its result should be tested with IS_ERR, not with a test for 0. Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:11 -07:00
Tao Ma	4d0ddb2ce2	ocfs2: Add inode stealing for ocfs2_reserve_new_inode Inode allocation is modified to look in other nodes allocators during extreme out of space situations. We retry our own slot when space is freed back to the global bitmap, or whenever we've allocated more than 1024 inodes from another slot. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	a4a4891164	ocfs2: Add ac_alloc_slot in ocfs2_alloc_context In inode stealing, we no longer restrict the allocation to happen in the local node. So it is neccessary for us to add a new member in ocfs2_alloc_context to indicate which slot we are using for allocation. We also modify the process of local alloc so that this member can be used there also. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	ffda89a3bf	ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits In some cases(Inode stealing from other nodes), we may not want ocfs2_reserve_suballoc_bits to allocate new groups from the global_bitmap since it may already be full. So add a new parameter for this. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	ad5a4d7093	ocfs2: Enable cross extent block merge. In ocfs2_figure_merge_contig_type, we judge whether there exists a cross extent block merge and enable it by setting CONTIG_LEFT and CONTIG_RIGHT accordingly. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	677b975282	ocfs2: Add support for cross extent block In ocfs2_merge_rec_left, when we find the merge extent is "CONTIG_RIGHT" with the first extent record of the next extent block, we will merge it to the next extent block and change all the related extent blocks accordingly. In ocfs2_merge_rec_right, when we find the merge extent is "CONTIG_LEFT" with the last extent record of the previous extent block, we will merge it to the prevoius extent block and change all the related extent blocks accordingly. As for CONTIG_LEFTRIGHT, we will handle CONTIG_RIGHT first so that when the index is zero, the merge process will be more efficient and easier. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Mark Fasheh	52f7c21b61	ocfs2: Move /sys/o2cb to /sys/fs/o2cb /sys/fs is where we really want file system specific sysfs objects. Ocfs2-tools has been updated to look in /sys/fs/o2cb. We can maintain backwards compatibility with old ocfs2-tools by using a sysfs symlink. After some time (2 years), the symlink can be safely removed. This patch also adds documentation to make it easier for people to figure out what /sys/fs/o2cb is used for. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Mark Fasheh	a839c5afcd	sysfs: Allow removal of symlinks in the sysfs root Allow callers of sysfs_remove_link() to pass a NULL kobj, in which case sysfs_root will be used as the parent directory. This allows us to tear down top level symlinks created via sysfs_create_link(), which already has similar handling of a NULL parent object. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-18 08:56:10 -07:00
Tao Ma	5cc3bf2786	ocfs2: Reconnect after idle time out. Currently, o2net connects to a node on hb_up and disconnects on hb_down and net timeout. It disconnects on net timeout is ok, but it should attempt to reconnect back. This is because sometimes nodes get overloaded enough that the network connection breaks but the disk hb does not. And if we get into that situation, we either fence (unnecessarily) or wait for its disk hb to die (and sometimes hang in the process). So in this updated scheme, when the network disconnects, we keep attempting to reconnect till we succeed or we get a disk hb down event. If the other node is really dead, then we will eventually get a node down event. If not, we should be able to connect again and continue. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Sunil Mushran	8f50eb9789	ocfs2/dlm: Cleanup lockres print A previous patch added KERN_NOTICE to printks printing the lockres that cluttered the output. This patch removes the log level. For people concerned with syslog clutter, please note we now use this facility to print lockres only during an error. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	c834cdb157	ocfs2/dlm: Fix lockname in lockres print function __dlm_print_one_lock_resource was printing lockname incorrectly. Also, we now use printk directly instead of mlog as the latter prints the line context which is not useful for this print. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	e5a0334cbd	ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c This patch helps in consolidating debugging related functions in dlmdebug.c. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	7209300a9b	ocfs2/dlm: Dumps the purgelist into a debugfs file This patch dumps all the lockres' on the purgelist it can fit in one page into a debugfs file. Useful for debugging. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	d0129aceae	ocfs2/dlm: Dumps the mles into a debugfs file This patch dumps all mles it can fit in one page into a debugfs file. Useful for debugging. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	751155a953	ocfs2/dlm: Move struct dlm_master_list_entry to dlmcommon.h This patch moves some mle related definitions from dlmmaster.c to dlmcommon.h. Future patches need these definitions to dump mle debugging information. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.beckeroracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	4e3d24ed1a	ocfs2/dlm: Dumps the lockres' into a debugfs file This patch dumps all the lockres' alongwith all the locks into a debugfs file. Useful for debugging. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:09 -07:00
Sunil Mushran	007dce53a2	ocfs2/dlm: Dump the dlm state in a debugfs file This patch dumps the dlm state (dlm_ctxt) into a debugfs file. Useful for debugging. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:08 -07:00
Sunil Mushran	6325b4a22b	ocfs2/dlm: Create debugfs dirs This patch creates the debugfs directories that will hold the files to be used to dump the dlm state. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:08 -07:00
Sunil Mushran	29576f8bb5	ocfs2/dlm: Link all lockres' to a tracking list This patch links all the lockres' to a tracking list in dlm_ctxt. We will use this in an upcoming patch that will walk the entire list and to dump the lockres states to a debugfs file. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:08 -07:00
Sunil Mushran	724bdca9b8	ocfs2/dlm: Create slabcaches for lock and lockres This patch makes the o2dlm allocate memory for lockres, lockname and lock structures from slabcaches rather than kmalloc. This allows us to not only make these allocs more efficient but also allows us to track the memory being consumed by these structures. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:08 -07:00
Sunil Mushran	12eb0035d6	ocfs2/dlm: Rename slabcache dlm_mle_cache to o2dlm_mle This patch renames dlm_mle_slabcache to prevent namespace clashes with fs/dlm. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:08 -07:00
Joel Becker	9341d22942	ocfs2: Allow selection of cluster plug-ins. ocfs2 now supports plug-ins for the classic O2CB stack as well as userspace cluster stacks in conjunction with fs/dlm. This allows zero, one, or both of the plug-ins to be selected in Kconfig. For local mounts (non-clustered), neither plug-in is needed. Both plugins can be loaded at one time, the runtime will select the one needed for the cluster systme in use. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:07 -07:00
Joel Becker	b92eccdd28	ocfs2: Add kbuild for ocfs2_stack_user.ko Add ocfs2_stack_user.ko to the Makefile so that it builds. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:07 -07:00
Joel Becker	8f318311fa	ocfs2: Change mlog_bug_on to BUG_ON in ocfs2_lockid.h The masklog code is in the o2cb stack, but ocfs2_lockid.h now needs to be included by the user stack. The BUG() in ocfs2_lock_type_string() does not need masklog support, so change it to a regular BUG_ON(). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:07 -07:00
David Teigland	cf4d8d75d8	ocfs2: add fsdlm to stackglue Add code to use fs/dlm. [ Modified to be part of the stack_user module -- Joel ] Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:07 -07:00
Joel Becker	d4b95eef4d	ocfs2: Add the 'set version' message to the ocfs2_control device. The "SETV" message sets the filesystem locking protocol version as negotiated by the client. The client negotiates based on the maximum version advertised in /sys/fs/ocfs2/max_locking_protocol. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:07 -07:00
Joel Becker	3cfd4ab6b6	ocfs2: Add the local node id to the handshake. This is the second part of the ocfs2_control handshake. After negotiating the ocfs2_control protocol, the daemon tells the filesystem what the local node id is via the SETN message. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:06 -07:00
Joel Becker	de870ef022	ocfs2: Introduce the DOWN message to ocfs2_control When the control daemon sees a node go down, it sends a DOWN message through the ocfs2_control device. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:06 -07:00
Joel Becker	462c7e6a25	ocfs2: Start the ocfs2_control handshake. When a control daemon opens the ocfs2_control device, it must perform a handshake to tell the filesystem it is something capable of monitoring cluster status. Only after the handshake is complete will the filesystem allow mounts. This is the first part of the handshake. The daemon reads all supported ocfs2_control protocols, then writes in the protocol it will use. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:06 -07:00
Joel Becker	6427a72755	ocfs2: Add the ocfs2_control misc device. The ocfs2_control misc device is how a userspace control daemon (controld) talks to the filesystem. Introduce the bare-bones filesystem ops. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:06 -07:00
Joel Becker	8adf0536c9	ocfs2: Add the user stack module. Add a skeleton for the stack_user module. It's just the barebones module code. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:06 -07:00
Joel Becker	9c6c877c04	ocfs2: Add the 'cluster_stack' sysfs file. Userspace can now query and specify the cluster stack in use via the /sys/fs/ocfs2/cluster_stack file. By default, it is 'o2cb', which is the classic stack. Thus, old tools that do not know how to modify this file will work just fine. The stack cannot be modified if there is a live filesystem. ocfs2_cluster_connect() now takes the expected cluster stack as an argument. This way, the filesystem and the stack glue ensure they are speaking to the same backend. If the stack is 'o2cb', the o2cb stack plugin is used. For any other value, the fsdlm stack plugin is selected. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	b61817e116	ocfs2: Add the USERSPACE_STACK incompat bit. The filesystem gains the USERSPACE_STACK incomat bit and the s_cluster_info field on the superblock. When a userspace stack is in use, the name of the stack is stored on-disk for mount-time verification. The "cluster_stack" option is added to mount(2) processing. The mount process needs to pass the matching stack name. If the passed name and the on-disk name do not match, the mount is failed. When using the classic o2cb stack, the incompat bit is not set and no mount option is used other than the usual heartbeat=local. Thus, the filesystem is compatible with older tools. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	74ae4e104d	ocfs2: Create stack glue sysfs files. Introduce a set of sysfs files that describe the current stack glue state. The files live under /sys/fs/ocfs2. The locking_protocol file displays the version of ocfs2's locking code. The loaded_cluster_plugins file displays all of the currently loaded stack plugins. When filesystems are mounted, the active_cluster_plugin file will display the plugin in use. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	286eaa95c5	ocfs2: Break out stackglue into modules. We define the ocfs2_stack_plugin structure to represent a stack driver. The o2cb stack code is split into stack_o2cb.c. This becomes the ocfs2_stack_o2cb.ko module. The stackglue generic functions are similarly split into the ocfs2_stackglue.ko module. This module now provides an interface to register drivers. The ocfs2_stack_o2cb driver registers itself. As part of this interface, ocfs2_stackglue can load drivers on demand. This is accomplished in ocfs2_cluster_connect(). ocfs2_cluster_disconnect() is now notified when a _hangup() is pending. If a hangup is pending, it will not release the driver module and will let _hangup() do that. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-04-18 08:56:05 -07:00
Joel Becker	e3dad42bf9	ocfs2: Create ocfs2_stack_operations and split out the o2cb stack. Define the ocfs2_stack_operations structure. Build o2cb_stack_ops from all of the o2cb-specific stack functions. Change the generic stack glue functions to call the stack_ops instead of the o2cb functions directly. The o2cb functions are moved to stack_o2cb.c. The headers are cleaned up to where only needed headers are included. In this code, stackglue.c and stack_o2cb.c refer to some shared extern variables. When they become modules, that will change. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	553aa7e408	ocfs2: Split o2cb code from generic stack functions. Split off the o2cb-specific funtionality from the generic stack glue calls. This is a precurser to wrapping the o2cb functionality in an operations vector. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	63e0c48ae6	ocfs2: Clean up stackglue initialization The stack glue initialization function needs a better name so that it can be used cleanly when stackglue becomes a module. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:05 -07:00
Joel Becker	cf0acdcd64	ocfs2: Abstract out a debugging function for underlying dlms. dlmglue.c was still referencing a raw o2dlm lksb in one instance. Let's create a generic ocfs2_dlm_dump_lksb() function. This allows underlying DLMs to print whatever they want about their lock. We then move the o2dlm dump into stackglue.c where it belongs. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
David Teigland	1693a5c011	ocfs2: handle async EAGAIN from NOQUEUE request When using fsdlm, -EAGAIN is returned in the async callback for NOQUEUE requests. Fix up dlmglue to expect this. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	de551246e7	ocfs2: Remove CANCELGRANT from the view of dlmglue. o2dlm has the non-standard behavior of providing a cancel callback (unlock_ast) even when the cancel has failed (the locking operation succeeded without canceling). This is called CANCELGRANT after the status code sent to the callback. fs/dlm does not provide this callback, so dlmglue must be changed to live without it. o2dlm_unlock_ast_wrapper() in stackglue now ignores CANCELGRANT calls. Because dlmglue no longer sees CANCELGRANT, ocfs2_unlock_ast() no longer needs to check for it. ocfs2_locking_ast() must catch that a cancel was tried and clear the cancel state. Making these changes opens up a locking race. dlmglue uses the the OCFS2_LOCK_BUSY flag to ensure only one thread is calling the dlm at any one time. But dlmglue must unlock the lockres before calling into the dlm. In the small window of time between unlocking the lockres and calling the dlm, the downconvert thread can try to cancel the lock. The downconvert thread is checking the OCFS2_LOCK_BUSY flag - it doesn't know that ocfs2_dlm_lock() has not yet been called. Because ocfs2_dlm_lock() has not yet been called, the cancel operation will just be a no-op. There's nothing to cancel. With CANCELGRANT, dlmglue uses the CANCELGRANT callback to clear up the cancel state. When it comes around again, it will retry the cancel. Eventually, the first thread will have called into ocfs2_dlm_lock(), and either the lock or the cancel will succeed. The downconvert thread can then do its downconvert. Without CANCELGRANT, there is nothing to clean up the cancellation state. The downconvert thread does not know to retry its operations. More importantly, the original lock may be blocking on the other node that is trying to cancel us. With neither able to make progress, the ast is never called and the cancellation state is never cleaned up that way. dlmglue is deadlocked. The OCFS2_LOCK_PENDING flag is introduced to remedy this window. It is set at the same time OCFS2_LOCK_BUSY is. Thus, the downconvert thread can check whether the lock is cancelable. If not, it just loops around to try again. Once ocfs2_dlm_lock() is called, the thread then clears OCFS2_LOCK_PENDING and wakes the downconvert thread. Now, if the downconvert thread finds the lock BUSY, it can safely try to cancel it. Whether the cancel works or not, the state will be properly set and the lock processing can continue. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Mark Fasheh	0abd6d1803	ocfs2: Fill node number during cluster stack init It doesn't make sense to query for a node number before connecting to the cluster stack. This should be safe to do because node_num is only just printed, and we're actually only moving the setting of node num a small amount further in the mount process. [ Disconnect when node query fails -- Joel ] Reviewed-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	6953b4c008	ocfs2: Move o2hb functionality into the stack glue. The last bit of classic stack used directly in ocfs2 code is o2hb. Specifically, the check for heartbeat during mount and the call to ocfs2_hb_ctl during unmount. We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call to ocfs2_hb_ctl. Other stacks will just leave hangup() empty. The check for heartbeat is moved into ocfs2_cluster_connect(). It will be matched by a similar check for other stacks. With this change, only stackglue.c includes cluster/ headers. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	19fdb624dc	ocfs2: Abstract out node number queries. ocfs2 asks the cluster stack for the local node's node number for two reasons; to fill the slot map and to print it. While the slot map isn't necessary for userspace cluster stacks, the printing is very nice for debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get this value. It is anticipated that the slot map will not be used under a userspace cluster stack, so validity checks of the node num only need to exist in the slot map code. Otherwise, it just gets used and printed as an opaque value. [ Fixed up some "int" versus "unsigned int" issues and made osb->node_num truly opaque. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	4670c46ded	ocfs2: Introduce the new ocfs2_cluster_connect/disconnect() API. This step introduces a cluster stack agnostic API for initializing and exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to connect to the stack. It is all handled in stackglue.c. heartbeat.c no longer needs to know how it gets called. ocfs2_do_node_down() is now a clean recovery trigger. The big gotcha is the ordering of initializations and de-initializations done underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all o2dlm initialization in one block. Thus, the o2dlm functionality of ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(), however, did a few things between de-registration of the eviction callback and actually shutting down the domain. Now de-registration and shutdown of the domain are wrapped within the single ocfs2_cluster_disconnect() call. I've checked the code paths to make sure we can safely tear down things in ocfs2_dlm_shutdown() before calling ocfs2_cluster_disconnect(). The filesystem has already set itself to ignore the callback. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	8f2c9c1b16	ocfs2: Create the lock status block union. Wrap the lock status block (lksb) in a union. Later we will add a union element for the fs/dlm lksb. Create accessors for the status and lvb fields. Other than a debugging function, dlmglue.c does not directly reference the o2dlm locking path anymore. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:04 -07:00
Joel Becker	7431cd7e8d	ocfs2: Use -errno instead of dlm_status for ocfs2_dlm_lock/unlock() API. Change the ocfs2_dlm_lock/unlock() functions to return -errno values. This is the first step towards elminiating dlm_status in fs/ocfs2/dlmglue.c. The change also passes -errno values to ->unlock_ast(). [ Fix a return code in dlmglue.c and change the error translation table into an array of ints. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	bd3e76105d	ocfs2: Use global DLM_ constants in generic code. The ocfs2 generic code should use the values in <linux/dlmconstants.h>. stackglue.c will convert them to o2dlm values. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	24ef1815e5	ocfs2: Separate out dlm lock functions. This is the first in a series of patches to isolate ocfs2 from the underlying cluster stack. Here we wrap the dlm locking functions with ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status callbacks, we can eliminate the callbacks from the filesystem visible functions. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	386a2ef857	ocfs2: New slot map format The old slot map had a few limitations: - It was limited to one block, so the maximum slot count was 255. - Each slot was signed 16bits, limiting node numbers to INT16_MAX. - An empty slot was marked by the magic 0xFFFF (-1). The new slot map format provides 32bit node numbers (UINT32_MAX), a separate space to mark a slot in use, and extra room to grow. The slot map is now bounded by i_size, not a block. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	fb86b1f071	ocfs2: Define the contents of the slot_map file. The slot map file is merely an array of __le16. Wrap it in a structure for cleaner reference. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	fc881fa0d5	ocfs2: De-magic the in-memory slot map. The in-memory slot map uses the same magic as the on-disk one. There is a special value to mark a slot as invalid. It relies on the size of certain types and so on. Write a new in-memory map that keeps validity as a separate field. Outside of the I/O functions, OCFS2_INVALID_SLOT now means what it is supposed to. It also is no longer tied to the type size. This also means that only the I/O functions refer to 16bit quantities. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:03 -07:00
Joel Becker	1c8d9a6a33	ocfs2: slot_map I/O based on max_slots. The slot map code assumed a slot_map file has one block allocated. This changes the code to I/O as many blocks as will cover max_slots. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:02 -07:00
Joel Becker	553abd046a	ocfs2: Change the recovery map to an array of node numbers. The old recovery map was a bitmap of node numbers. This was sufficient for the maximum node number of 254. Going forward, we want node numbers to be UINT32. Thus, we need a new recovery map. Note that we can't keep track of slots here. We must write down the node number to recovery before we get the locks needed to convert a node number into a slot number. The recovery map is now an array of unsigned ints, max_slots in size. It moves to journal.c with the rest of recovery. Because it needs to be initialized, we move all of recovery initialization into a new function, ocfs2_recovery_init(). This actually cleans up ocfs2_initialize_super() a little as well. Following on, recovery cleaup becomes part of ocfs2_recovery_exit(). A number of node map functions are rendered obsolete and are removed. Finally, waiting on recovery is wrapped in a function rather than naked checks on the recovery_event. This is a cleanup from Mark. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:02 -07:00
Joel Becker	d85b20e4b3	ocfs2: Make ocfs2_slot_info private. Just use osb_lock around the ocfs2_slot_info data. This allows us to take the ocfs2_slot_info structure private in slot_info.c. All access is now via accessors. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:02 -07:00
Mark Fasheh	8e8a4603b5	ocfs2: Move slot map access into slot_map.c journal.c and dlmglue.c would refresh the slot map by hand. Instead, have the update and clear functions do the work inside slot_map.c. The eventual result is to make ocfs2_slot_info defined privately in slot_map.c Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:02 -07:00
Linus Torvalds	253ba4e79e	Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6 * 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (87 commits) [XFS] Fix merge failure [XFS] The forward declarations for the xfs_ioctl() helpers and the [XFS] Update XFS documentation for noikeep/ikeep. [XFS] Update XFS Documentation for ikeep and ihashsize [XFS] Remove unused HAVE_SPLICE macro. [XFS] Remove CONFIG_XFS_SECURITY. [XFS] xfs_bmap_compute_maxlevels should be based on di_forkoff [XFS] Always use di_forkoff when checking for attr space. [XFS] Ensure the inode is joined in xfs_itruncate_finish [XFS] Remove periodic logging of in-core superblock counters. [XFS] fix logic error in xfs_alloc_ag_vextent_near() [XFS] Don't error out on good I/Os. [XFS] Catch log unmount failures. [XFS] Sanitise xfs_log_force error checking. [XFS] Check for errors when changing buffer pointers. [XFS] Don't allow silent errors in xfs_inactive(). [XFS] Catch errors from xfs_imap(). [XFS] xfs_bulkstat_one_dinode() never returns an error. [XFS] xfs_iflush_fork() never returns an error. [XFS] Catch unwritten extent conversion errors. ...	2008-04-18 08:39:39 -07:00
Linus Torvalds	4adeaaf51e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: replace __inline with inline jfs: le*_add_cpu conversion	2008-04-18 08:37:19 -07:00
Roel Kluin	62be1f7167	[GFS2] fix assertion in log_refund() since unsigned, unused >= 0 is always true. Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-04-18 08:36:09 +01:00
David S. Miller	1e42198609	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6	2008-04-17 23:56:30 -07:00
Lachlan McIlroy	65e67f5165	[XFS] Fix merge failure	2008-04-18 12:59:45 +10:00
Lachlan McIlroy	3b2816be27	[XFS] The forward declarations for the xfs_ioctl() helpers and the associated comment about gcc behavior really aren't needed; all of these functions are marked STATIC which includes noinline, and the stack usage won't be a problem. This effectively just removes the forward declarations and moves xfs_ioctl() back to the end of the file. SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30534a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:43:35 +10:00
Donald Douwsma	e687330b5e	[XFS] Remove unused HAVE_SPLICE macro. HAVE_SPLICE was part of the infrastructure for building 2.4 and 2.6 kernels out of the same tree. Now we don't build 2.4 kernels this SGI-PV: 971046 SGI-Modid: xfs-linux-melb:xfs-kern:30878a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:04:29 +10:00
Eric Sandeen	f7d3c34788	[XFS] Remove CONFIG_XFS_SECURITY. There is no point to the CONFIG_XFS_SECURITY option; it disables the ability to set security attributes at runtime, but it does not actually slim down or remove any code for runtime. Just remove it and always allow security attributes to be set. SGI-PV: 980310 SGI-Modid: xfs-linux-melb:xfs-kern:30877a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:04:19 +10:00
Tim Shimmin	6d1337b29b	[XFS] xfs_bmap_compute_maxlevels should be based on di_forkoff Fix up xfs_bmap_compute_maxlevels() to account for the case when we go from using attr2 to using attr1. In that case attr1 will no longer necessarily be at m_attr_offset>>3, but could be at a different value for di_forkoff. Therefore, we return the worst case scenario using MINDBTPTRS and MINABTPTRS, as this function is used for determining the maximum log space. SGI-PV: 979606 SGI-Modid: xfs-linux-melb:xfs-kern:30862a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:04:08 +10:00
Eric Sandeen	cb49dbb130	[XFS] Always use di_forkoff when checking for attr space. In the case where we mount a filesystem which was previously using the attr2 format as attr1, returning the default mp->m_attroffset instead of the per-inode di_forkoff for inline attribute fit calculations, may result in corruption, if for example, the data fork is already taking more space than the default fork offset and we try to add an extended attribute. Fix tested by xfstests/186. SGI-PV: 979606 SGI-Modid: xfs-linux-melb:xfs-kern:30861a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:03:40 +10:00
David Chinner	f6485057c5	[XFS] Ensure the inode is joined in xfs_itruncate_finish On success, we still need to join the inode to the current transaction in xfs_itruncate_finish(). Fixes regression from error handling changes. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30845a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:03:26 +10:00
David Chinner	7e20694d91	[XFS] Remove periodic logging of in-core superblock counters. xfssyncd triggers the logging of superblock counters every 30s if the filesystem is made with lazy-count=1. This will prevent disks from idling and spinning down as there will be a log write every 30s. With the way counter recovery works for lazy-count=1, this code is unnecessary and provides no real benefit, so just remove it. SGI-PV: 980145 SGI-Modid: xfs-linux-melb:xfs-kern:30840a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Barry Naujok <bnaujok@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:03:12 +10:00
David Chinner	e6430037e9	[XFS] fix logic error in xfs_alloc_ag_vextent_near() Fix a logic error in xfs_alloc_ag_vextent_near(). This is a regression introduced by the error handling changes. SGI-PV: 890084 SGI-Modid: xfs-linux-melb:xfs-kern:30838a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Barry Naujok <bnaujok@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:03:02 +10:00
David Chinner	d4055947bd	[XFS] Don't error out on good I/Os. xfsbdstrat() made all I/Os error out, good or bad. Fix it. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30836a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:02:41 +10:00
David Chinner	1bb7d6b5a8	[XFS] Catch log unmount failures. Unmounting the log can fail. unlikely, but it can. Catch all the error conditions an make sure it's propagated upwards. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30833a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:02:30 +10:00
David Chinner	b911ca0472	[XFS] Sanitise xfs_log_force error checking. xfs_log_force() is declared to return an error, but we almost never check it. We don't need to check it in most cases; if there's a log I/O error then we'll be shutting down the filesystem anyway and that means we'll catch the error somewhere else. However, on certain calls we should be returning an error - sync transactions, fsync, sync writes, etc. so this isn't a pure black and white distinction. Hence make xfs_log_force() a void function that issues a warning to the syslog on error, and call _xfs_log_force() in all the places where we actually care about the error status returned. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30832a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:02:20 +10:00
David Chinner	234f56aca2	[XFS] Check for errors when changing buffer pointers. xfs_buf_associate_memory() can fail, but the return is never checked. Propagate the error through XFS_BUF_SET_PTR() so that failures are detected. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30831a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:02:10 +10:00
David Chinner	78e9da77f1	[XFS] Don't allow silent errors in xfs_inactive(). xfs_inactive() fails to report errors when committing the inactive transaction. Hence we can get silent failures either finishing off the truncation or committing the transaction. Even if we get errors, we need to continue, so simply warn loudly to the system if we get errors here. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30830a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:01:58 +10:00
David Chinner	64bfe1bfae	[XFS] Catch errors from xfs_imap(). Catch errors from xfs_imap() in log recovery when we might be trying to map an invalid inode number due to a corrupted log. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30829a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:01:39 +10:00
David Chinner	7b07339048	[XFS] xfs_bulkstat_one_dinode() never returns an error. Mark it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30828a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:01:27 +10:00
David Chinner	e4ac967b11	[XFS] xfs_iflush_fork() never returns an error. xfs_iflush_fork() never returns an error. Mark it void and clean up the code calling it that checks for errors. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30827a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:01:11 +10:00
David Chinner	cc88466f3f	[XFS] Catch unwritten extent conversion errors. On unwritten I/O completion, we fail to propagate an error when converting the extent to a written extent. This means that the I/O silently fails. propagate the error onto the ioend so that the inode is marked with an error appropriately. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30826a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:00:58 +10:00
David Chinner	958d4ec606	[XFS] xfs_bdwrite() does not return errors. xfs_bdwrite() cannot return an error; it only queues buffers to the delayed write list and as such never encounters anything that can fail. Mark it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30825a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:00:46 +10:00
David Chinner	db7a19f2c8	[XFS] Ensure xfs_bawrite() errors are checked. xfs_bawrite() can return immediate error status on async writes. Unlike xfsbdstrat() we don't ever check the error on the buffer after the call, so we currently do not catch errors at all here. Ensure we catch and propagate or warn to the syslog about up-front async write errors. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30824a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:00:35 +10:00
David Chinner	d64e31a2f5	[XFS] Ensure errors from xfs_bdstrat() are correctly checked. xfsbdstrat() is declared to return an error. That is never checked because the error is propagated by the xfs_buf_t that is passed through the function. Mark xfsbdstrat() as returning void and comment the prototype on the methods needed for error checking. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30823a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:00:24 +10:00
Barry Naujok	556b8b166c	[XFS] remove bhv_vname_t and xfs_rename code SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30804a Signed-off-by: Barry Naujok <bnaujok@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 12:00:12 +10:00
David Chinner	7c9ef85c56	[XFS] Catch errors returned from xfs_bmap_last_offset(). xfs_bmap_last_offset() can fail and return an error. xfs_iomap_write_allocate() fails to detect and propagate the error. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30802a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:59:45 +10:00
David Chinner	fc6149d8d9	[XFS] Check for xfs_free_extent() failing. xfs_free_extent() can fail, but log recovery never bothers to check if it successfully free the extent it was supposed to. This could lead to silent corruption during log recovery. Abort log recovery if we fail to free an extent. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30801a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:59:23 +10:00
David Chinner	d87dd6360d	[XFS] Warn if errors come from block_truncate_page(). block_truncate_page() can return errors that we currently ignore and silently discard. We should not ever get errors reported here - an error indicates a bug somewhere else. Hence catch the error and issue a stack dump to the syslog because we cannot propagate the error any further up the call chain. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30800a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:59:12 +10:00
David Chinner	c2b1cba683	[XFS] xfs_bmap_adjacent() never returns an error. Mark it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30798a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:58:46 +10:00
David Chinner	12375c8237	[XFS] Make xfs_alloc_compute_aligned() void. xfs_alloc_compute_aligned() returns a value based on a comparison of the computed extent length and the minimum length allowed. This is only used by some callers - the other four return parameters are used more often. Hence move the comparison to the code that actually needs to do it and make xfs_alloc_compute_aligned() a void function. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30797a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:58:36 +10:00
David Chinner	f4586e4061	[XFS] Clean up xfs_alloc_search_busy() return values. xfs_alloc_search_busy() returns an index into the busy array if the extent was found in the array. This is never checked, and the xfs_alloc_search_busy() does a log force to prevent reuse of the extent before the free transaction hits the disk. Hence the return value is useless. Declare the function void and remove the slot number from the tracing as well. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30796a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:58:27 +10:00
David Chinner	e5720eec05	[XFS] Propagate errors from xfs_trans_commit(). xfs_trans_commit() can return errors when there are problems in the transaction subsystem. They are indicative that the entire transaction may be incomplete, and hence the error should be propagated as there is a good possibility that there is something fatally wrong in the filesystem. Catch and propagate or warn about commit errors in the places where they are currently ignored. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30795a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:58:17 +10:00
David Chinner	3c1e2bbe5b	[XFS] Propagate xfs_trans_reserve() errors. xfs_trans_reserve() reports errors that should not be ignored. For example, a shutdown filesystem will report errors through xfs_trans_reserve() to prevent further changes from being attempted on a damaged filesystem. Catch and propagate all error conditions from xfs_trans_reserve(). SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30794a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:58:08 +10:00
David Chinner	5ca1f261a0	[XFS] Catch errors from xfs_acl_vremove(). Removing an ACL can return an error. Propagate it. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30793a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:57 +10:00
David Chinner	0c92829967	[XFS] Catch errors from xfs_acl_setmode(). Propagate the error status from xfs_acl_setmode() so that callers know if the ACl was set correctly or not. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30792a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:46 +10:00
David Chinner	88ab020853	[XFS] Propagate quota file truncation errors. Truncating the quota files can silently fail. Ensure that truncation errors are propagated to the callers. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30791a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:36 +10:00
David Chinner	cb6edc26c3	[XFS] Catch errors when turning off quotas. When turning off quota, we need to write various transactions to the log to ensure that they are cleanly removed in the case of a crash. We need to check that the transactions hit the disk correctly. If we fail to write the final quota off transaction, we are corrupt in memory and so the only option is to shut the filesystem down at this point. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30790a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:26 +10:00
David Chinner	31d5577b35	[XFS] Catch errors resetting quota flags. Warn to the syslog if we fail to reset the quota flags in the superblock when a quota check fails. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30789a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:16 +10:00
David Chinner	53aa7915d6	[XFS] Clean up quotamount error handling. xfs_qm_mount_quotas() returns an error status that is ignored. If we fail to mount quotas, we continue with quota's turned off, which is all handled inside xfs_qm_mount_quotas(). Mark it as void to indicate that errors need not be returned to the callers. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30788a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:57:05 +10:00
David Chinner	3c56836f92	[XFS] Check for dquot flush errors xfs_qm_dqflush() can fail, but the return is not checked anywhere. Hence we never know if we've failed to flush a dquot to disk. Propagate the error and warn to the syslog if a flush ever fails. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30787a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:56:55 +10:00
David Chinner	4b8879df8c	[XFS] Propagate xfs_qm_dqflush_all() errors. xfs_qm_dqflush_all() can return flush errors. Ensure they are propagated into the quotacheck code to determine if the quotacheck succeeded or not. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30786a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:54:56 +10:00
David Chinner	5b1397385b	[XFS] xfs_qm_reset_dqcounts() does not return errors. Declare it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30785a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:54:06 +10:00
David Chinner	714082bc12	[XFS] Report errors from xfs_reserve_blocks(). xfs_reserve_blocks() can fail in interesting ways. In neither case is it a fatal error, but the result can lead to sub-optimal behaviour. Warn to the syslog if the call fails but otherwise continue. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30784a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:53:51 +10:00
David Chinner	36fbe6e6bd	[XFS] xfs_icsb_counter_disabled() never returns an error. Mark it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30782a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:52:43 +10:00
David Chinner	a414047fc9	[XFS] Remove useless whitespace in function prototypes Makes it simpler to annotate function prototypes with __must_check via sed scripts. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30781a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:58 +10:00
David Chinner	3c85c36cc2	[XFS] xfs_quiesce_fs() never returns an error. Mark it void. SGI-PV: 980084 SGI-Modid: xfs-linux-melb:xfs-kern:30780a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:46 +10:00
Christoph Hellwig	b6ddc4e6fe	[XFS] Don't validate symlink target component length This target component validation is not POSIX conformant and it is not done by any other Linux filesystem so remove it from XFS. SGI-PV: 980080 SGI-Modid: xfs-linux-melb:xfs-kern:30776a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:36 +10:00
Harvey Harrison	34a622b2e1	[XFS] replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30775a Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:26 +10:00
Harvey Harrison	0225da1f35	[XFS] Replace __inline with inline Remove the remaining uses of __inline in the XFS code base. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30774a Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:15 +10:00
David Chinner	6b1d1a732f	[XFS] Fix lock inversion in forced shutdown. Recent changes to xlog_state_release_iclog() placed the grant_lock inside the icloglock. forced unmount of the log does this the opposite way around, but does not depend on the order for correct working. Fix the inversion by changing the order locks are gained in xfs_log_force_umount(). SGI-PV: 979661 SGI-Modid: xfs-linux-melb:xfs-kern:30773a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:51:04 +10:00
David Chinner	4679b2d36d	[XFS] Reorganise xlog_t for better cacheline isolation of contention To reduce contention on the log in large CPU count, separate out different parts of the xlog_t structure onto different cachelines. Move each lock onto a different cacheline along with all the members that are accessed/modified while that lock is held. Also, move the debugging code into debug code. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30772a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:50:53 +10:00
David Chinner	eb01c9cd87	[XFS] Remove the xlog_ticket allocator The ticket allocator is just a simple slab implementation internal to the log. It requires the icloglock to be held when manipulating it and this contributes to contention on that lock. Just kill the entire allocator and use a memory zone instead. While there, allow us to gracefully fail allocation with ENOMEM. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30771a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:50:39 +10:00
David Chinner	114d23aae5	[XFS] Per iclog callback chain lock Rather than use the icloglock for protecting the iclog completion callback chain, use a new per-iclog lock so that walking the callback chain doesn't require holding a global lock. This reduces contention on the icloglock during transaction commit and log I/O completion by reducing the number of times we need to hold the global icloglock during these operations. SGI-PV: 978729 SGI-Modid: xfs-linux-melb:xfs-kern:30770a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:50:22 +10:00
Lachlan McIlroy	2abdb8c881	[XFS] Prevent xfs_bmap_check_leaf_extents() referencing unmapped memory. While investigating the extent corruption bug I ran into this bug in debug only code. xfs_bmap_check_leaf_extents() loops through the leaf blocks of the extent btree checking that every extent is entirely before the next extent. It also compares the last extent in the previous block to the first extent in the current block when the previous block has been released and potentially unmapped. So take a copy of the last extent instead of a pointer. Also move the last extent check out of the loop because we only need to do it once. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30718a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>	2008-04-18 11:49:51 +10:00
Christoph Hellwig	433550990e	[XFS] remove most calls to VN_RELE Most VN_RELE calls either directly contain a XFS_ITOV or have the corresponding xfs_inode already in scope. Use the IRELE helper instead of VN_RELE to clarify the code. With a little more work we can kill VN_RELE altogether and define IRELE in terms of iput directly. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30710a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:49:08 +10:00
Lachlan McIlroy	df26cfe849	[XFS] split xfs_ioc_xattr The three subcases of xfs_ioc_xattr don't share any semantics and almost no code, so split it into three separate helpers. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30709a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:44:03 +10:00
Christoph Hellwig	f3dcc13f6f	[XFS] cleanup root inode handling in xfs_fs_fill_super - rename rootvp to root for clarify - remove useless vn_to_inode call - check is_bad_inode before calling d_alloc_root - use iput instead of VN_RELE in the error case SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30708a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:42:36 +10:00
David Chinner	59a33f9f77	[XFS] Ensure a btree insert returns a valid cursor. When writing into preallocated regions there is a case where XFS can oops or hang doing the unwritten extent conversion on I/O completion. It turns out that the problem is related to the btree cursor being invalid. When we do an insert into the tree, we may need to split blocks in the tree. When we only split at the leaf level (i.e. level 0), everything works just fine. However, if we have a multi-level split in the btreee, the cursor passed to the insert function is no longer valid once the insert is complete. The leaf level split is handled correctly because all the operations at level 0 are done using the original cursor, hence it is updated correctly. However, when we need to update the next level up the tree, we don't use that cursor - we use a cloned cursor that points to the index in the next level up where we need to do the insert. Hence if we need to split a second level, the changes to the tree are reflected in the cloned cursor and not the original cursor. This clone-and-move-up-a-level-on-split behaviour recurses all the way to the top of the tree. The complexity here is that these cloned cursors do not point to the original index that was inserted - they point to the newly allocated block (the right block) and the original cursor pointer to that level may still point to the left block. Hence, without deep examination of the cloned cursor and buffers, we cannot update the original cursor with the new path from the cloned cursor. In these cases the original cursor could be pointing to the wrong block(s) and hence a subsequent modification to the tree using that cursor will lead to corruption of the tree. The crash case occurs when the tree changes height - we insert a new level in the tree, and the cursor does not have a buffer in it's path for that level. Hence any attempt to walk back up the cursor to the root block will result in a null pointer dereference. To make matters even more complex, the BMAP BT is rooted in an inode, so we can have a change of height in the btree without a root split. That is, if the root block in the inode is full when we split a leaf node, we cannot fit the pointer to the new block in the root, so we allocate a new block, migrate all the ptrs out of the inode into the new block and point the inode root block at the newly allocated block. This changes the height of the tree without a root split having occurred and hence invalidates the path in the original cursor. The patch below prevents xfs_bmbt_insert() from returning with an invalid cursor by detecting the cases that invalidate the original cursor and refresh it by do a lookup into the btree for the original index we were inserting at. Note that the INOBT, AGFBNO and AGFCNT btree implementations also have this bug, but the cursor is currently always destroyed or revalidated after an insert for those trees. Hence this patch only address the problem in the BMBT code. SGI-PV: 979339 SGI-Modid: xfs-linux-melb:xfs-kern:30701a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:42:21 +10:00
David Chinner	75de2a91c9	[XFS] Account for inode cluster alignment in all allocations At ENOSPC, we can get a filesystem shutdown due to a cancelling a dirty transaction in xfs_mkdir or xfs_create. This is due to the initial allocation attempt not taking into account inode alignment and hence we can prepare the AGF freelist for allocation when it's not actually possible to do an allocation. This results in inode allocation returning ENOSPC with a dirty transaction, and hence we shut down the filesystem. Because the first allocation is an exact allocation attempt, we must tell the allocator that the alignment does not affect the allocation attempt. i.e. we will accept any extent alignment as long as the extent starts at the block we want. Unfortunately, this means that if the longest free extent is less than the length + alignment necessary for fallback allocation attempts but is long enough to attempt a non-aligned allocation, we will modify the free list. If we then have the exact allocation fail, all other allocation attempts will also fail due to the alignment constraint being taken into account. Hence the initial attempt needs to set the "alignment slop" field so that alignment, while not required, must be taken into account when determining if there is enough space left in the AG to do the allocation. That means if the exact allocation fails, we will not dirty the freelist if there is not enough space available fo a subsequent allocation to succeed. Hence we get an ENOSPC error back to userspace without shutting down the filesystem. SGI-PV: 978886 SGI-Modid: xfs-linux-melb:xfs-kern:30699a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:42:09 +10:00
Josef 'Jeff' Sipek	535f6b3735	[XFS] Replace custom AIL linked-list code with struct list_head Replace the xfs_ail_entry_t with a struct list_head and clean the surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail() by terminating the loop at the head of the list correctly. SGI-PV: 978682 SGI-Modid: xfs-linux-melb:xfs-kern:30636a Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:57 +10:00
Christoph Hellwig	a45c796867	[XFS] Remove superflous xfs_readsb call in xfs_mountfs. When xfs_mountfs is called by xfs_mount xfs_readsb was called 35 lines above unconditionally, so there is no need to try to read the superblock if it's not present. If any other port doesn't have the superblock read at this point it should just call it directly from it's xfs_mount equivalent. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30603a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:46 +10:00
Niv Sardi	dfa18b1179	[XFS] kill t_sema member of struct xfs_trans It's completely unused so we might aswell kill it. Note that there is another t_sema in struct xlog_ticket, which is used and actually an sv_t despite the name. That one is left untouched by this patch. SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30591a Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:35 +10:00
Christoph Hellwig	5f90150aba	[XFS] cleanup vnode use in xfs_bmap.c SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30553a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:25 +10:00
Christoph Hellwig	af048193fc	[XFS] cleanup vnode use in xfs_iops.c SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30552a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:14 +10:00
Christoph Hellwig	dcf49cc5cf	[XFS] cleanup vnode use in xfs_lrw.c SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30551a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:41:04 +10:00
Christoph Hellwig	ef1f5e7ad3	[XFS] cleanup vnode use in xfs_lookup SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30550a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:55 +10:00
Christoph Hellwig	3937be5ba8	[XFS] cleanup vnode use in xfs_symlink and xfs_rename SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30548a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:45 +10:00
Christoph Hellwig	a3da789640	[XFS] cleanup vnode use in xfs_link SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30547a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:35 +10:00
Christoph Hellwig	979ebab116	[XFS] cleanup vnode use in xfs_create/mknod/mkdir SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30546a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:25 +10:00
Christoph Hellwig	bc4ac74a4e	[XFS] cleanup vnode use in dmapi calls SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30545a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:15 +10:00
David Chinner	d234154125	[XFS] Use power-of-2 sized buffers to reduce overhead Now that the ktrace_enter() code is using atomics, the non-power-of-2 buffer sizes - which require modulus operations to get the index - are showing up as using substantial CPU in the profiles. Force the buffer sizes to be rounded up to the nearest power of two and use masking rather than modulus operations to convert the index counter to the buffer index. This reduces ktrace_enter overhead to 8% of a CPU time, and again almost halves the trace intensive test runtime. SGI-PV: 977546 SGI-Modid: xfs-linux-melb:xfs-kern:30538a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:40:04 +10:00
David Chinner	6ee4752ffe	[XFS] Use atomic counters for ktrace buffer indexes ktrace_enter() is consuming vast amounts of CPU time due to the use of a single global lock for protecting buffer index increments. Change it to use per-buffer atomic counters - this reduces ktrace_enter() overhead during a trace intensive test on a 4p machine from 58% of all CPU time to 12% and halves test runtime. SGI-PV: 977546 SGI-Modid: xfs-linux-melb:xfs-kern:30537a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:55 +10:00
David Chinner	44d814ced4	[XFS] Update c/mtime correctly on truncates XFS changes the c/mtime of an inode when truncating it to the same size. The c/mtime is only supposed to change if the size is changed. Not to be confused with ftruncate, where the c/mtime is supposed to be changed even if the size is not changed. The Linux VFS encodes this semantic difference in the flags it sends down to ->setattr, which XFS currently ignores. We need to make XFS pay attention to the VFS flags and hence Do The Right Thing. SGI-PV: 977547 SGI-Modid: xfs-linux-melb:xfs-kern:30536a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:45 +10:00
Christoph Hellwig	24bd861d1c	[XFS] don't encode parent in nfs filehandles unless nessecary As Dave pointed out after the export ops changes we now always encode the parent into the filehandle for regular files, but it's not actually needed when the filesystem is export with no_subtree_check. This one-liner fixes xfs_fs_encode_fh to skip encoding the parent unless nessecary. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30535a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:35 +10:00
Christoph Hellwig	126468b115	[XFS] kill xfs_rwlock/xfs_rwunlock We can just use xfs_ilock/xfs_iunlock instead and get rid of the ugly bhv_vrwlock_t. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30533a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:25 +10:00
Christoph Hellwig	43973964a3	[XFS] kill xfs_get_dir_entry Instead of of xfs_get_dir_entry use a macro to get the xfs_inode from the dentry in the callers and grab the reference manually. Only grab the reference once as it's fine to keep it over the dmapi calls. (And even that reference is actually superflous in Linux but I'll leave that for another patch) SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30531a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:14 +10:00
Christoph Hellwig	a8b3acd57e	[XFS] vnode cleanup in xfs_fs_subr.c Cleanup the unneeded intermediate vnode step in the flushing helpers and go directly from the xfs_inode to the struct address_space. SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30530a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:39:03 +10:00
Christoph Hellwig	db0bb7baa1	[XFS] cleanup xfs_vn_mknod - use proper goto based unwinding instead of the current mess of multiple conditionals - rename ip to inode because that's the normal convention for Linux inodes while ip is the convention for xfs_inodes - remove unlikely checks for the default_acl - branches marked unlikely might lead to extreme branch bredictor slowdons if taken and for some workloads a default acl is quite common - properly indent the switch statements - remove xfs_has_fs_struct as nfsd has a fs_struct in any semi-recent kernel SGI-PV: 976035 SGI-Modid: xfs-linux-melb:xfs-kern:30529a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:38:53 +10:00
David Chinner	155cc6b784	[XFS] Use atomics for iclog reference counting Now that we update the log tail LSN less frequently on transaction completion, we pass the contention straight to the global log state lock (l_iclog_lock) during transaction completion. We currently have to take this lock to decrement the iclog reference count. there is a reference count on each iclog, so we need to take �he global lock for all refcount changes. When large numbers of processes are all doing small trnasctions, the iclog reference counts will be quite high, and the state change that absolutely requires the l_iclog_lock is the except rather than the norm. Change the reference counting on the iclogs to use atomic_inc/dec so that we can use atomic_dec_and_lock during transaction completion and avoid the need for grabbing the l_iclog_lock for every reference count decrement except the one that matters - the last. SGI-PV: 975671 SGI-Modid: xfs-linux-melb:xfs-kern:30505a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:38:10 +10:00
David Chinner	b589334c7a	[XFS] Prevent AIL lock contention during transaction completion When hundreds of processors attempt to commit transactions at the same time, they can contend on the AIL lock when updating the tail LSN held in the in-core log structure. At the moment, the tail LSN is only needed when actually writing out an iclog, so it really does not need to be updated on every single transaction completion - only those that result in switching iclogs and flushing them to disk. The result is that we reduce the number of times we need to grab the AIL lock and the log grant lock by up to two orders of magnitude on large processor count machines. The problem has previously been hidden by AIL lock contention walking the AIL list which was recently solved and uncovered this issue. SGI-PV: 975671 SGI-Modid: xfs-linux-melb:xfs-kern:30504a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:38:01 +10:00
David Chinner	3354040897	[XFS] Use xfs_inode_clean() in more places Remove open coded checks for the whether the inode is clean and replace them with an inlined function. SGI-PV: 977461 SGI-Modid: xfs-linux-melb:xfs-kern:30503a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:37:51 +10:00
David Chinner	bad5584332	[XFS] Remove the xfs_icluster structure Remove the xfs_icluster structure and replace with a radix tree lookup. We don't need to keep a list of inodes in each cluster around anymore as we can look them up quickly when we need to. The only time we need to do this now is during inode writeback. Factor the inode cluster writeback code out of xfs_iflush and convert it to use radix_tree_gang_lookup() instead of walking a list of inodes built when we first read in the inodes. This remove 3 pointers from each xfs_inode structure and the xfs_icluster structure per inode cluster. Hence we reduce the cache footprint of the xfs_inodes by between 5-10% depending on cluster sparseness. To be truly efficient we need a radix_tree_gang_lookup_range() call to stop searching once we are past the end of the cluster instead of trying to find a full cluster's worth of inodes. Before (ia64): $ cat /sys/slab/xfs_inode/object_size 536 After: $ cat /sys/slab/xfs_inode/object_size 512 SGI-PV: 977460 SGI-Modid: xfs-linux-melb:xfs-kern:30502a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:37:41 +10:00
David Chinner	a3f74ffb6d	[XFS] Don't block pdflush when writing back inodes When pdflush is writing back inodes, it can get stuck on inode cluster buffers that are currently under I/O. This occurs when we write data to multiple inodes in the same inode cluster at the same time. Effectively, delayed allocation marks the inode dirty during the data writeback. Hence if the inode cluster was flushed during the writeback of the first inode, the writeback of the second inode will block waiting for the inode cluster write to complete before writing it again for the newly dirtied inode. Basically, we want to avoid this from happening so we don't block pdflush and slow down all of writeback. Hence we introduce a non-blocking async inode flush flag that pdflush uses. If this flag is set, we use non-blocking operations (e.g. try locks) whereever we can to avoid blocking or extra I/O being issued. SGI-PV: 970925 SGI-Modid: xfs-linux-melb:xfs-kern:30501a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:37:32 +10:00
David Chinner	4ae29b4321	[XFS] Factor xfs_itobp() and xfs_inotobp(). The only difference between the functions is one passes an inode for the lookup, the other passes an inode number. However, they don't do the same validity checking or set all the same state on the buffer that is returned yet they should. Factor the functions into a common implementation. SGI-PV: 970925 SGI-Modid: xfs-linux-melb:xfs-kern:30500a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:37:19 +10:00
Lachlan McIlroy	e9a56b7cda	[XFS] Fix regression due to refcache removal SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30490a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Donald Douwsma <donaldd@sgi.com>	2008-04-18 11:37:06 +10:00
Donald Douwsma	163d3686bb	[XFS] Remove the xfs_refcache Remove the xfs_refcache, it was only needed while we were still building for 2.4 kernels. SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30472a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:36:55 +10:00
Lachlan McIlroy	461aa8a225	[XFS] make inode reclaim synchronise with xfs_iflush_done() On a forced shutdown, xfs_finish_reclaim() will skip flushing the inode. If the inode flush lock is not already held and there is an outstanding xfs_iflush_done() then we might free the inode prematurely. By acquiring and releasing the flush lock we will synchronise with xfs_iflush_done(). SGI-PV: 909874 SGI-Modid: xfs-linux-melb:xfs-kern:30468a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <dgc@sgi.com>	2008-04-18 11:34:54 +10:00
Niv Sardi	e12070a5dc	[XFS] actually check error returned by xfs_flush_pages, clean up and bailout if fails. SGI-PV: 973041 SGI-Modid: xfs-linux-melb:xfs-kern:30462a Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>	2008-04-18 11:34:47 +10:00

... 2 3 4 5 6 ...

8680 Commits