linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Huang Weiyi	292ec4cf35	xfs: xfs_trace.c: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_trace.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:24 -05:00
Dave Chinner	07f1a4f5e8	xfs: Check new inode size is OK before preallocating The new xfsqa test 228 tries to preallocate more space than the filesystem contains. it should fail, but instead triggers an assert about lock flags. The failure is due to the size extension failing in vmtruncate() due to rlimit being set. Check this before we start the preallocation to avoid allocating space that will never be used. Also the path through xfs_vn_allocate already holds the IO lock, so it should not be present in the lock flags when the setattr fails. Hence the assert needs to take this into account. This will prevent other such callers from hitting this incorrect ASSERT. (Fixed a reference to "newsize" to read "new_size". -Alex) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:12 -05:00
Christoph Hellwig	7ea8085910	drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:05:02 -04:00
Alex Elder	88e88374ee	Merge branch 'delayed-logging-for-2.6.35' into for-linus	2010-05-24 11:57:36 -05:00
Dave Chinner	71e330b593	xfs: Introduce delayed logging core code The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and it's location in the Commit Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking space log required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. This needs to be accounted to the transaction in progress, and the accounting is made more complex becase we need also to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need ot be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during Io completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, an this provides the necessary synchronisation for correct operation in both cases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:03 -05:00
Dave Chinner	ed3b4d6cdc	xfs: Improve scalability of busy extent tracking When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots that are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incures a memory allocation when marking an extent busy, but should not occur too often in low memory situations. This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather issuing a log force. Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), then buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:34:00 -05:00
Dave Chinner	c11554104f	xfs: Clean up XFS_BLI_* flag namespace Clean up the buffer log format (XFS_BLI_) flags because they have a polluted namespace. They XFS_BLI_ prefix is used for both in-memory and on-disk flag feilds, but have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_ prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:39 -05:00
Linus Torvalds	e8bebe2f71	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode , kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)	2010-05-21 19:37:45 -07:00
Stephen Hemminger	46e58764f0	xfs: constify xattr_handler Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:19 -04:00
Jens Axboe	ee9a3607fb	Merge branch 'master' into for-2.6.35 Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:27:26 +02:00
Christoph Hellwig	c472b43275	quota: unify ->set_dqblk Pass the larger struct fs_disk_quota to the ->set_dqblk operation so that the Q_SETQUOTA and Q_XSETQUOTA operations can be implemented with a single filesystem operation and we can retire the ->set_xquota operation. The additional information (RT-subvolume accounting and warn counts) are left zero for the VFS quota implementation. Add new fieldmask values for setting the numer of blocks and inodes values which is required for the VFS quota, but wasn't for XFS. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-21 19:30:44 +02:00
Christoph Hellwig	b9b2dd36c1	quota: unify ->get_dqblk Pass the larger struct fs_disk_quota to the ->get_dqblk operation so that the Q_GETQUOTA and Q_XGETQUOTA operations can be implemented with a single filesystem operation and we can retire the ->get_xquota operation. The additional information (RT-subvolume accounting and warn counts) are left zero for the VFS quota implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-21 19:30:43 +02:00
Christoph Hellwig	bd1556a146	xfs: clean up end index calculation in xfs_page_state_convert Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:20 -05:00
Christoph Hellwig	2b8f12b7e4	xfs: clean up mapping size calculation in __xfs_get_blocks Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:19 -05:00
Christoph Hellwig	558e689169	xfs: clean up xfs_iomap_valid Rename all iomap_valid identifiers to imap_valid to fit the new world order, and clean up xfs_iomap_valid to convert the passed in offset to blocks instead of the imap values to bytes. Use the simpler inode->i_blkbits instead of the XFS macros for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:19 -05:00
Christoph Hellwig	34a52c6c06	xfs: move I/O type flags into xfs_aops.c The IOMAP_ flags are now only used inside xfs_aops.c for extent probing and I/O completion tracking, so more them here, and rename them to IO_* as there's no mapping involved at all. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	207d041602	xfs: kill struct xfs_iomap Now that struct xfs_iomap contains exactly the same units as struct xfs_bmbt_irec we can just use the latter directly in the aops code. Replace the missing IOMAP_NEW flag with a new boolean output parameter to xfs_iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	e513182d4d	xfs: report iomap_bn in block base Report the iomap_bn field of struct xfs_iomap in terms of filesystem blocks instead of in terms of bytes. Shift the byte conversions into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	8699bb0a48	xfs: report iomap_offset and iomap_bsize in block base Report the iomap_offset and iomap_bsize fields of struct xfs_iomap in terms of fsblocks instead of in terms of disk blocks. Shift the byte conversions into the callers temporarily, but they will disappear or get cleaned up later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	9563b3d899	xfs: remove iomap_delta The iomap_delta field in struct xfs_iomap just contains the difference between the offset passed to xfs_iomap and the iomap_offset. Just calculate it in the only caller that cares. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:17 -05:00
Christoph Hellwig	046f1685bb	xfs: remove iomap_target Instead of using the iomap_target field in struct xfs_iomap and the IOMAP_REALTIME flag just use the already existing xfs_find_bdev_for_inode helper. There's some fallout as we need to pass the inode in a few more places, which we also use to sanitize some calling conventions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:16 -05:00
Tao Ma	2d1ff3c75a	xfs: Make fiemap work in query mode. According to Documentation/filesystems/fiemap.txt, If fm_extent_count is zero, then the fm_extents[] array is ignored (no extents will be returned), and the fm_mapped_extents count will hold the number of extents needed. But as the commit `97db39a1f6` has changed bmv_count to the caller's input buffer, this number query function can't work any more. As this commit is written to change bmv_count from MAXEXTNUM because of ENOMEM. This patch just try to set bm.bmv_count to something sane. Thanks to Dave Chinner <david@fromorbit.com> for the suggestion. Cc: Eric Sandeen <sandeen@redhat.com> Cc: Alex Elder <aelder@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tao Ma <tao.ma@oracle.com>	2010-05-19 09:58:16 -05:00
Christoph Hellwig	37bc5743fd	xfs: wait for direct I/O to complete in fsync and write_inode We need to wait for all pending direct I/O requests before taking care of metadata in fsync and write_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:14 -05:00
Andrea Gelmini	fce1cad651	xfs: xfs_trace.c: duplicated include fs/xfs/linux-2.6/xfs_trace.c: xfs_attr_sf.h is included more than once. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:14 -05:00
Christoph Hellwig	8c38366f99	xfs: enforce synchronous writes in xfs_bwrite xfs_bwrite is used with the intention of synchronously writing out buffers, but currently it does not actually clear the async flag if that's left from previous writes but instead implements async behaviour if it finds it. Remove the code handling asynchronous writes as we've got rid of those entirely outside of the log and delwri buffers, and make sure that we clear the async and read flags before writing the buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:13 -05:00
Christoph Hellwig	df308bcfec	xfs: remove periodic superblock writeback All modifications to the superblock are done transactional through xfs_trans_log_buf, so there is no reason to initiate periodic asynchronous writeback. This only removes the superblock from the delwri list and will lead to sub-optimal I/O scheduling. Cut down xfs_sync_fsdata now that it's only used for synchronous superblock writes and move the log coverage checks into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-19 09:58:13 -05:00
Dave Chinner	e6a81f13aa	xfs: convert the dquot hash list to use list heads Convert the dquot hash list on the filesystem to use listhead infrastructure rather than the roll-your-own in the quota code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:11 -05:00
Dave Chinner	368e136174	xfs: remove duplicate code from dquot reclaim The dquot shaker and the free-list reclaim code use exactly the same algorithm but the code is duplicated and slightly different in each case. Make the shaker code use the single dquot reclaim code to remove the code duplication. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:11 -05:00
Dave Chinner	9abbc539bf	xfs: add log item recovery tracing Currently there is no tracing in log recovery, so it is difficult to determine what is going on when something goes wrong. Add tracing for log item recovery to provide visibility into the log recovery process. The tracing added shows regions being extracted from the log transactions and added to the transaction hash forming recovery items, followed by the reordering, cancelling and finally recovery of the items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:10 -05:00
Dave Chinner	4aaf15d1aa	xfs: Add inode pin counts to traces We don't record pin counts in inode events right now, and this makes it difficult to track down problems related to pinning inodes. Add the pin count to the inode trace class and add trace events for pinning and unpinning inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-05-19 09:58:08 -05:00
Jan Engelhardt	e2a07812e9	xfs: add blockdev name to kthreads This allows to see in `ps` and similar tools which kthreads are allotted to which block device/filesystem, similar to what jbd2 does. As the process name is a fixed 16-char array, no extra space is needed in tasks. PID TTY STAT TIME COMMAND 2 ? S 0:00 [kthreadd] 197 ? S 0:00 \_ [jbd2/sda2-8] 198 ? S 0:00 \_ [ext4-dio-unwrit] 204 ? S 0:00 \_ [flush-8:0] 2647 ? S 0:00 \_ [xfs_mru_cache] 2648 ? S 0:00 \_ [xfslogd/0] 2649 ? S 0:00 \_ [xfsdatad/0] 2650 ? S 0:00 \_ [xfsconvertd/0] 2651 ? S 0:00 \_ [xfsbufd/ram0] 2652 ? S 0:00 \_ [xfsaild/ram0] 2653 ? S 0:00 \_ [xfssyncd/ram0] Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:07 -05:00
Zhitong Wang	fda168c245	xfs: Fix integer overflow in fs/xfs/linux-2.6/xfs_ioctl*.c The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface is not bounded correctly. The opcount is used to determine the size of the buffer required. The size is bounded, but can overflow and so the size checks may not be sufficient to catch invalid opcounts. Fix it by catching opcount values that would cause overflows before calculating the size. Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com> Reviewed-by: Dave Chinner <david@fromorbit.com>	2010-05-19 09:58:07 -05:00
Dave Chinner	9bf729c0af	xfs: add a shrinker to background inode reclaim On low memory boxes or those with highmem, kernel can OOM before the background reclaims inodes via xfssyncd. Add a shrinker to run inode reclaim so that it inode reclaim is expedited when memory is low. This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-04-29 16:22:13 -05:00
Jens Axboe	7407cf355f	Merge branch 'master' into for-2.6.35 Conflicts: fs/block_dev.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-29 09:36:24 +02:00
Dmitry Monakhov	fbd9b09a17	blkdev: generalize flags for blkdev_issue_fn functions The patch just convert all blkdev_issue_xxx function to common set of flags. Wait/allocation semantics preserved. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-28 19:47:36 +02:00
Dave Chinner	f1d486a361	xfs: don't warn on EAGAIN in inode reclaim Any inode reclaim flush that returns EAGAIN will result in the inode reclaim being attempted again later. There is no need to issue a warning into the logs about this situation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-04-16 13:51:44 -05:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Dave Chinner	e8c3753ce4	xfs: don't warn about page discards on shutdown If we are doing a forced shutdown, we can get lots of noise about delalloc pages being discarded. This is happens by design during a forced shutdown, so don't spam the logs with these messages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:53 -05:00
Alex Elder	8a262e573d	xfs: use scalable vmap API Re-apply a commit that had been reverted due to regressions that have since been fixed. From `95f8e302c0` Mon Sep 17 00:00:00 2001 From: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:43:09 +1100 Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (`db64fe02`) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Only modifications here were a minor reformat, plus making the patch apply given the new use of xfs_buf_is_vmapped(). Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:36 -05:00
Alex Elder	cd9640a70d	xfs: remove old vmap cache Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: `d2859751cd` Author: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:40:44 +1100 XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much). Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> The only change I made was to use the "new" xfs_buf_is_vmapped() function in a place it had been open-coded in the original. Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-16 15:40:19 -05:00
Linus Torvalds	66ce3cf84d	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (21 commits) xfs: return inode fork offset in bulkstat for fsr xfs: Increase the default size of the reserved blocks pool xfs: truncate delalloc extents when IO fails in writeback xfs: check for more work before sleeping in xfssyncd xfs: Fix a build warning in xfs_aops.c xfs: fix locking for inode cache radix tree tag updates xfs: remove xfs_ipin/xfs_iunpin xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait xfs: kill xfs_lrw.h xfs: factor common xfs_trans_bjoin code xfs: stop passing opaque handles to xfs_log.c routines xfs: split xfs_bmap_btalloc xfs: fix xfs_fsblock_t tracing xfs: fix inode pincount check in fsync xfs: Non-blocking inode locking in IO completion xfs: implement optimized fdatasync xfs: remove wrapper for the fsync file operation xfs: remove wrappers for read/write file operations xfs: merge xfs_lrw.c into xfs_file.c xfs: fix dquota trace format ...	2010-03-06 11:32:21 -08:00
Linus Torvalds	05c5cb31ec	Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux * 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits) nfsd4: fix minor memory leak svcrpc: treat uid's as unsigned nfsd: ensure sockets are closed on error Revert "sunrpc: move the close processing after do recvfrom method" Revert "sunrpc: fix peername failed on closed listener" sunrpc: remove unnecessary svc_xprt_put NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN xfs_export_operations.commit_metadata commit_metadata export operation replacing nfsd_sync_dir lockd: don't clear sm_monitored on nsm_reboot_lookup lockd: release reference to nsm_handle in nlm_host_rebooted nfsd: Use vfs_fsync_range() in nfsd_commit NFSD: Create PF_INET6 listener in write_ports SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found" SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt() NFSD: Support AF_INET6 in svc_addsock() function SUNRPC: Use rpc_pton() in ip_map_parse() nfsd: 4.1 has an rfc number nfsd41: Create the recovery entry for the NFSv4.1 client nfsd: use vfs_fsync for non-directories ...	2010-03-06 11:31:38 -08:00
Linus Torvalds	e213e26ab3	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits) quota: stop using QUOTA_OK / NO_QUOTA dquot: cleanup dquot initialize routine dquot: move dquot initialization responsibility into the filesystem dquot: cleanup dquot drop routine dquot: move dquot drop responsibility into the filesystem dquot: cleanup dquot transfer routine dquot: move dquot transfer responsibility into the filesystem dquot: cleanup inode allocation / freeing routines dquot: cleanup space allocation / freeing routines ext3: add writepage sanity checks ext3: Truncate allocated blocks if direct IO write fails to update i_size quota: Properly invalidate caches even for filesystems with blocksize < pagesize quota: generalize quota transfer interface quota: sb_quota state flags cleanup jbd: Delay discarding buffers in journal_unmap_buffer ext3: quota_write cross block boundary behaviour quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota quota: split out compat_sys_quotactl support from quota.c quota: split out netlink notification support from quota.c quota: remove invalid optimization from quota_sync_all ... Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c	2010-03-05 13:20:53 -08:00
Christoph Hellwig	a9185b41a4	pass writeback_control to ->write_inode This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by beeing able to distinguish between the different callers in more detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-05 13:25:52 -05:00
Christoph Hellwig	26821ed40b	make sure data is on disk before calling ->write_inode Similar to the fsync issue fixed a while ago in commit `2daea67e96` we need to write for data to actually hit the disk before writing out the metadata to guarantee data integrity for filesystems that modify the inode in the data I/O completion path. Currently XFS and NFS handle this manually, and AFS has a write_inode method that does nothing but waiting for data, while others are possibly missing out on this. Fortunately this change has a lot less impact than the fsync change as none of the write_inode methods starts data writeout of any form by itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-05 13:25:10 -05:00
Alex Elder	9b1f56d60a	Merge branch 'for-2.6.34-rc1-batch2' into for-linus	2010-03-05 11:45:03 -06:00
Dave Chinner	3ed3a4343b	xfs: truncate delalloc extents when IO fails in writeback We currently use block_invalidatepage() to clean up pages where I/O fails in ->writepage(). Unfortunately, if the page has delalloc regions on it, we fail to remove the delalloc regions when we invalidate the page. This can result in tripping a BUG() in xfs_get_blocks() later on if a direct IO read is done on that same region - the delalloc extent is returned when none is supposed to be there. Fix this by truncating away the delalloc regions on the page before invalidating it. Because they are delalloc, we can do this without needing a transaction. Indeed - if we get ENOSPC errors, we have to be able to do this truncation without a transaction as there is no space left for block reservation (typically why we see a ENOSPC in writeback). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:53 -06:00
Dave Chinner	20f6b2c785	xfs: check for more work before sleeping in xfssyncd xfssyncd processes a queue of work by detaching the queue and then iterating over all the work items. It then sleeps for a time period or until new work comes in. If new work is queued while xfssyncd is actively processing the detached work queue, it will not process that new work until after a sleep timeout or the next work event queued wakes it. Fix this by checking the work queue again before going to sleep. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:45 -06:00
Dave Chinner	694189328a	xfs: Fix a build warning in xfs_aops.c Fix a build warning that slipped through. Dave Chinner had posted an updated version of his patch but the previous version--without this fix--was what got committed. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-05 11:01:22 -06:00
Christoph Hellwig	ac0e773718	quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota We already do these checks in the generic code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-05 00:20:25 +01:00
Christoph Hellwig	8c4e4acd66	quota: clean up Q_XQUOTASYNC Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something entirely different in it than the rest of the filesystems. xfs_quota which calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed allocations, while the "VFS" quota support wants to flush changes to the quota file. So make Q_XQUOTASYNC call into the writeback code directly and make the quota_sync method optional as XFS doesn't need in the sense expected by the rest of the quota code. GFS2 was using limited XFS-style quota and has a quota_sync method fitting neither the style used by vfs_quota_sync nor xfs_fs_quota_sync. I left it in for now as per discussion with Steve it expects to be called from the sync path this way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2010-03-05 00:20:24 +01:00
J. Bruce Fields	4ea41e2de5	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs into for-2.6.34-incoming Resolve merge conflict in fs/xfs/linux-2.6/xfs_export.c.	2010-03-04 12:04:51 -05:00
Christoph Hellwig	f1f724e4b5	xfs: fix locking for inode cache radix tree tag updates The radix-tree code requires it's users to serialize tag updates against other updates to the tree. While XFS protects tag updates against each other it does not serialize them against updates of the tree contents, which can lead to tag corruption. Fix the inode cache to always take pag_ici_lock in exclusive mode when updating radix tree tags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Patrick Schreurs <patrick@news-service.com> Tested-by: Patrick Schreurs <patrick@news-service.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 19:14:36 -06:00
Christoph Hellwig	d7658d487f	xfs: kill xfs_lrw.h Move the two declarations to better fitting headers now that xfs_lrw.c is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:44 -06:00
Christoph Hellwig	f7008d0aeb	xfs: fix xfs_fsblock_t tracing Using a static buffer in xfs_fmtfsblock means we can corrupt traces if multiple CPUs hit this code path at the same. Just remove xfs_fmtfsblock for now and print the block number purely numerical. If we want the NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be a decoding plugin in the trace-cmd userspace command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:17 -06:00
Christoph Hellwig	024910cbac	xfs: fix inode pincount check in fsync We need to hold the ilock to check the inode pincount safely. While we're at it also remove the check for ip->i_itemp->ili_last_lsn, a pinned inode always has it set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:35:10 -06:00
Dave Chinner	77d7a0c2ee	xfs: Non-blocking inode locking in IO completion The introduction of barriers to loop devices has created a new IO order completion dependency that XFS does not handle. The loop device implements barriers using fsync and so turns a log IO in the XFS filesystem on the loop device into a data IO in the backing filesystem. That is, the completion of log IOs in the loop filesystem are now dependent on completion of data IO in the backing filesystem. This can cause deadlocks when a flush daemon issues a log force with an inode locked because the IO completion of IO on the inode is blocked by the inode lock. This in turn prevents further data IO completion from occuring on all XFS filesystems on that CPU (due to the shared nature of the completion queues). This then prevents the log IO from completing because the log is waiting for data IO completion as well. The fix for this new completion order dependency issue is to make the IO completion inode locking non-blocking. If the inode lock can't be grabbed, simply requeue the IO completion back to the work queue so that it can be processed later. This prevents the completion queue from being blocked and allows data IO completion on other inodes to proceed, hence avoiding completion order dependent deadlocks. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:52 -06:00
Christoph Hellwig	66d834ea60	xfs: implement optimized fdatasync Allow us to track the difference between timestamp and size updates by using mark_inode_dirty from the I/O completion code, and checking the VFS inode flags in xfs_file_fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:45 -06:00
Christoph Hellwig	fd3200bef7	xfs: remove wrapper for the fsync file operation Currently the fsync file operation is divided into a low-level routine doing all the work and one that implements the Linux file operation and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a bit, as well as preparing for the implementation of an optimized fdatasync which needs to look at the Linux inode state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:38 -06:00
Christoph Hellwig	00258e36b2	xfs: remove wrappers for read/write file operations Currently the aio_read, aio_write, splice_read and splice_write file operations are divided into a low-level routine doing all the work and one that implements the Linux file operations and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:29 -06:00
Christoph Hellwig	dda35b8f84	xfs: merge xfs_lrw.c into xfs_file.c Currently the code to implement the file operations is split over two small files. Merge the content of xfs_lrw.c into xfs_file.c to have it in one place. Note that I haven't done various cleanups that are possible after this yet, they will follow in the next patch. Also the function xfs_dev_is_read_only which was in xfs_lrw.c before really doesn't fit in here at all and was moved to xfs_mount.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:18 -06:00
Christoph Hellwig	b262e5dfd9	xfs: fix dquota trace format The be32_to_cpu in the TP_printk output breaks automatic parsing of the trace format by the trace-cmd tools, so we have to move it into the TP_assign block. While we're at it also fix the format for the quota limits to more regular and easier parseable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:34:11 -06:00
Eric Sandeen	a9cc799eca	xfs: increase readdir buffer size While doing some testing of readdir perf a while back, I noticed that the buffer size we're using internally is smaller than what glibc gives us by default. Upping this size helped a bit, and seems safe. glibc's __alloc_dir() does: const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : 4 * BUFSIZ); const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : BUFSIZ); size_t allocation = default_allocation; #ifdef _STATBUF_ST_BLKSIZE if (statp != NULL && default_allocation < statp->st_blksize) allocation = statp->st_blksize; #endif and #define _G_BUFSIZ 8192 #define _IO_BUFSIZ _G_BUFSIZ # define BUFSIZ _IO_BUFSIZ so the default buffer is 4 * 8192 = 32768 (except in the unlikely case of blocks > 32k....) Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-03-01 16:33:41 -06:00
Linus Torvalds	b305956abc	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits) fs/xfs: Correct NULL test xfs: optimize log flushing in xfs_fsync xfs: only clear the suid bit once in xfs_write xfs: kill xfs_bawrite xfs: log changed inodes instead of writing them synchronously xfs: remove invalid barrier optimization from xfs_fsync xfs: kill the unused XFS_QMOPT_* flush flags V2 xfs: Use delay write promotion for dquot flushing xfs: Sort delayed write buffers before dispatch xfs: Don't issue buffer IO direct from AIL push V2 xfs: Use delayed write for inodes rather than async V2 xfs: Make inode reclaim states explicit xfs: more reserved blocks fixups xfs: turn off sign warnings xfs: don't hold onto reserved blocks on remount,ro xfs: quota limit statvfs available blocks xfs: replace KM_LARGE with explicit vmalloc use xfs: cleanup up xfs_log_force calling conventions xfs: kill XLOG_VEC_SET_TYPE xfs: remove duplicate buffer flags ...	2010-02-26 17:18:52 -08:00
Linus Torvalds	f24407d2bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt: xfs: fix xfs to work with Virtually Indexed architectures sh: add mm API for DMA to vmalloc/vmap areas arm: add mm API for DMA to vmalloc/vmap areas parisc: add mm API for DMA to vmalloc/vmap areas mm: add coherence API for DMA to vmalloc/vmap areas	2010-02-26 17:05:10 -08:00
Ben Myers	978ebd97d1	xfs_export_operations.commit_metadata This is the commit_metadata export operation for XFS. - Takes one inode to be committed. - Forces the log up to the lsn of the inode. - Doesn't force the log if the inode doesn't have a pincount. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> [bfields@citi.umich.edu: trivial whitespace fix] Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-02-20 13:14:50 -08:00
Christoph Hellwig	87185517de	xfs: only clear the suid bit once in xfs_write file_remove_suid already calls into ->setattr to clear the suid and sgid bits if needed, no need to start a second transaction to do it ourselves. Note that xfs_write_clear_setuid issues a sync transaction while the path through ->setattr doesn't, but that is consistant with the other filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-12 13:43:57 -06:00
James Bottomley	73c77e2ccc	xfs: fix xfs to work with Virtually Indexed architectures xfs_buf.c includes what is essentially a hand rolled version of blk_rq_map_kern(). In order to work properly with the vmalloc buffers that xfs uses, this hand rolled routine must also implement the flushing API for vmap/vmalloc areas. [style updates from hch@lst.de] Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-02-05 12:32:35 -06:00
Dave Chinner	5322892d86	xfs: kill xfs_bawrite There are no more users of this function left in the XFS code now that we've switched everything to delayed write flushing. Remove it. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-04 10:09:14 +11:00
Christoph Hellwig	07fec73625	xfs: log changed inodes instead of writing them synchronously When an inode has already be flushed delayed write, xfs_inode_clean() returns true and hence xfs_fs_write_inode() can return on a synchronous inode write without having written the inode. Currently these sycnhronous writes only come sync(1), unmount, a sycnhronous NFS export and cachefiles so should be relatively rare and out of common performance paths. Realistically, a synchronous inode write is not necessary here; we can avoid writing the inode by logging any non-transactional changes that are pending. This needs to be done with synchronous transactions, but it avoids seeking between the log and inode clusters as we do now. We don't force the log if the inode is pinned, though, so this differs from the fsync case. For normal sys_sync and unmount behaviour this is fine because we do a synchronous log force in xfs_sync_data which is called from the ->sync_fs code. It does however break the NFS synchronous export guarantees for now, but work is under way to fix this at a higher level or for the higher level to provide an additional flag in the writeback control to tell us that a log force is needed. Portions of this patch are based on work from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2010-02-09 11:43:49 +11:00
Dave Chinner	089716aa14	xfs: Sort delayed write buffers before dispatch Currently when the xfsbufd writes delayed write buffers, it pushes them to disk in the order they come off the delayed write list. If there are lots of buffers ѕpread widely over the disk, this results in overwhelming the elevator sort queues in the block layer and we end up losing the posibility of merging adjacent buffers to minimise the number of IOs. Use the new generic list_sort function to sort the delwri dispatch queue before issue to ensure that the buffers are pushed in the most friendly order possible to the lower layers. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:13:25 +11:00
Dave Chinner	d808f617ad	xfs: Don't issue buffer IO direct from AIL push V2 All buffers logged into the AIL are marked as delayed write. When the AIL needs to push the buffer out, it issues an async write of the buffer. This means that IO patterns are dependent on the order of buffers in the AIL. Instead of flushing the buffer, promote the buffer in the delayed write list so that the next time the xfsbufd is run the buffer will be flushed by the xfsbufd. Return the state to the xfsaild that the buffer was promoted so that the xfsaild knows that it needs to cause the xfsbufd to run to flush the buffers that were promoted. Using the xfsbufd for issuing the IO allows us to dispatch all buffer IO from the one queue. This means that we can make much more enlightened decisions on what order to flush buffers to disk as we don't have multiple places issuing IO. Optimisations to xfsbufd will be in a future patch. Version 2 - kill XFS_ITEM_FLUSHING as it is now unused. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-02 10:13:42 +11:00
Dave Chinner	c854363e80	xfs: Use delayed write for inodes rather than async V2 We currently do background inode flush asynchronously, resulting in inodes being written in whatever order the background writeback issues them. Not only that, there are also blocking and non-blocking asynchronous inode flushes, depending on where the flush comes from. This patch completely removes asynchronous inode writeback. It removes all the strange writeback modes and replaces them with either a synchronous flush or a non-blocking delayed write flush. That is, inode flushes will only issue IO directly if they are synchronous, and background flushing may do nothing if the operation would block (e.g. on a pinned inode or buffer lock). Delayed write flushes will now result in the inode buffer sitting in the delwri queue of the buffer cache to be flushed by either an AIL push or by the xfsbufd timing out the buffer. This will allow accumulation of dirty inode buffers in memory and allow optimisation of inode cluster writeback at the xfsbufd level where we have much greater queue depths than the block layer elevators. We will also get adjacent inode cluster buffer IO merging for free when a later patch in the series allows sorting of the delayed write buffers before dispatch. This effectively means that any inode that is written back by background writeback will be seen as flush locked during AIL pushing, and will result in the buffers being pushed from there. This writeback path is currently non-optimal, but the next patch in the series will fix that problem. A side effect of this delayed write mechanism is that background inode reclaim will no longer directly flush inodes, nor can it wait on the flush lock. The result is that inode reclaim must leave the inode in the reclaimable state until it is clean. Hence attempts to reclaim a dirty inode in the background will simply skip the inode until it is clean and this allows other mechanisms (i.e. xfsbufd) to do more optimal writeback of the dirty buffers. As a result, the inode reclaim code has been rewritten so that it no longer relies on the ambiguous return values of xfs_iflush() to determine whether it is safe to reclaim an inode. Portions of this patch are derived from patches by Christoph Hellwig. Version 2: - cleanup reclaim code as suggested by Christoph - log background reclaim inode flush errors - just pass sync flags to xfs_iflush Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-06 12:39:36 +11:00
Dave Chinner	777df5afdb	xfs: Make inode reclaim states explicit A.K.A.: don't rely on xfs_iflush() return value in reclaim We have gradually been moving checks out of the reclaim code because they are duplicated in xfs_iflush(). We've had a history of problems in this area, and many of them stem from the overloading of the return values from xfs_iflush() and interaction with inode flush locking to determine if the inode is safe to reclaim. With the desire to move to delayed write flushing of inodes and non-blocking inode tree reclaim walks, the overloading of the return value of xfs_iflush makes it very difficult to determine the correct thing to do next. This patch explicitly re-adds the checks to the inode reclaim code, removing the reliance on the return value of xfs_iflush() to determine what to do next. It also means that we can clearly document all the inode states that reclaim must handle and hence we can easily see that we handled all the necessary cases. This also removes the need for the xfs_inode_clean() check in xfs_iflush() as all callers now check this first (safely). Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-02-06 12:37:26 +11:00
Eric Sandeen	d5db0f97fb	xfs: more reserved blocks fixups This mangles the reserved blocks counts a little more. 1) add a helper function for the default reserved count 2) add helper functions to save/restore counts on ro/rw 3) save/restore reserved blocks on freeze/thaw 4) disallow changing reserved count while readonly V2: changed field name to match Dave's changes Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-02-08 17:41:48 -06:00
Dave Chinner	cbe132a8bd	xfs: don't hold onto reserved blocks on remount,ro If we hold onto reserved blocks when doing a remount,ro we end up writing the blocks used count to disk that includes the reserved blocks. Reserved blocks are not actually used, so this results in the values in the superblock being incorrect. Hence if we run xfs_check or xfs_repair -n while the filesystem is mounted remount,ro we end up with an inconsistent filesystem being reported. Also, running xfs_copy on the remount,ro filesystem will result in an inconsistent image being generated. To fix this, unreserve the blocks when doing the remount,ro, and reserved them again on remount,rw. This way a remount,ro filesystem will appear consistent on disk to all utilities. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-26 15:08:49 +11:00
Christoph Hellwig	bdfb04301f	xfs: replace KM_LARGE with explicit vmalloc use We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc if necessary. As we only need this for a few boot/mount time allocations just switch to explicit vmalloc calls there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:56 -06:00
Christoph Hellwig	a14a348bff	xfs: cleanup up xfs_log_force calling conventions Remove the XFS_LOG_FORCE argument which was always set, and the XFS_LOG_URGE define, which was never used. Split xfs_log_force into a two helpers - xfs_log_force which forces the whole log, and xfs_log_force_lsn which forces up to the specified LSN. The underlying implementations already were entirely separate, as were the users. Also re-indent the new _xfs_log_force/_xfs_log_force which previously had a weird coding style. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:49 -06:00
Christoph Hellwig	0cadda1c5f	xfs: remove duplicate buffer flags Currently we define aliases for the buffer flags in various namespaces, which only adds confusion. Remove all but the XBF_ flags to clean this up a bit. Note that we still abuse XFS_B_ASYNC/XBF_ASYNC for some non-buffer uses, but I'll clean that up later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-21 13:44:36 -06:00
Dave Chinner	a9273ca5c6	xfs: convert attr to use unsigned names To be consistent with the directory code, the attr code should use unsigned names. Convert the names from the vfs at the highest level to unsigned, and ænsure they are consistenly used as unsigned down to disk. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:48 +11:00
Dave Chinner	b9c4864957	xfs: xfs_buf_iomove() doesn't care about signedness xfs_buf_iomove() uses xfs_caddr_t as it's parameter types, but it doesn't care about the signedness of the variables as it is just copying the data. Change the prototype to use void * so that we don't get sign warnings at call sites. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-01-20 10:47:39 +11:00
Christoph Hellwig	4e23471a3f	xfs: move more buffer helpers into xfs_buf.c Move xfsbdstrat and xfs_bdstrat_cb from xfs_lrw.c and xfs_bioerror and xfs_bioerror_relse from xfs_rw.c into xfs_buf.c. This also means xfs_bioerror and xfs_bioerror_relse can be marked static now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:35:17 -06:00
Christoph Hellwig	64e0bc7d2a	xfs: clean up xfs_bwrite Fold XFS_bwrite into it's only caller, xfs_bwrite and move it into xfs_buf.c instead of leaving it as a fairly large inline function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:35:07 -06:00
Christoph Hellwig	873ff5501d	xfs: clean up log buffer writes Don't bother using XFS_bwrite as it doesn't provide much code for our use case. Instead opencode it and fold xlog_bdstrat_cb into the new xlog_bdstrat helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:54 -06:00
Dave Chinner	b657fc82a3	xfs: Kill filestreams cache flush The filestreams cache flush is not needed in the sync code as it does not affect data writeback, and it is now not used by the growfs code, either, so kill it. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:22 -06:00
Dave Chinner	0fa800fbd5	xfs: Add trace points for per-ag refcount debugging. Uninline xfs_perag_{get,put} so that tracepoints can be inserted into them to speed debugging of reference count problems. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:34:12 -06:00
Dave Chinner	5017e97d52	xfs: rename xfs_get_perag xfs_get_perag is really getting the perag that an inode belongs to based on it's inode number. Convert the use of this function to just get the perag from a provided ag number. Use this new function to obtain the per-ag structure when traversing the per AG inode trees for sync and reclaim. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:33:02 -06:00
Dave Chinner	c9c129714e	xfs: Don't wake xfsbufd when idle The xfsbufd wakes every xfsbufd_centisecs (once per second by default) for each filesystem even when the filesystem is idle. If the xfsbufd has nothing to do, put it into a long term sleep and only wake it up when there is work pending (i.e. dirty buffers to flush soon). This will make laptop power misers happy. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:32:54 -06:00
Dave Chinner	453eac8a9a	xfs: Don't wake the aild once per second Now that the AIL push algorithm is traversal safe, we don't need a watchdog function in the xfsaild to catch pushes that fail to make progress. Remove the watchdog timeout and make pushes purely driven by demand. This will remove the once-per-second wakeup that is seen when the filesystem is idle and make laptop power misers happy. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:32:46 -06:00
Eric Sandeen	5d77c0dc0c	xfs: make several more functions static Just minor housekeeping, a lot more functions can be trivially made static; others could if we reordered things a bit... Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:31:38 -06:00
Dave Chinner	3a85cd96d3	xfs: add tracing to xfs_swap_extents To be able to diagnose whether the swap extents function is detecting compatible inode data fork configurations for swapping extents, add tracing points to the code to allow us to see the format of the inode forks before and after the swap. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 15:20:06 -06:00
Dave Chinner	57817c6822	xfs: reclaim all inodes by background tree walks We cannot do direct inode reclaim without taking the flush lock to ensure that we do not reclaim an inode under IO. We check the inode is clean before doing direct reclaim, but this is not good enough because the inode flush code marks the inode clean once it has copied the in-core dirty state to the backing buffer. It is the flush lock that determines whether the inode is still under IO, even though it is marked clean, and the inode is still required at IO completion so we can't reclaim it even though it is clean in core. Hence the requirement that we need to take the flush lock even on clean inodes because this guarantees that the inode writeback IO has completed and it is safe to reclaim the inode. With delayed write inode flushing, we coul dend up waiting a long time on the flush lock even for a clean inode. The background reclaim already handles this efficiently, so avoid all the problems by killing the direct reclaim path altogether. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:44:44 -06:00
Dave Chinner	018027be90	xfs: Avoid inodes in reclaim when flushing from inode cache The reclaim code will handle flushing of dirty inodes before reclaim occurs, so avoid them when determining whether an inode is a candidate for flushing to disk when walking the radix trees. This is based on a test patch from Christoph Hellwig. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:44:21 -06:00
Dave Chinner	c8e20be020	xfs: reclaim inodes under a write lock Make the inode tree reclaim walk exclusive to avoid races with concurrent sync walkers and lookups. This is a version of a patch posted by Christoph Hellwig that avoids all the code duplication. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-15 13:43:55 -06:00
Dave Chinner	fd45e47841	xfs: Ensure we force all busy extents in range to disk When we search for and find a busy extent during allocation we force the log out to ensure the extent free transaction is on disk before the allocation transaction. The current implementation has a subtle bug in it--it does not handle multiple overlapping ranges. That is, if we free lots of little extents into a single contiguous extent, then allocate the contiguous extent, the busy search code stops searching at the first extent it finds that overlaps the allocated range. It then uses the commit LSN of the transaction to force the log out to. Unfortunately, the other busy ranges might have more recent commit LSNs than the first busy extent that is found, and this results in xfs_alloc_search_busy() returning before all the extent free transactions are on disk for the range being allocated. This can lead to potential metadata corruption or stale data exposure after a crash because log replay won't replay all the extent free transactions that cover the allocation range. Modified-by: Alex Elder <aelder@sgi.com> (Dropped the "found" argument from the xfs_alloc_busysearch trace event.) Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:22:02 -06:00
Christoph Hellwig	d6d59bada3	xfs: fix timestamp handling in xfs_setattr We currently have some rather odd code in xfs_setattr for updating the a/c/mtime timestamps: - first we do a non-transaction update if all three are updated together - second we implicitly update the ctime for various changes instead of relying on the ATTR_CTIME flag - third we set the timestamps to the current time instead of the arguments in the iattr structure in many cases. This patch makes sure we update it in a consistent way: - always transactional - ctime is only updated if ATTR_CTIME is set or we do a size update, which is a special case - always to the times passed in from the caller instead of the current time The only non-size caller of xfs_setattr that doesn't come from the VFS is updated to set ATTR_CTIME and pass in a valid ctime value. Reported-by: Eric Blake <ebb9@byu.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:21:58 -06:00
Christoph Hellwig	ea9a48881e	xfs: use DECLARE_EVENT_CLASS Using DECLARE_EVENT_CLASS allows us to to use trace event code instead of duplicating it in the binary. This was not available before 2.6.33 so it had to be done as a separate step once the prerequisite was merged. This only requires changes to xfs_trace.h and the results are rather impressive: hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o* text data bss dec hex filename 607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o 1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-10 12:21:56 -06:00
Dave Chinner	a539bd8c86	xfs: kill some warnings on i386 builds Randy Dunlap Reported printk() format-related warnings reported on i386 builds in his environment. Dave Chinner provided this patch to eliminate them. Signed-off by: Dave Chinner <david@fromorbit.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-01-08 13:32:29 -06:00
Christoph Hellwig	eaff8079d4	kill I_LOCK After I_SYNC was split from I_LOCK the leftover is always used together with I_NEW and thus superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-17 11:03:25 -05:00
Linus Torvalds	bea4c899f2	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: XFS: Free buffer pages array unconditionally xfs: kill xfs_bmbt_rec_32/64 types xfs: improve metadata I/O merging in the elevator xfs: check for not fully initialized inodes in xfs_ireclaim	2009-12-16 13:29:39 -08:00

1 2 3 4 5 ...

894 Commits