OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Christoph Hellwig	5a93a064d2	xfs: do not flush data workqueues in xfs_flush_buftarg When we call xfs_flush_buftarg (generally from sync or umount) it already is too late to flush the data workqueues, as I/O completion is signalled for them and we are thus already done with the data we would flush here. There are places where flushing them might be useful, but the current sync interface doesn't give us that opportunity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 22:34:31 -05:00
Christoph Hellwig	a9add83e5a	xfs: remove XFS_bflush Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:11 -05:00
Christoph Hellwig	02b102df15	xfs: remove xfs_buf_target_name The calling convention that returns a pointer to a static buffer is fairly nasty, so just opencode it in the only caller that is left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:11 -05:00
Christoph Hellwig	b38505b09b	xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks Use xfs_ioerror_alert instead of opencoding a very similar error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	901796afca	xfs: clean up xfs_ioerror_alert Instead of passing the block number and mount structure explicitly get them off the bp and make the argument order more natural. Also move it to xfs_buf.c and stop printing the device name given that we already get the fs name as part of xfs_alert, and we know what device it operates on because of the caller that gets printed. Finally rename it to xfs_buf_ioerror_alert and pass __func__ as argument where it makes sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	4347b9d7ad	xfs: clean up buffer allocation Change _xfs_buf_initialize to allocate the buffer directly and rename it to xfs_buf_alloc now that it is the only buffer allocation routine. Also remove the xfs_buf_deallocate wrapper around the kmem_zone_free calls for buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	af5c4bee49	xfs: remove buffers from the delwri list in xfs_buf_stale For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either directly before or after it, or are guaranteed by the surrounding conditionals that we are never called on delwri buffers. Simplify this situation by moving the call to xfs_buf_delwri_dequeue into xfs_buf_stale. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	c867cb6164	xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:10 -05:00
Christoph Hellwig	38f2323244	xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	5fde0326dd	xfs: remove XFS_BUF_FINISH_IOWAIT Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	b17b833443	xfs: remove xfs_get_buftarg_list The code is unused and under a config option that doesn't exist, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	87c7bec7fc	xfs: fix buffer flushing during unmount The code to flush buffers in the umount code is a bit iffy: we first flush all delwri buffers out, but then might be able to queue up a new one when logging the sb counts. On a normal shutdown that one would get flushed out when doing the synchronous superblock write in xfs_unmountfs_writesb, but we skip that one if the filesystem has been shut down. Fix this by moving the delwri list flushing until just before unmounting the log, and while we're at it also remove the superfluous delwri list and buffer lru flushing for the rt and log device that can never have cached or delwri buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Tested-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Christoph Hellwig	1da2f2dbf2	xfs: optimize fsync on directories Directories are only updated transactionally, which means fsync only needs to flush the log if the inode is currently dirty, but not bother with checking for dirty data, non-transactional updates, and most importantly doesn't have to flush disk caches except as part of a transaction commit. While the first two optimizations can't easily be measured, the latter actually makes a difference when doing lots of fsync that do not actually have to commit the inode, e.g. because an earlier fsync already pushed the log far enough. The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except for the prototype, but I'm not sure creating a common helper for the two is worth it given how simple the functions are. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Dave Chinner	670ce93fef	xfs: reduce the number of log forces from tail pushing The AIL push code will issue a log force on every single push loop that it exits and has encountered pinned items. It doesn't rescan these pinned items until it revisits the AIL from the start. Hence we only need to force the log once per walk from the start of the AIL to the target LSN. This results in numbers like this: xs_push_ail_flush..... 1456 xs_log_force......... 1485 For an 8-way 50M inode create workload - almost all the log forces are coming from the AIL pushing code. Reduce the number of log forces by only forcing the log if the previous walk found pinned buffers. This reduces the numbers to: xs_push_ail_flush..... 665 xs_log_force......... 682 For the same test. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:09 -05:00
Dave Chinner	3815832a2a	xfs: Don't allocate new buffers on every call to _xfs_buf_find Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing ~1 million cache hit lookups to ~3000 buffer creates. That's almost 3 orders of magnitude more cache hits than misses, so optimising for cache hits is quite important. In the cache hit case, we do not need to allocate a new buffer in case of a cache miss, so we are effectively hitting the allocator for no good reason for the vast majority of calls to _xfs_buf_find. 8-way create workloads are showing similar cache hit/miss ratios. The result is profiles that look like this: samples pcnt function DSO _______ _____ _______________________________ _________________ 1036.00 10.0% _xfs_buf_find [kernel.kallsyms] 582.00 5.6% kmem_cache_alloc [kernel.kallsyms] 519.00 5.0% __memcpy [kernel.kallsyms] 468.00 4.5% __ticket_spin_lock [kernel.kallsyms] 388.00 3.7% kmem_cache_free [kernel.kallsyms] 331.00 3.2% xfs_log_commit_cil [kernel.kallsyms] Further, there is a fair bit of work involved in initialising a new buffer once a cache miss has occurred and we currently do that under the rbtree spinlock. That increases spinlock hold time on what are heavily used trees. To fix this, remove the initialisation of the buffer from _xfs_buf_find() and only allocate the new buffer once we've had a cache miss. Initialise the buffer immediately after allocating it in xfs_buf_get, too, so that it is ready for insert if we get another cache miss after allocation. This minimises lock hold time and avoids unnecessary allocator churn. The resulting profiles look like: samples pcnt function DSO _______ _____ ___________________________ _________________ 8111.00 9.1% _xfs_buf_find [kernel.kallsyms] 4380.00 4.9% __memcpy [kernel.kallsyms] 4341.00 4.8% __ticket_spin_lock [kernel.kallsyms] 3401.00 3.8% kmem_cache_alloc [kernel.kallsyms] 2856.00 3.2% xfs_log_commit_cil [kernel.kallsyms] 2625.00 2.9% __kmalloc [kernel.kallsyms] 2380.00 2.7% kfree [kernel.kallsyms] 2016.00 2.3% kmem_cache_free [kernel.kallsyms] Showing a significant reduction in time spent doing allocation and freeing from slabs (kmem_cache_alloc and kmem_cache_free). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	ddc3415aba	xfs: simplify xfs_trans_ijoin* again There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when committing a synchronous transaction, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	23bb0be1a2	xfs: unlock the inode before log force in xfs_change_file_space Let the transaction commit unlock the inode before it potentially causes a synchronous log force. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	8292d88c5c	xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	b103705853	xfs: unlock the inode before log force in xfs_fsync Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. This also removes the only direct caller of _xfs_trans_commit, thus allowing it to be merged into the plain xfs_trans_commit again. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Christoph Hellwig	815cb21662	xfs: XFS_TRANS_SWAPEXT is not a valid flag for xfs_trans_commit XFS_TRANS_SWAPEXT is a transaction type, not a flag for xfs_trans_commit, so don't pass it in xfs_swap_extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:08 -05:00
Lukas Czerner	c029a50d51	xfs: fix possible overflow in xfs_ioc_trim() In xfs_ioc_trim it is possible that computing the last allocation group to discard might overflow for big start & len values, because the result might be bigger than xfs_agnumber_t, which is 32 bits long. Fix this by not allowing the start and end block of the range to be beyond the end of the file system. Note that if the start is beyond the end of the file system we have to return -EINVAL, but in the "end" case we have to truncate it to the fs size. Also introduce an "end" variable, rather than using start+len, which might be more confusing to get right as this bug shows. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	d952e2f812	xfs: cleanup xfs_bmap.h Convert all function prototypes to the short form used elsewhere, and remove duplicates of comments already placed at the function body. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	b0eab14e74	xfs: dont ignore error code from xfs_bmbt_update Fix a case in xfs_bmap_add_extent_unwritten_real where we aren't passing the returned error on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	c653424985	xfs: pass bmalloca to xfs_bmap_add_extent_hole_real All the parameters passed to xfs_bmap_add_extent_hole_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	572a4cf04a	xfs: pass bmalloca to xfs_bmap_add_extent_delay_real All the parameters passed to xfs_bmap_add_extent_delay_real() are in the xfs_bmalloca structure now. Just pass the bmalloca parameter to the function instead of 8 separate parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:07 -05:00
Christoph Hellwig	c315c90b7d	xfs: move logflags into bmalloca Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	e0c3da5d89	xfs: move lastx and nallocs into bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	29c8d17a89	xfs: move btree cursor into bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	963c30cf45	xfs: do not keep local copies of allocation ranges in xfs_bmapi_allocate Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	3a75667e90	xfs: rename allocation range fields in struct xfs_bmalloca Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:06 -05:00
Dave Chinner	0937e0fd8b	xfs: move firstblock and bmap freelist cursor into bmalloca structure Rather than passing the firstblock and freelist structure around, embed it into the bmalloca structure and remove it from the function parameters. This also enables the minleft parameter to be set only once in xfs_bmapi_write(), and the freelist cursor directly queried in xfs_bmapi_allocate to clear it when the lowspace algorithm is activated. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	baf41a52b9	xfs: move extent records into bmalloca structure Rather than putting extent records on the stack and then pointing to them in the bmalloca structure which is in the same stack frame, put the extent records directly in the bmalloca structure. This reduces the number of args that need to be passed around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	1b16447ba2	xfs: pass bmalloca structure to xfs_bmap_isaeof All the variables xfs_bmap_isaeof() is passed are contained within the xfs_bmalloca structure. Pass that instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Christoph Hellwig	a5bd606ba6	xfs: remove xfs_bmap_add_extent There is no real need for xfs_bmap_add_extent, as the callers know what kind of extents they need to add. Removing it means duplicating the extents to btree conversion logic in three places, but overall it's still much simpler code and quite a bit less code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Christoph Hellwig	27a3f8f2de	xfs: introduce xfs_bmap_last_extent Add a common helper for finding the last extent in a file. Largely based on a patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:05 -05:00
Dave Chinner	c0dc7828af	xfs: rename xfs_bmapi to xfs_bmapi_write Now that all the read-only users of xfs_bmapi have been converted to use xfs_bmapi_read(), we can remove all the read-only handling cases from xfs_bmapi(). Once this is done, rename xfs_bmapi to xfs_bmapi_write to reflect the fact it is for allocation only. This enables us to kill the XFS_BMAPI_WRITE flag as well. Also clean up xfs_bmapi_write to the style used in the newly added xfs_bmapi_read/delay functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	b447fe5a05	xfs: factor unwritten extent map manipulations out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the unwritten extent conversion out into a separate function. This removes a large block of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	7e47a4efde	xfs: factor extent allocation out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the extent allocation out into a separate function. This removes a large block of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	1fd044d9c6	xfs: do not use xfs_bmap_add_extent for adding delalloc extents We can just call xfs_bmap_add_extent_hole_delay directly to add a delayed allocated region to the extent tree, instead of going through all the complexities of xfs_bmap_add_extent that aren't needed for this simple case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	4403280aa5	xfs: introduce xfs_bmapi_delay() Delalloc reservations are much simpler than allocations, so give them a separate bmapi-level interface. Using the previously added xfs_bmapi_reserve_delalloc we get a function that is only minimally more complicated than xfs_bmapi_read, which is far from the complexity in xfs_bmapi. Also remove the XFS_BMAPI_DELAY code after switching over the only user to xfs_bmapi_delay. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Christoph Hellwig	b64dfe4e18	xfs: factor delalloc reservations out of xfs_bmapi Move the reservation of delayed allocations, and addition of delalloc regions to the extent trees into a new helper function. For now this adds some twisted goto logic to xfs_bmapi, but that will be cleaned up in the following patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:04 -05:00
Dave Chinner	5b777ad517	xfs: remove xfs_bmapi_single() Now we have xfs_bmapi_read, there is no need for xfs_bmapi_single(). Change the remaining caller over and kill the function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Dave Chinner	5c8ed2021f	xfs: introduce xfs_bmapi_read() xfs_bmapi() currently handles both extent map reading and allocation. As a result, the code is littered with "if (wr)" branches to conditionally do allocation operations if required. This makes the code much harder to follow and causes significant indent issues with the code. Given that read mapping is much simpler than allocation, we can split out read mapping from xfs_bmapi() and reuse the logic that we have already factored out to do all the hard work of handling the extent map manipulations. This results in a much simpler function for the common extent read operations, and will allow the allocation code to be simplified in another commit. Once xfs_bmapi_read() is implemented, convert all the callers of xfs_bmapi() that are only reading extents to use the new function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Dave Chinner	aef9a89586	xfs: factor extent map manipulations out of xfs_bmapi To further improve the readability of xfs_bmapi(), factor the pure extent map manipulations out into separate functions. This removes large blocks of logic from the xfs_bmapi() code loop and makes it easier to see the operational logic flow for xfs_bmapi(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	ecee76ba9d	xfs: remove the nextents variable in xfs_bmapi Instead of using a local variable that needs to be updated when we modify the extent map just check ifp->if_bytes directly where we use it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	b9b984d784	xfs: remove impossible to read code in xfs_bmap_add_extent_delay_real We already have the worst case blocks reserved, so xfs_icsb_modify_counters won't fail in xfs_bmap_add_extent_delay_real. In fact we've had an assert to catch this case since day one and it never triggered. So remove the code to try smaller reservations, and just return the error for that case in addition to keeping the assert. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Christoph Hellwig	e7455e02e5	xfs: remove the first extent special case in xfs_bmap_add_extent Both xfs_bmap_add_extent_hole_delay and xfs_bmap_add_extent_hole_real already contain code to handle the case where there is no extent to merge with, which is effectively the same as the code duplicated here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:03 -05:00
Mitsuo Hayasaka	ed32201e65	xfs: Return -EIO when xfs_vn_getattr() failed An attribute of an inode can be fetched via xfs_vn_getattr() in XFS. Currently it returns EIO, not a negative value, when it fails. As a result, the system call returns a non-negative value even though an error occurred. The stat(2), ls and mv commands cannot handle this error and do not work correctly. This patch fixes this bug, and returns -EIO, not EIO, when an error is detected in xfs_vn_getattr(). Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:02 -05:00
Chandra Seetharaman	eabbaf1182	xfs: Fix the incorrect comment in the header of _xfs_buf_find Fix the incorrect comment in the header of the function _xfs_buf_find(). Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:02 -05:00
Chandra Seetharaman	2a30f36d90	xfs: Check the return value of xfs_trans_get_buf() Check the return value of xfs_trans_get_buf() and fail appropriately. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Chandra Seetharaman	b522950f0a	xfs: Check the return value of xfs_buf_get() Check the return value of xfs_buf_get() and fail appropriately. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	04f658ee22	xfs: improve ioend error handling Return unwritten extent conversion errors to aio_complete. Skip both unwritten extent conversion and size updates if we had an I/O error or the filesystem has been shut down. Return -EIO to the aio/buffer completion handlers in case of a forced shutdown. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	c58cb165bd	xfs: avoid direct I/O write vs buffered I/O race Currently a buffered reader or writer can add pages to the pagecache while we are waiting for the iolock in xfs_file_dio_aio_write. Prevent this by re-checking mapping->nrpages after we got the iolock, and if necessary upgrade the lock to exclusive mode. To simplify this a bit only take the ilock inside of xfs_file_aio_write_checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	859f57ca00	xfs: avoid synchronous transactions when deleting attr blocks Currently xfs_attr_inactive causes a synchronous transaction if we are removing a file that has any extents allocated to the attribute fork, and thus makes XFS extremely slow at removing files with out of line extended attributes. The code looks like a relic from the days before the busy extent list, but with the busy extent list we avoid reusing data and attr extents that have been freed but not committed yet, so this code is just as superfluous as the synchronous transactions for data blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	4a06fd262d	xfs: remove i_iocount We now have an i_dio_count field and surrounding infrastructure to wait for direct I/O completion instead of i_iocount, and we have never needed iocount waits for buffered I/O given that we only set the page uptodate after finishing all required work. Thus remove i_iocount, and replace the actually needed waits with calls to inode_dio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:01 -05:00
Christoph Hellwig	2b3ffd7eb7	xfs: wait for I/O completion when writing out pages in xfs_setattr_size The current code relies on the xfs_ioend_wait call later on to make sure all I/O actually has completed. The xfs_ioend_wait call will go away soon, so prepare for that by using the waiting filemap function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	fc0063c447	xfs: reduce ioend latency There is no reason to queue up ioends for processing in user context unless we actually need it. Just complete ioends that do not convert unwritten extents or need a size update from the end_io context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c859cdd1da	xfs: defer AIO/DIO completions We really shouldn't complete AIO or DIO requests until we have finished the unwritten extent conversion and size update. This means fsync never has to pick up any ioends as all work has been completed when signalling I/O completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	398d25ef23	xfs: remove dead ENODEV handling in xfs_destroy_ioend No driver returns ENODEV from its bio completion handler, nor has this ever been documented. Remove the dead code dealing with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c4e1c098ee	xfs: use the "delwri" terminology consistently And also remove the strange local lock and delwri list pointers in a few functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	c2b006c1da	xfs: let xfs_bwrite callers handle the xfs_buf_relse Remove the xfs_buf_relse from xfs_bwrite and let the caller handle it to mirror the delwri and read paths. Also remove the mount pointer passed to xfs_bwrite, which is superfluous now that we have a mount pointer in the buftarg. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:00 -05:00
Christoph Hellwig	61551f1ee5	xfs: call xfs_buf_delwri_queue directly Unify the ways we add buffers to the delwri queue by always calling xfs_buf_delwri_queue directly. The xfs_bdwrite function is removed and opencoded in its callers, and the two places setting XBF_DELWRI while a buffer is locked and expecting xfs_buf_unlock to pick it up are converted to call xfs_buf_delwri_queue directly, too. Also replace the XFS_BUF_UNDELAYWRITE macro with direct calls to xfs_buf_delwri_dequeue to make the explicit queuing/dequeuing more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	5a8ee6bafd	xfs: move more delwri setup into xfs_buf_delwri_queue Do not transfer a reference held by the caller to the buffer on the list, or decrement it in xfs_buf_delwri_queue, but instead grab a new reference if needed, and let the caller drop its own reference. Also move setting of the XBF_DELWRI and XBF_ASYNC flags into xfs_buf_delwri_queue, and only do it if needed. Note that for now xfs_buf_unlock already has XBF_DELWRI, but that will change in the following patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	527cfdf19d	xfs: remove the unlock argument to xfs_buf_delwri_queue We can just unlock the buffer in the caller, and the decrement of b_hold would also be needed in the !unlock, we just never hit that case currently given that the caller handles that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Christoph Hellwig	375ec69d2e	xfs: remove delwri buffer handling from xfs_buf_iorequest We cannot ever reach xfs_buf_iorequest for a buffer with XBF_DELWRI set, given that all write handlers make sure that the buffer is removed from the delwri queue before, and we never do reads with the XBF_DELWRI flag set (which the code would not handle correctly anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:15:59 -05:00
Dave Chinner	7271d243f9	xfs: don't serialise adjacent concurrent direct IO appending writes For append write workloads, extending the file requires a certain amount of exclusive locking to be done up front to ensure sanity in things like ensuring that we've zeroed any allocated regions between the old EOF and the start of the new IO. For single threads, this typically isn't a problem, and for large IOs we don't serialise enough for it to be a problem for two threads on really fast block devices. However for smaller IO and larger thread counts we have a problem. Take 4 concurrent sequential, single block sized and aligned IOs. After the first IO is submitted but before it completes, we end up with this state: [diagram: four adjacent IOs, IO 1 to IO 4; ip->i_size marks the start of IO 1 and ip->i_new_size marks the end of IO 1] And the IO is done without exclusive locking because offset <= ip->i_size. When we submit IO 2, we see offset > ip->i_size, and grab the IO lock exclusive, because there is a chance we need to do EOF zeroing. However, there is already an IO in progress that avoids the need for IO zeroing because offset <= ip->i_new_size. Hence we could avoid holding the IO lock exclusive for this. Hence after submission of the second IO, we'd end up in this state: [diagram: the same four IOs; ip->i_size still marks the start of IO 1 and ip->i_new_size now marks the end of IO 2] There is no need to grab the i_mutex or the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effectively serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effectively it prevents dispatch of concurrent direct IO writes to the same inode. And so you can see that for the third concurrent IO, we'd avoid exclusive locking for the same reason we avoided the exclusive lock for the second IO. 
Fixing this is a bit more complex than that, because we need to hold a write-submission local value of ip->i_new_size so that clearing the value is only done if no other thread has updated it before our IO completes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Dave Chinner	0c38a2512d	xfs: don't serialise direct IO reads on page cache checks There is no need to grab the i_mutex or the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effectively serialises them as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effectively it prevents dispatch of concurrent direct IO reads to the same inode. Fix this by taking the IO lock shared to check the page cache state, and only then drop it and take the IO lock exclusively if there is work to be done. Hence for the normal direct IO case, no exclusive locking will occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Joern Engel <joern@logfs.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 21:14:59 -05:00
Sachin Prabhu	875cd04381	cifs: Display strictcache mount option in /proc/mounts Commit `d39454ffe4` adds a strictcache mount option. This patch allows the display of this mount option in /proc/mounts when listing shares mounted with the strictcache mount option. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-11 13:13:18 -05:00
J. Bruce Fields	b6d2f1ca3c	nfsd4: more robust ignoring of WANT bits in OPEN Mask out the WANT bits right at the start instead of on each use. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-11 12:15:15 -04:00
J. Bruce Fields	a084daf512	nfsd4: move name-length checks to xdr Again, these checks are better in the xdr code. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-11 12:15:01 -04:00
Christoph Hellwig	0030807c66	xfs: revert to using a kthread for AIL pushing Currently we have a few issues with the way the workqueue code is used to implement AIL pushing: - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system. - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:49 -05:00
Christoph Hellwig	17b38471c3	xfs: force the log if we encounter pinned buffers in .iop_pushbuf We need to check for pinned buffers even in .iop_pushbuf given that inode items flush into the same buffers that may be pinned directly due to operations on the unlinked inode list operating directly on buffers. To do this add a return value to .iop_pushbuf that tells the AIL push about this and use the existing log force mechanisms to unpin it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:48 -05:00
Christoph Hellwig	bc6e588a89	xfs: do not update xa_last_pushed_lsn for locked items If an item was locked we should not update xa_last_pushed_lsn and thus skip it when restarting the AIL scan as we need to be able to lock and write it out as soon as possible. Otherwise heavy lock contention might starve AIL pushing too easily, especially given the larger backoff once we moved xa_last_pushed_lsn all the way to the target lsn. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-10-11 11:02:48 -05:00
Chris Mason	f7f43cc841	Btrfs: make sure not to defrag extents past i_size The btrfs file defrag code will loop through the extents and force COW on them. But if there is a concurrent truncate in the middle of the defrag, it might end up defragging the same range over and over again. The problem is that writepage won't go through and do anything on pages past i_size, so the cow won't happen, so the file will appear to still be fragmented. defrag will end up hitting the same extents again and again. In the worst case, the truncate can actually livelock with the defrag because the defrag keeps creating new ordered extents which the truncate code keeps waiting on. The fix here is to make defrag check for i_size inside the main loop, instead of just once before the looping starts. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-10-11 11:45:55 -04:00
J. Bruce Fields	04f9e664b2	nfsd4: move access/deny validity checks to xdr code I'd rather put more of these sorts of checks into standardized xdr decoders for the various types rather than have them cluttering up the core logic in nfs4proc.c and nfs4state.c. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-11 08:53:12 -04:00
J. Bruce Fields	c30e92df30	nfsd4: ignore WANT bits in open downgrade We don't use WANT bits yet--and sending them can probably trigger a BUG() further down. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-10 18:05:20 -04:00
J. Bruce Fields	b31b30e5c7	nfsd4: cleanup state.h comments These comments are mostly out of date. Reported-by: Bryan Schumaker <bjschuma@netapp.com>	2011-10-10 18:04:46 -04:00
J. Bruce Fields	6409a5a65d	nfsd4: clean up downgrading code In response to some review comments, get rid of the somewhat obscure for-loop with bitops, and improve a comment. Reported-by: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-10 18:04:45 -04:00
J. Bruce Fields	71c3bcd713	nfsd4: fix state lock usage in LOCKU In commit `5ec094c109` "nfsd4: extend state lock over seqid replay logic" I modified the exit logic of all the seqid-based procedures except nfsd4_locku(). Fix the oversight. The result of the bug was a double-unlock while handling the LOCKU procedure, and a warning like: [ 142.150014] WARNING: at kernel/mutex-debug.c:78 debug_mutex_unlock+0xda/0xe0() ... [ 142.152927] Pid: 742, comm: nfsd Not tainted 3.1.0-rc1-SLIM+ #9 [ 142.152927] Call Trace: [ 142.152927] [<ffffffff8105fa4f>] warn_slowpath_common+0x7f/0xc0 [ 142.152927] [<ffffffff8105faaa>] warn_slowpath_null+0x1a/0x20 [ 142.152927] [<ffffffff810960ca>] debug_mutex_unlock+0xda/0xe0 [ 142.152927] [<ffffffff813e4200>] __mutex_unlock_slowpath+0x80/0x140 [ 142.152927] [<ffffffff813e42ce>] mutex_unlock+0xe/0x10 [ 142.152927] [<ffffffffa03bd3f5>] nfs4_lock_state+0x35/0x40 [nfsd] [ 142.152927] [<ffffffffa03b0b71>] nfsd4_proc_compound+0x2a1/0x690 [nfsd] [ 142.152927] [<ffffffffa039f9fb>] nfsd_dispatch+0xeb/0x230 [nfsd] [ 142.152927] [<ffffffffa02b1055>] svc_process_common+0x345/0x690 [sunrpc] [ 142.152927] [<ffffffff81058d10>] ? try_to_wake_up+0x280/0x280 [ 142.152927] [<ffffffffa02b16e2>] svc_process+0x102/0x150 [sunrpc] [ 142.152927] [<ffffffffa039f0bd>] nfsd+0xbd/0x160 [nfsd] [ 142.152927] [<ffffffffa039f000>] ? 0xffffffffa039efff [ 142.152927] [<ffffffff8108230c>] kthread+0x8c/0xa0 [ 142.152927] [<ffffffff813e8694>] kernel_thread_helper+0x4/0x10 [ 142.152927] [<ffffffff81082280>] ? kthread_worker_fn+0x190/0x190 [ 142.152927] [<ffffffff813e8690>] ? gs_change+0x13/0x13 Reported-by: Bryan Schumaker <bjschuma@netapp.com> Tested-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-10-10 18:04:45 -04:00
Li Zefan	2a0f7f5769	Btrfs: fix recursive auto-defrag Follow those steps: # mount -o autodefrag /dev/sda7 /mnt # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1 # sync # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc and then it'll go into a loop: writeback -> defrag -> writeback ... It's because writeback writes [8K, 200K] and then writes [0, 8K]. I tried to make writeback know if the pages are dirtied by defrag, but the patch was a bit intrusive. Here I simply set writeback_index when we defrag a file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-10-10 15:43:34 -04:00
Linus Torvalds	65112dccf8	Merge git://git.samba.org/sfrench/cifs-2.6 * git://git.samba.org/sfrench/cifs-2.6: [CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2	2011-10-10 14:53:11 +12:00
Steve French	9d1e397b7b	[CIFS] Fix first time message on mount, ntlmv2 upgrade delayed to 3.2 Microsoft has a bug with ntlmv2 that requires use of ntlmssp, but we didn't get the required information on when/how to use ntlmssp to old (but once very popular) legacy servers (various NT4 fixpacks for example) until too late to merge for 3.1. Will upgrade to NTLMv2 in NTLMSSP in 3.2 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>	2011-10-07 20:17:56 -05:00
Boaz Harrosh	d866d875f6	ore/exofs: Change the type of the devices array (API change) In the pNFS obj-LD the device table at the layout level needs to point to a device_cache node, where it is possible and likely that many layouts will point to the same device-nodes. In Exofs we have a more orderly structure where we have a single array of devices that repeats twice for a round-robin view of the device table. This patch moves to a model that can be used by the pNFS obj-LD where struct ore_components holds an array of ore_dev-pointers. (ore_dev is newly defined and contains a struct osd_dev *od member.) Each pointer in the array of pointers will point to a bigger user-defined dev_struct. That can be accessed by use of the container_of macro. In Exofs an __alloc_dev_table() function allocates the ore_dev-pointers array as well as an exofs_dev array, in one allocation and does the addresses dance to set everything pointing correctly. It still keeps the double allocation trick for the inodes round-robin view of the table. The device table is always allocated dynamically, also for the single device case. So it is unconditionally freed at umount. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-04 12:13:59 +02:00
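The pointer-array-plus-container_of arrangement described in this entry can be illustrated outside the kernel. The following is a user-space sketch only, not the ore/exofs code: `sketch_container_of`, `struct sketch_dev`, `struct sketch_big_dev`, `to_big_dev` and `sketch_private_via_table` are all hypothetical names, with `sketch_dev` standing in for the small generic device and `sketch_big_dev` for the bigger user-defined structure that embeds it.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's container_of macro (hypothetical name). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of the small, generic device structure. */
struct sketch_dev {
	int id;
};

/* Hypothetical bigger, user-defined structure that embeds the generic one. */
struct sketch_big_dev {
	struct sketch_dev dev;
	int private_data;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
static struct sketch_big_dev *to_big_dev(struct sketch_dev *d)
{
	return sketch_container_of(d, struct sketch_big_dev, dev);
}

/*
 * Build a one-entry table of generic pointers, then read the wrapper's
 * private field back through that table.
 */
static int sketch_private_via_table(int value)
{
	struct sketch_big_dev big = { .dev = { .id = 0 }, .private_data = value };
	struct sketch_dev *table[1] = { &big.dev };

	return to_big_dev(table[0])->private_data;
}
```

The table only ever stores pointers to the generic part, so code that walks it needs no knowledge of the wrapper type; only the owner of the wrapper calls the conversion helper.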
Linus Torvalds	7fd21be75d	Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux * 'btrfs-3.0' of git://github.com/chrismason/linux: Btrfs: force a page fault if we have a shorty copy on a page boundary	2011-10-03 12:17:44 -07:00
Boaz Harrosh	eb507bc189	ore: Make ore_striping_info and ore_calc_stripe_info public The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them and make struct ore_striping_info public. ore_calc_stripe_info is still static, will be made public on first use. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:51 +02:00
Boaz Harrosh	8d2d83a835	exofs: Remove unused data_map member from exofs_sb_info The struct pnfs_osd_data_map data_map member of exofs_sb_info was never used after mount. In fact all its members were duplicated by the ore_layout structure. So just remove the duplicated information. Also removed some stupid, but perfectly supported, restrictions on layout parameters. The case where num_devices is not divisible by mirror_count+1 is perfectly fine since the rotating device view will eventually use all the devices it can get. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>	2011-10-03 17:07:51 +02:00
Boaz Harrosh	5bf696dad4	exofs: Rename struct ore_components comps => oc ore_components already has a comps member so this leads to things like comps->comps which is annoying. The name oc was already used in new code. So rename all old usage of ore_components comps => ore_components oc. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:50 +02:00
H Hartley Sweeten	de74b05ace	exofs/super.c: local functions should be static This quiets the following sparse noise: warning: symbol 'exofs_sync_fs' was not declared. Should it be static? warning: symbol 'exofs_free_sbi' was not declared. Should it be static? warning: symbol 'exofs_get_parent' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:29 +02:00
H Hartley Sweeten	1958c7c284	exofs/ore.c: local functions should be static This quiets the sparse noise: warning: symbol '_calc_trunk_info' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:06:47 +02:00
Josef Bacik	b6316429af	Btrfs: force a page fault if we have a shorty copy on a page boundary A user reported a problem where ceph was getting into 100% cpu usage while doing some writing. It turns out it's because we were doing a short write on a not uptodate page, which means we'd fall back at one page at a time and fault the page in. The problem is our position is on the page boundary, so our fault in logic wasn't actually reading the page, so we'd just spin forever or until the page got read in by somebody else. This will force a readpage if we end up doing a short copy. Alexandre could reproduce this easily with ceph and reports it fixes his problem. I also wrote a reproducer that no longer hangs my box with this patch. Thanks, Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-30 15:23:54 -04:00
Jesper Juhl	95c7545453	CIFS: Don't free volume_info->UNC until we are entirely done with it. In cleanup_volume_info_contents() we kfree(volume_info->UNC); and then proceed to use that variable on the very next line. This causes (at least) Coverity Prevent to complain about use-after-free of that variable (and I guess other checkers may do that as well). There's not any /real/ problem here since we are just using the value of the pointer, not actually dereferencing it, but it's still trivial to silence the tool, so why not? To me at least it also just seems nicer to defer freeing the variable until we are entirely done with it in all respects. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-09-27 18:08:04 +02:00
Paul Bolle	395cf9691d	doc: fix broken references There are numerous broken references to Documentation files (in other Documentation files, in comments, etc.). These broken references are caused by typos in the references, and by renames or removals of the Documentation files. Some broken references are simply odd. Fix these broken references, sometimes by dropping the irrelevant text they were part of. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-09-27 18:08:04 +02:00
Linus Torvalds	b6c8069d35	vfs: remove LOOKUP_NO_AUTOMOUNT flag That flag no longer makes sense, since we don't look up automount points as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT handling was buggy to begin with: it would avoid automounting even for cases where we really needed to do the automount handling, and could return ENOENT for autofs entries that hadn't been instantiated yet. With our new non-eager automount semantics, one discussion has been about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the newfstatat() and fstatat64() system calls), but it's probably not worth it: you can always force at least directory automounting by simply adding the final '/' to the filename, which works for all of the stat family system calls, old and new. So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a result of our bad default behavior. Acked-by: Ian Kent <raven@themaw.net> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-27 08:12:33 -07:00
Trond Myklebust	815d405cef	VFS: Fix the remaining automounter semantics regressions The consensus seems to be that system calls such as stat() etc should not trigger an automount. Neither should the l* versions. This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups that _should_ trigger an automount on the last path element. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> [ Edited to leave out the cases that are already covered by LOOKUP_OPEN, LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally force automounting for their own reasons - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-26 19:16:46 -07:00
Linus Torvalds	d94c177bee	vfs pathname lookup: Add LOOKUP_AUTOMOUNT flag Since we've now turned around and made LOOKUP_FOLLOW not force an automount, we want to add the ability to force an automount event on lookup even if we don't happen to have one of the other flags that force it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..) Most cases will never want to use this, since you'd normally want to delay automounting as long as possible, which usually implies LOOKUP_OPEN (when we open a file or directory, we really cannot avoid the automount any more). But Trond argued sufficiently forcefully that at a minimum bind mounting a file and quotactl will want to force the automount lookup. Some other cases (like nfs_follow_remote_path()) could use it too, although LOOKUP_DIRECTORY would work there as well. This commit just adds the flag and logic, no users yet, though. It also doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and was made irrelevant by the same change that made us not follow on LOOKUP_FOLLOW. Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ian Kent <raven@themaw.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-26 17:44:55 -07:00
Heiko Carstens	c4253cb074	sysfs: add unsigned long cast to prevent compile warning "sysfs: use rb-tree for inode number lookup" added a new printk which causes a new compile warning on s390 (and few other architectures): fs/sysfs/dir.c: In function 'sysfs_link_sibling': fs/sysfs/dir.c:63:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'ino_t' [-Wform Add an explicit unsigned long cast since ino_t is an unsigned long on most architectures. Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-09-26 16:21:15 -07:00
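The cast described in this entry is a general C idiom and can be shown in user space. This is a minimal sketch, not the sysfs code: `format_ino` is a hypothetical helper, and `snprintf` stands in for the kernel's printk.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Format an inode number with %lx. The explicit cast makes the argument
 * type match the format specifier on every architecture, whatever
 * integer type ino_t happens to be.
 */
static int format_ino(char *buf, size_t len, ino_t ino)
{
	return snprintf(buf, len, "%lx", (unsigned long)ino);
}
```

Without the cast, the compiler warns on any architecture where ino_t is not unsigned long, which is the situation the commit message reports for s390.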
J. Bruce Fields	38c2f4b12a	nfsd4: look up stateid's per clientid Use a separate stateid idr per client, and lookup a stateid by first finding the client, then looking up the stateid relative to that client. Also some minor refactoring. This allows us to improve error returns: we can return expired when the clientid is not found and bad_stateid when the clientid is found but not the stateid, as opposed to returning expired for both cases. I hope this will also help to replace the state lock mostly by a per-client lock, but that hasn't been done yet. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-26 17:35:28 -04:00
J. Bruce Fields	36279ac10c	nfsd4: assume test_stateid always has session Test_stateid is 4.1-only and only allowed after a sequence operation, so this check is unnecessary. Cc: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-26 17:35:27 -04:00
J. Bruce Fields	6136d2b409	nfsd4: use idr for stateid's The idr system is designed exactly for generating id and looking up integer id's. Thanks to Trond for pointing it out. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-26 17:35:26 -04:00
J. Bruce Fields	2a74aba799	nfsd4: move client * to nfs4_stateid, add init_stid helper This will be convenient. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-26 17:35:25 -04:00
Linus Torvalds	fed678dc8a	Merge branch 'for-linus' of git://git.kernel.dk/linux-block * 'for-linus' of git://git.kernel.dk/linux-block: floppy: use del_timer_sync() in init cleanup blk-cgroup: be able to remove the record of unplugged device block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request mm: Add comment explaining task state setting in bdi_forker_thread() mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread() block: simplify force plug flush code a little bit block: change force plug flush call order block: Fix queue_flag update when rq_affinity goes from 2 to 1 block: separate priority boosting from REQ_META block: remove READ_META and WRITE_META xen-blkback: fixed indentation and comments xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.	2011-09-21 13:20:21 -07:00
Dave Hansen	32ef43848f	teach /proc/$pid/numa_maps about transparent hugepages This is modeled after the smaps code. It detects transparent hugepages and then does a single gather_stats() for the page as a whole. This has two benefits: 1. It is more efficient since it does many pages in a single shot. 2. It does not have to break down the huge page. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-21 13:15:44 -07:00
Dave Hansen	3200a8aaab	break out numa_maps gather_pte_stats() checks gather_pte_stats() does a number of checks on a target page to see whether it should even be considered for statistics. This breaks that code out in to a separate function so that we can use it in the transparent hugepage case in the next patch. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-21 13:15:44 -07:00
Dave Hansen	eb4866d006	make /proc/$pid/numa_maps gather_stats() take variable page size We need to teach the numa_maps code about transparent huge pages. The first step is to teach gather_stats() that the pte it is dealing with might represent more than one page. Note that we will use this in a moment for transparent huge pages since they use a single pmd_t which _acts_ as a "surrogate" for a bunch of smaller pte_t's. I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to figure out how many _bytes_ "dirty=1" means, you must first know the hugetlbfs page size. That's easier said than done especially if you don't have visibility in to the mount. But, that's probably a discussion for another day especially since it would change behavior to fix it. But, just in case anyone wonders why this patch only passes a '1' in the hugetlb case... Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-21 13:15:44 -07:00
J. Bruce Fields	8335ebd94b	leases: split up generic_setlease into lock/unlock cases Eventually we should probably do the same thing to the file operations as well. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-21 10:40:54 -04:00
Linus Torvalds	43a964a7bf	Merge branch 'for-linus' of git://github.com/chrismason/linux * 'for-linus' of git://github.com/chrismason/linux: Btrfs: reserve sufficient space for ioctl clone	2011-09-20 14:22:55 -07:00
Chris Mason	0a7a0519d1	Merge branch 'btrfs-3.0' into for-linus	2011-09-20 14:49:29 -04:00
Sage Weil	b6f3409b21	Btrfs: reserve sufficient space for ioctl clone Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We need to reserve space for: - adjusting the old extent (possibly splitting it) - adding the new extent - updating the inode Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-20 14:48:51 -04:00
J. Bruce Fields	c856694e3d	nfsd4: make op_cacheresult another flag I'm not sure why I used a new field for this originally. Also, the differences between some of these flags are a little subtle; add some comments to explain. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-20 14:45:51 -04:00
J. Bruce Fields	3d02fa29de	nfsd4: fix open downgrade, again Yet another open-management regression: - nfs4_file_downgrade() doesn't remove the BOTH access bit on downgrade, so the server's idea of the stateid's access gets out of sync with the client's. If we want to keep an O_RDWR open in this case, we should do that in the file_put_access logic rather than here. - We forgot to convert v4 access to an open mode here. This logic has proven too hard to get right. In the future we may consider: - reexamining the lock/openowner relationship (locks probably don't really need to take their own references here). - adding open upgrade/downgrade support to the vfs. - removing the atomic operations. They're redundant as long as this is all under some other lock. Also, maybe some kind of additional static checking would help catch O_/NFS4_SHARE_ACCESS confusion. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-20 14:43:39 -04:00
Shirish Pargaonkar	cfbd6f84c2	cifs: Fix broken sec=ntlmv2/i sec option (try #2 ) Fix sec=ntlmv2/i authentication option during mount of Samba shares. cifs client was coding ntlmv2 response incorrectly. All that is needed in temp as specified in MS-NLMP section 3.3.2 "Define ComputeResponse(NegFlg, ResponseKeyNT, ResponseKeyLM, CHALLENGE_MESSAGE.ServerChallenge, ClientChallenge, Time, ServerName) as Set temp to ConcatenationOf(Responserversion, HiResponserversion, Z(6), Time, ClientChallenge, Z(4), ServerName, Z(4)" is MsvAvNbDomainName. For sec=ntlmsspi, build_av_pair is not used, a blob is plucked from type 2 response sent by the server to use in authentication. I tested sec=ntlmv2/i and sec=ntlmssp/i mount options against Samba (3.6) and Windows - XP, 2003 Server and 7. They all worked. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-09-19 21:16:58 -05:00
Steve French	c9c7fa0064	Fix the conflict between rwpidforward and rw mount options Both these options are started with "rw" - that's why the first one isn't switched on even if it is specified. Fix this by adding a length check for "rw" option check. Cc: <stable@kernel.org> Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-09-19 21:16:20 -05:00
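The prefix clash described in this entry is easy to reproduce in isolation. The sketch below is not the CIFS option parser: `is_rw_prefix_only` and `is_rw_exact` are hypothetical helpers, and plain `strncmp` is an assumption (the real parser may compare differently, for example case-insensitively).

```c
#include <stdbool.h>
#include <string.h>

/* Buggy pattern: a two-character prefix compare also matches "rwpidforward". */
static bool is_rw_prefix_only(const char *opt)
{
	return strncmp(opt, "rw", 2) == 0;
}

/* Fixed pattern: additionally require the option to be exactly two characters. */
static bool is_rw_exact(const char *opt)
{
	return strncmp(opt, "rw", 2) == 0 && strlen(opt) == 2;
}
```

With the prefix-only test, an option chain that checks "rw" first consumes "rwpidforward" and the longer option is never reached; the added length check lets it fall through.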
Pavel Shilovsky	5b980b0121	CIFS: Fix ERR_PTR dereference in cifs_get_root move it to the beginning of the loop. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-09-19 21:15:03 -05:00
Jeff Layton	9438fabb73	cifs: fix possible memory corruption in CIFSFindNext The name_len variable in CIFSFindNext is a signed int that gets set to the resume_name_len in the cifs_search_info. The resume_name_len however is unsigned and for some infolevels is populated directly from a 32 bit value sent by the server. If the server sends a very large value for this, then that value could look negative when converted to a signed int. That would make that value pass the PATH_MAX check later in CIFSFindNext. The name_len would then be used as a length value for a memcpy. It would then be treated as unsigned again, and the memcpy scribbles over a ton of memory. Fix this by making the name_len an unsigned value in CIFSFindNext. Cc: <stable@kernel.org> Reported-by: Darren Lavender <dcl@hppine99.gbr.hp.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2011-09-19 21:14:40 -05:00
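The signedness problem described in this entry can be shown with a few lines of user-space C. This is a sketch under stated assumptions, not the CIFSFindNext code: `check_len_signed`, `check_len_unsigned` and `SKETCH_PATH_MAX` are hypothetical names, and it assumes a compiler where converting an out-of-range 32-bit unsigned value to `int` wraps to a negative number (implementation-defined in C, but the usual behaviour).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PATH_MAX 4096 /* stand-in for PATH_MAX; hypothetical name */

/* Buggy pattern: the server-supplied length is stored in a signed int. */
static bool check_len_signed(uint32_t wire_len)
{
	int name_len = (int)wire_len; /* a very large value becomes negative */

	return name_len < SKETCH_PATH_MAX; /* so it passes the bounds check */
}

/* Fixed pattern: keep the length unsigned so large values are rejected. */
static bool check_len_unsigned(uint32_t wire_len)
{
	unsigned int name_len = wire_len;

	return name_len < SKETCH_PATH_MAX;
}
```

A length that passes the signed check would later be handed to memcpy, where it is treated as a huge unsigned size; the unsigned variant rejects it at the check instead.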
Linus Torvalds	50f2d407c0	Merge branch 'for-linus' of git://github.com/chrismason/linux * 'for-linus' of git://github.com/chrismason/linux: Btrfs: only clear the need lookup flag after the dentry is setup BTRFS: Fix lseek return value for error Btrfs: don't change inode flag of the dest clone file Btrfs: don't make a file partly checksummed through file clone Btrfs: fix pages truncation in btrfs_ioctl_clone() btrfs: fix d_off in the first dirent	2011-09-19 17:17:32 -07:00
J. Bruce Fields	f7a4d87207	nfsd4: hash closed stateid's like any other Look up closed stateid's in the stateid hash like any other stateid rather than searching the close lru. This is simpler, and fixes a bug: currently we handle only the case of a close that is the last close for a given stateowner, but not the case of a close for a stateowner that still has active opens on other files. Thus in a case like: open(owner, file1) open(owner, file2) close(owner, file2) close(owner, file2) the final close won't be recognized as a retransmission. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-19 08:39:34 -04:00
J. Bruce Fields	d3b313a463	nfsd4: construct stateid from clientid and counter Including the full clientid in the on-the-wire stateid allows more reliable detection of bad vs. expired stateid's, simplifies code, and ensures we won't reuse the opaque part of the stateid (as we currently do when the same openowner closes and reopens the same file). Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-19 06:33:57 -04:00
Josef Bacik	a66e7cc626	Btrfs: only clear the need lookup flag after the dentry is setup We can race with readdir and the RCU path walking stuff. This is because we clear the need lookup flag before actually instantiating the inode. This will lead the RCU path walk stuff to find a dentry it thinks is valid without a d_inode attached. So instead unhash the dentry when we first start the lookup, and then clear the flag after we've instantiated the dentry so we're guaranteed to either try the slow lookup, or have the d_inode set properly. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:34:03 -04:00
Jeff Liu	48802c8ae2	BTRFS: Fix lseek return value for error The recent reworking of btrfs' lseek led to incorrect values being returned. This adds checks for seeking beyond EOF in SEEK_HOLE and makes sure the error values come back correct. Andi Kleen also sent in similar patches. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:34:02 -04:00
Chris Mason	2cf4ce7c2a	Merge branch 'btrfs-3.0' into for-linus	2011-09-18 10:31:44 -04:00
Li Zefan	dde820fbf7	Btrfs: don't change inode flag of the dest clone file The dst file will have the same inode flags as the src file after file clone, and I think it's unexpected. For example, the dst file will suddenly become immutable after getting some share of data with src file, if the src is immutable. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:20:46 -04:00
Li Zefan	0e7b824c4e	Btrfs: don't make a file partly checksummed through file clone To reproduce the bug: # mount /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/src bs=4K count=1 # umount /mnt # mount -o nodatasum /dev/sda7 /mnt # dd if=/dev/zero of=/mnt/dst bs=4K count=1 # clone_range -s 4K -l 4K /mnt/src /mnt/dst # echo 3 > /proc/sys/vm/drop_caches # cat /mnt/dst # dmesg ... btrfs no csum found for inode 258 start 0 btrfs csum failed ino 258 off 0 csum 2566472073 private 0 It's because part of the file is checksummed and the other part is not, and then btrfs will complain checksum is not found when we read the file. Disallow file clone if src and dst file have different checksum flag, so we ensure a file is completely checksummed or unchecksummed. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:20:46 -04:00
Li Zefan	71ef078610	Btrfs: fix pages truncation in btrfs_ioctl_clone() It's a bug in commit `f81c9cdc56` (Btrfs: truncate pages from clone ioctl target range) We should pass the dest range to the truncate function, but not the src range. Also move the function before locking extent state. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:20:46 -04:00
Hidetoshi Seto	3765fefaee	btrfs: fix d_off in the first dirent Since the d_off in the first dirent for "." (that originates from the 4th argument "offset" of filldir() for the 2nd dirent for "..") is wrongly assigned in btrfs_real_readdir(), telldir returns same offset for different locations. | # mkfs.btrfs /dev/sdb1 | # mount /dev/sdb1 fs0 | # cd fs0 | # touch file0 file1 | # ../test | telldir: 0 | readdir: d_off = 2, d_name = "." | telldir: 2 | readdir: d_off = 2, d_name = ".." | telldir: 2 | readdir: d_off = 3, d_name = "file0" | telldir: 3 | readdir: d_off = 2147483647, d_name = "file1" | telldir: 2147483647 To fix this problem, pass filp->f_pos (which is loff_t) instead. | # ../test | telldir: 0 | readdir: d_off = 1, d_name = "." | telldir: 1 | readdir: d_off = 2, d_name = ".." | telldir: 2 | readdir: d_off = 3, d_name = "file0" : At the moment the "offset" for "." is unused because there is no preceding dirent, however it is better to pass filp->f_pos to follow grammatical usage. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-18 10:20:46 -04:00
J. Bruce Fields	2da1cec713	nfsd4: simplify free_stateid We no longer need is_deleg_stateid, for example. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-17 10:31:16 -04:00
J. Bruce Fields	38c387b52d	nfsd4: match close replays on stateid, not open owner id Keep around an unhashed copy of the final stateid after the last close using an openowner, and when identifying a replay, match against that stateid instead of just against the open owner id. Free it the next time the seqid is bumped or the stateowner is destroyed. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-17 10:01:54 -04:00
J. Bruce Fields	dad1c067eb	nfsd4: replace oo_confirmed by flag bit I want at least one more bit here. So, let's haul out the caps lock key and add a flags field. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-16 17:44:16 -04:00
Mi Jinlong	58e7b33a58	nfsd41: try to check reply size before operation For checking the size of the reply before calling an operation, we need to try to get the maxsize of the operation's reply. v3: using new method as Bruce said, "we could handle operations in two different ways: - For operations that actually change something (write, rename, open, close, ...), do it the way we're doing it now: be very careful to estimate the size of the response before even processing the operation. - For operations that don't change anything (read, getattr, ...) just go ahead and do the operation. If you realize after the fact that the response is too large, then return the error at that point. So we'd add another flag to op_flags: say, OP_MODIFIES_SOMETHING. And for operations with OP_MODIFIES_SOMETHING set, we'd do the first thing. For operations without it set, we'd do the second." Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> [bfields@redhat.com: crash, don't attempt to handle, undefined op_rsize_bop] Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-16 10:31:01 -04:00
Linus Torvalds	17d8428e4c	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Do not allow multiple mounts on same mountpoint when using -o noac NFS: Fix a typo in nfs_flush_multi NFSv4: renewd needs to be able to handle the NFS4ERR_CB_PATH_DOWN error NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation NFSv4: nfs4_proc_renew should be declared static NFSv4: nfs4_proc_async_renew should use a GFP_NOFS allocation	2011-09-15 12:36:01 -07:00
Christoph Hellwig	f1fcd9f0e9	hfsplus: fix filesystem size checks generic_check_addressable can't deal with hfsplus's larger than page size allocation blocks, so simply opencode the checks that we actually need in hfsplus_fill_super. Signed-off-by: Christoph Hellwig <hch@tuxera.com> Reported-by: Pavel Ivanov <paivanof@gmail.com> Tested-by: Pavel Ivanov <paivanof@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-15 09:03:17 -07:00
Seth Forshee	f588c960fc	hfsplus: Fix kfree of wrong pointers in hfsplus_fill_super() error path Commit `6596528e39` ("hfsplus: ensure bio requests are not smaller than the hardware sectors") changed the pointers used for volume header allocations but failed to free the correct pointers in the error path of hfsplus_fill_super() and hfsplus_read_wrapper. The second hunk came from a separate patch by Pavel Ivanov. Reported-by: Pavel Ivanov <paivanof@gmail.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Christoph Hellwig <hch@tuxera.com> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-15 09:03:16 -07:00
Jiri Kosina	e060c38434	Merge branch 'master' into for-next Fast-forward merge with Linus to be able to merge patches based on more recent version of the tree.	2011-09-15 15:08:18 +02:00
Joe Perches	558feb0818	fs: Convert vmalloc/memset to vzalloc Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Alex Elder <aelder@sgi.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-09-15 13:56:28 +02:00
Linus Torvalds	53d872e995	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix a use after free in xfs_end_io_direct_write	2011-09-14 16:08:29 -07:00
Al Viro	1d2ef59014	restore pinning the victim dentry in vfs_rmdir()/vfs_rename_dir() We used to get the victim pinned by dentry_unhash() prior to commit `64252c75a2` ("vfs: remove dget() from dentry_unhash()") and ->rmdir() and ->rename() instances relied on that; most of them don't care, but ones that used d_delete() themselves do. As a result, we are getting rmdir() oopses on NFS now. Just grab the reference before locking the victim and drop it explicitly after unlocking, same as vfs_rename_other() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Simon Kirby <sim@hostway.ca> Cc: stable@kernel.org (3.0.x) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-14 11:31:55 -07:00
Christoph Hellwig	2d2422aebc	xfs: fix a use after free in xfs_end_io_direct_write There is a window in which the ioend that we call inode_dio_wake on in xfs_end_io_direct_write is already free. Fix this by storing the inode pointer in a local variable. This is a fix for the regression introduced in 3.1-rc by "fs: move inode_dio_done to the end_io handler". Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-09-14 08:56:35 -05:00
Mi Jinlong	849a1cf13d	SUNRPC: Replace svc_addr_u by sockaddr_storage For an IPv6 local address, lockd cannot call back to the client because the scope id is missing when binding the address at inet6_bind: if (addr_type & IPV6_ADDR_LINKLOCAL) { if (addr_len >= sizeof(struct sockaddr_in6) && addr->sin6_scope_id) { /* Override any existing binding, if another one * is supplied by user. */ sk->sk_bound_dev_if = addr->sin6_scope_id; } /* Binding to link-local address requires an interface */ if (!sk->sk_bound_dev_if) { err = -EINVAL; goto out_unlock; } Replacing svc_addr_u by sockaddr_storage lets rqstp->rq_daddr contain more info besides the address. Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-14 08:21:48 -04:00
Trond Myklebust	11fcee0293	NFSD: Add a cache for fs_locations information Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> [ cel: since this is server-side, use nfsd4_ prefix instead of nfs4_ prefix. ] [ cel: implement S_ISVTX filter in bfields-normal form ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 22:44:17 -04:00
Trond Myklebust	2f1ddda174	NFSD: Remove the ex_pathname field from struct svc_export There are no more users... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 22:44:10 -04:00
Trond Myklebust	ed748aacb8	NFSD: Cleanup for nfsd4_path() The current code is sort of hackish in that it assumes a referral is always matched to an export. When we add support for junctions that may not be the case. We can replace nfsd4_path() with a function that encodes the components directly from the dentries. Since nfsd4_path is currently the only user of the 'ex_pathname' field in struct svc_export, this has the added benefit of allowing us to get rid of that. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 22:43:42 -04:00
J. Bruce Fields	ee626a77d3	nfsd4: better stateid hashing First, we shouldn't care here about the structure of the opaque part of the stateid. Second, this hash is really dumb. (I'm not sure the replacement is much better, though--to look at in another patch.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:36 -04:00
J. Bruce Fields	69064a2764	nfsd4: use deleg changes to cleanup preprocess_stateid_op Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:36 -04:00
J. Bruce Fields	97b7e3b6d4	nfsd4: fix test_stateid for delegation stateid's Test_stateid should handle delegation stateid's as well. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:35 -04:00
J. Bruce Fields	f459e45359	nfsd4: hash deleg stateid's like any other It's simpler to look up delegation stateid's in the same hash table as any other stateid. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:34 -04:00
J. Bruce Fields	36d44c6038	nfsd4: share common stid-hashing helper function Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:33 -04:00
J. Bruce Fields	d5477a8db8	nfsd4: add common dl_stid field to delegation Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:30:32 -04:00
J. Bruce Fields	dcef0413da	nfsd4: move some of nfs4_stateid into a separate structure We want delegations to share more with open/lock stateid's, so first we'll pull out some of the common stuff we want to share. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:29:58 -04:00
J. Bruce Fields	91a8c04031	nfsd4: remove redundant stateid initialization Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:29:04 -04:00
J. Bruce Fields	881ea2b11e	nfsd4: rename init_stateid Note this is actually open-stateid specific. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:29:03 -04:00
J. Bruce Fields	2288d0e395	nfsd4: pass around typemask instead of flags We're only using those flags to choose lock or open stateid's at this point. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:29:00 -04:00
J. Bruce Fields	c0a5d93efb	nfsd4: split preprocess_seqid, cleanup Move most of this into helper functions. Also move the non-CONFIRM case into caller, providing a helper function for that purpose. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:27:35 -04:00
J. Bruce Fields	4d71ab8751	nfsd4: split up find_stateid Minor cleanup. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:27:31 -04:00
J. Bruce Fields	4581d14099	nfsd4: rearrange to avoid a forward reference Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-13 18:25:39 -04:00
Sachin Prabhu	fb2088ccc1	nfs: Do not allow multiple mounts on same mountpoint when using -o noac Do not allow multiple mounts on same mountpoint when using -o noac When you normally attempt to mount a share twice on the same mountpoint, a check in do_add_mount causes it to return an error # mount localhost:/nfsv3 /mnt # mount localhost:/nfsv3 /mnt mount.nfs: /mnt is already mounted or busy However when using the option 'noac', the user is able to mount the same share on the same mountpoint multiple times. This happens because a share mounted with the noac option is automatically assigned the 'sync' flag MS_SYNCHRONOUS in nfs_initialise_sb(). This flag is set after the check for already existing superblocks is done in sget(). The check for the mount flags in nfs_compare_mount_options() does not take into account the 'sync' flag applied later on in the code path. This means that when using 'noac', a new superblock structure is assigned for every new mount of the same share and multiple shares on the same mountpoint are allowed. ie. # mount -onoac localhost:/nfsv3 /mnt can be run multiple times. The patch checks for noac and assigns the sync flag before sget() is called to obtain an already existing superblock structure. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-09-13 17:10:15 -04:00
Trond Myklebust	f13c3620a4	NFS: Fix a typo in nfs_flush_multi Fix a typo which causes an Oops in the RPC layer, when using wsize < 4k. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Sricharan R <r.sricharan@ti.com>	2011-09-13 17:06:57 -04:00
Linus Torvalds	0b001b2eda	Merge branch 'for-linus' of git://github.com/chrismason/linux * 'for-linus' of git://github.com/chrismason/linux: Btrfs: add dummy extent if dst offset excceeds file end in Btrfs: calc file extent num_bytes correctly in file clone btrfs: xattr: fix attribute removal Btrfs: fix wrong nbytes information of the inode Btrfs: fix the file extent gap when doing direct IO Btrfs: fix unclosed transaction handle in btrfs_cont_expand Btrfs: fix misuse of trans block rsv Btrfs: reset to appropriate block rsv after orphan operations Btrfs: skip locking if searching the commit root in csum lookup btrfs: fix warning in iput for bad-inode Btrfs: fix an oops when deleting snapshots	2011-09-12 11:47:49 -07:00
Miklos Szeredi	5dfcc87fd7	fuse: fix memory leak kmemleak is reporting that 32 bytes are being leaked by FUSE: unreferenced object 0xe373b270 (size 32): comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<b05517d7>] kmemleak_alloc+0x27/0x50 [<b0196435>] kmem_cache_alloc+0xc5/0x180 [<b02455be>] fuse_alloc_forget+0x1e/0x20 [<b0245670>] fuse_alloc_inode+0xb0/0xd0 [<b01b1a8c>] alloc_inode+0x1c/0x80 [<b01b290f>] iget5_locked+0x8f/0x1a0 [<b0246022>] fuse_iget+0x72/0x1a0 [<b02461da>] fuse_get_root_inode+0x8a/0x90 [<b02465cf>] fuse_fill_super+0x3ef/0x590 [<b019e56f>] mount_nodev+0x3f/0x90 [<b0244e95>] fuse_mount+0x15/0x20 [<b019d1bc>] mount_fs+0x1c/0xc0 [<b01b5811>] vfs_kern_mount+0x41/0x90 [<b01b5af9>] do_kern_mount+0x39/0xd0 [<b01b7585>] do_mount+0x2e5/0x660 [<b01b7966>] sys_mount+0x66/0xa0 This leak report is consistent and happens once per boot on 3.1.0-rc5-dirty. This happens if a FORGET request is queued after the fuse device was released. Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-12 11:47:10 -07:00
Miklos Szeredi	24114504c4	fuse: fix flock breakage Commit `37fb3a30b4` ("fuse: fix flock") added in 3.1-rc4 caused flock() to fail with ENOSYS with the kernel ABI version 7.16 or earlier. Fix by falling back to testing FUSE_POSIX_LOCKS for ABI versions 7.16 and earlier. Reported-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Martin Ziegler <ziegler@email.mathematik.uni-freiburg.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-12 11:47:10 -07:00
Li Zefan	d525e8ab02	Btrfs: add dummy extent if dst offset excceeds file end in You can see there's no file extent with range [0, 4096]. Check this by btrfsck: # btrfsck /dev/sda7 root 5 inode 258 errors 100 ... Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:25 -04:00
Li Zefan	d72c0842ff	Btrfs: calc file extent num_bytes correctly in file clone num_bytes should be 4096 not 12288. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:25 -04:00
David Sterba	4815053aba	btrfs: xattr: fix attribute removal An attribute is not removed by 'setfattr -x attr file' and remains visible in attr list. This makes xfstests/062 pass again. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:25 -04:00
Miao Xie	a39f752143	Btrfs: fix wrong nbytes information of the inode If we write some data into a data hole of the file (no preallocation for this hole), Btrfs will allocate some disk space and update nbytes of the inode, but the other element, disk_i_size, needn't be updated. In this case, we must update the inode metadata even though disk_i_size is not changed (btrfs_ordered_update_i_size() returns 1). # mkfs.btrfs /dev/sdb1 # mount /dev/sdb1 /mnt # touch /mnt/a # truncate -s 856002 /mnt/a # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc # umount /mnt # btrfsck /dev/sdb1 root 5 inode 257 errors 400 found 32768 bytes used err is 1 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:25 -04:00
Miao Xie	0c1a98c814	Btrfs: fix the file extent gap when doing direct IO When we write some data to a place that is beyond the end of the file in direct I/O mode, a data hole will be created, and Btrfs should insert a file extent item that points to this hole into the fs tree. But unfortunately Btrfs forgets to do it. The following is a simple way to reproduce it: # mkfs.btrfs /dev/sdc2 # mount /dev/sdc2 /test4 # touch /test4/a # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc # umount /test4 # btrfsck /dev/sdc2 root 5 inode 257 errors 100 Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Miao Xie	5b397377e9	Btrfs: fix unclosed transaction handle in btrfs_cont_expand The function btrfs_cont_expand() forgot to close the transaction handle before it jumps out of the while loop. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Liu Bo	98c9942aca	Btrfs: fix misuse of trans block rsv At the beginning of create_pending_snapshot, trans->block_rsv is set to pending->block_rsv and is used for snapshot things, however, when it is done, we do not recover it as well. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Liu Bo	65450aa645	Btrfs: reset to appropriate block rsv after orphan operations While truncating free space cache, we forget to change trans->block_rsv back to the original one, but leave it with the orphan_block_rsv, and then with option inode_cache enabled, it leads to countless warnings of btrfs_alloc_free_block and btrfs_orphan_commit_root: WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]() ... WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]() Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Josef Bacik	ddf23b3fc6	Btrfs: skip locking if searching the commit root in csum lookup It's not enough to just search the commit root, since we could be cow'ing the very block we need to search through, which would mean that it's locked and we'll still deadlock. So use path->skip_locking as well. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Sergei Trofimovich	e0b6d65be5	btrfs: fix warning in iput for bad-inode iput() shouldn't be called for inodes in I_NEW state. We need to mark inode as constructed first. WARNING: at fs/inode.c:1309 iput+0x20b/0x210() Call Trace: [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0 [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20 [<ffffffff810eaf0b>] iput+0x20b/0x210 [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0 [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210 [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0 [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0 [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280 [<ffffffff8105af86>] kthread+0x96/0xa0 [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10 [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190 [<ffffffff814336d0>] ? gs_change+0xb/0xb Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: David Sterba <dsterba@suse.cz> CC: Josef Bacik <josef@redhat.com> CC: Chris Mason <chris.mason@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Liu Bo	14c7cca780	Btrfs: fix an oops when deleting snapshots We can reproduce this oops via the following steps: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs $ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done $ rm -fr /mnt/btrfs/* $ rm -fr /mnt/btrfs/* then we'll get ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:2264! [...] Call Trace: [<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs] [<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0 [<ffffffff81153cc3>] do_rmdir+0x123/0x140 [<ffffffff81145ac7>] ? fput+0x197/0x260 [<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40 [<ffffffff8147896b>] system_call_fastpath+0x16/0x1b RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs] When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID, while the snapshot's location.objectid remains unchanged. However, btrfs_ino() does not take this into account, and returns a wrong ino, and causes the oops. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-09-11 10:52:24 -04:00
Linus Torvalds	290a1cc4f7	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. md/raid1,10: Remove use-after-free bug in make_request. md/raid10: unify handling of write completion. Avoid dereferencing a 'request_queue' after last close.	2011-09-10 10:19:15 -07:00
NeilBrown	94007751bb	Avoid dereferencing a 'request_queue' after last close. On the last close of an 'md' device which has been stopped, the device is destroyed and in particular the request_queue is freed. The free is done in a separate thread so it might happen a short time later. __blkdev_put calls bdev_inode_switch_bdi after ->release has been called. Since commit `f758eeabeb` bdev_inode_switch_bdi will dereference the 'old' bdi, which lives inside a request_queue, to get a spin lock. This causes the last close on an md device to sometimes take a spin_lock which lives in freed memory - which results in an oops. So move the call to bdev_inode_switch_bdi before the call to ->release. Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:20:21 +10:00
Linus Torvalds	0d20fbbe82	Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix leak of osd structs during shutdown ceph: fix memory leak ceph: fix encoding of ino only (not relative) paths libceph: fix msgpool	2011-09-09 15:48:34 -07:00
Miklos Szeredi	0ec26fd069	vfs: automount should ignore LOOKUP_FOLLOW Prior to 2.6.38 automount would not trigger on either stat(2) or lstat(2) on the automount point. After 2.6.38, with the introduction of the ->d_automount() infrastructure, stat(2) and others would start triggering automount while lstat(2), etc. still would not. This is a regression and a userspace ABI change. Problem originally reported here: http://thread.gmane.org/gmane.linux.kernel.autofs/6098 It appears that there was an attempt at fixing various userspace tools to not trigger the automount. But since the stat system call is rather common it is impossible to "fix" all userspace. This patch reverts to the original behavior, which is to not trigger on stat(2) and other symlink following syscalls. [ It's not really clear what the right behavior is. Apparently Solaris does the "automount on stat, leave alone on lstat". And some programs can get unhappy when "stat+open+fstat" ends up giving a different result from the fstat than from the initial stat. But the change in 2.6.38 resulted in problems for some people, so we're going back to old behavior. Maybe we can re-visit this discussion at some future date - Linus ] Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Ian Kent <raven@themaw.net> Cc: David Howells <dhowells@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-09-09 15:42:34 -07:00
Michal Hocko	a25cac5198	proc: Consider NO_HZ when printing idle and iowait times show_stat handler of the /proc/stat file relies on kstat_cpu(cpu) statistics when printing information about idle and iowait times. This is OK if we are not using a tickless kernel (CONFIG_NO_HZ) because counters are updated periodically. With NO_HZ things get more tricky because we are not doing idle/iowait accounting while we are tickless so the value might get outdated. Users of /proc/stat will notice that by unchanged idle/iowait values which is then interpreted as 0% idle/iowait time. From the user space POV this is an unexpected behavior and a change of the interface. Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the total idle/iowait time since boot and it doesn't rely on sampling or any other periodic activity. Fall back to the previous behavior if NO_HZ is disabled or not configured. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-09-08 11:10:55 +02:00
Linus Torvalds	54d6d53744	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 and git://git.infradead.org/ubi-2.6 * branch 'linux-next' of git://git.infradead.org/ubifs-2.6: UBIFS: not build debug messages with CONFIG_UBIFS_FS_DEBUG disabled * branch 'linux-next' of git://git.infradead.org/ubi-2.6: UBI: do not link debug messages when debugging is disabled	2011-09-07 09:51:43 -07:00
J. Bruce Fields	4665e2bac5	nfsd4: split out some free_generic_stateid code We'll use this elsewhere. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-07 09:47:23 -04:00
J. Bruce Fields	fe0750e5c4	nfsd4: split stateowners into open and lockowners The stateowner has some fields that only make sense for openowners, and some that only make sense for lockowners, and I find it a lot clearer if those are separated out. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-07 09:45:49 -04:00
Jim Garlick	51b8b4fb32	fs/9p: Use protocol-defined value for lock/getlock 'type' field. Signed-off-by: Jim Garlick <garlick@llnl.gov> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-09-06 08:17:16 -05:00
Aneesh Kumar K.V	73f507171c	fs/9p: Always ask new inode in lookup for cache mode disabled This makes sure we don't end up reusing the unlinked inode object. The ideal way is to use inode i_generation. But i_generation is not always available in userspace. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-09-06 08:17:15 -05:00
Aneesh Kumar K.V	f88657ce3f	fs/9p: Add OS dependent open flags in 9p protocol Some of the flags are OS/arch dependent, so we add a 9p protocol value which maps to asm-generic/fcntl.h values in Linux. Based on the original patch from Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-09-06 08:17:15 -05:00
Aneesh Kumar K.V	45089142b1	fs/9p: Don't update file type when updating file attributes We should only update attributes that we can change on stat2inode. Also do file type initialization in v9fs_init_inode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-09-06 08:17:14 -05:00
Aneesh Kumar K.V	5441ae5eb3	fs/9p: Add fid before dentry instantiation d_instantiate marks the dentry positive. So a parallel lookup and mkdir of the directory can find a dentry that doesn't have a fid attached. This can result in both code paths doing v9fs_fid_add, which results in a v9fs_dentry leak. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-09-06 08:17:14 -05:00
J. Bruce Fields	f4dee24cca	nfsd4: move CLOSE_STATE special case to caller Move the CLOSE_STATE case into the unique caller that cares about it rather than putting it in preprocess_seqid_op. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-03 23:15:28 -04:00
J. Bruce Fields	68b66e8270	nfsd4: move double-confirm test to open_confirm I don't see the point of having this check in nfs4_preprocess_seqid_op() when it's only needed by the one caller. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-03 05:01:52 -04:00
J. Bruce Fields	77eaae8d44	nfsd4: simplify check_open logic Sometimes the single-exit style is good, sometimes it's unnecessarily convoluted.... Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-02 19:59:29 -04:00
J. Bruce Fields	7a8711c9a6	nfsd4: share common seqid checks Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-02 19:59:24 -04:00
Linus Torvalds	4d7b5a116f	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix ->write_inode return values xfs: fix xfs_mark_inode_dirty during umount xfs: deprecate the nodelaylog mount option	2011-09-02 08:25:23 -07:00
J. Bruce Fields	16d259418b	nfsd4: eliminate unused lt_stateowner This is used only as a local variable. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-01 11:35:30 -04:00
J. Bruce Fields	7c13f344cf	nfsd4: drop most stateowner refcounting Maybe we'll bring it back some day, but we don't have much real use for it now. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-01 11:12:47 -04:00
Christoph Hellwig	58d84c4ee0	xfs: fix ->write_inode return values Currently we always redirty an inode that was attempted to be written out synchronously but has been cleaned by an AIL push internally, which is rather bogus. Fix that by doing the i_update_core check early on and return 0 for it. Also include async calls for it, as doing any work for those is just as pointless. While we're at it also fix the sign for the EIO return in case of a filesystem shutdown, and fix the completely nonsensical locking around xfs_log_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97) Signed-off-by: Alex Elder <aelder@sgi.com>	2011-09-01 09:46:11 -05:00
J. Bruce Fields	fff6ca9cc4	nfsd4: eliminate impossible open replay case If open fails with any error other than nfserr_replay_me, then the main nfsd4_proc_compound() loop continues unconditionally to nfsd4_encode_operation(), which will always call encode_seqid_op_tail. Thus the condition we check for here does not occur. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-01 07:29:01 -04:00
J. Bruce Fields	5ec094c109	nfsd4: extend state lock over seqid replay logic There are currently a couple races in the seqid replay code: a retransmission could come while we're still encoding the original reply, or a new seqid-mutating call could come as we're encoding a replay. So, extend the state lock over the encoding (both encoding of a replayed reply and caching of the original encoded reply). I really hate doing this, and previously added the stateowner reference-counting code to avoid it (which was insufficient)--but I don't see a less complicated alternative at the moment. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-09-01 07:07:59 -04:00
Christoph Hellwig	866e4ed774	xfs: fix xfs_mark_inode_dirty during umount During umount we do not add a dirty inode to the lru and wait for it to become clean first, but force writeback of data and metadata with I_WILL_FREE set. Currently there is no way for XFS to detect that the inode has been redirtied for metadata operations, as we skip the mark_inode_dirty call during teardown. Fix this by setting i_update_core manually in that case, so that the inode gets flushed during inode reclaim. Alternatively we could enable calling mark_inode_dirty for inodes in I_WILL_FREE state, and let the VFS dirty tracking handle this. I decided against this as we will get better I/O patterns from reclaim compared to the synchronous writeout in write_inode_now, and always marking the inode dirty in some way from xfs_mark_inode_dirty is a better safety net in either case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> (cherry picked from commit da6742a5a4cc844a9982fdd936ddb537c0747856) Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-31 17:59:39 -05:00
Linus Torvalds	b79c4f75e4	Merge tag 'for_linus-20110831' of git://github.com/tytso/ext4 * tag 'for_linus-20110831' of git://github.com/tytso/ext4: ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining	2011-08-31 15:08:19 -07:00
J. Bruce Fields	9072d5c66b	nfsd4: cleanup seqid op stateowner usage Now that the replay owner is in the cstate we can remove it from a lot of other individual operations and further simplify nfs4_preprocess_seqid_op(). Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:56:03 -04:00
J. Bruce Fields	f3e4223751	nfsd4: centralize handling of replay owners Set the stateowner associated with a replay in one spot in nfs4_preprocess_seqid_op() and keep it in cstate. This allows removing a few lines of boilerplate from all the nfs4_preprocess_seqid_op() callers. Also turn ENCODE_SEQID_OP_TAIL into a function while we're here. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:56:02 -04:00
J. Bruce Fields	73997dc418	nfsd4: make delegation stateid's seqid start at 1 Thanks to Casey for reminding me that 5661 gives a special meaning to a value of 0 in the stateid's seqid field, so all stateid's should start out with si_generation 1. We were doing that in the open and lock cases for minorversion 1, but not for the delegation stateid, and not for openstateid's with v4.0. It doesn't really matter much for v4.0 or for delegation stateid's (which never get the seqid field incremented), but we may as well do the same for all of them. Reported-by: Casey Bodley <cbodley@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:56:01 -04:00
J. Bruce Fields	81b829655d	nfsd4: simplify stateid generation code, fix wraparound Follow the recommendation from rfc3530bis for stateid generation number wraparound, simplify some code, and fix or remove incorrect comments. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:56:00 -04:00
J. Bruce Fields	b79abaddfe	nfsd4: consolidate lock & open stateid tables There's no reason to have two separate hash tables for open and lock stateid's. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:56:00 -04:00
J. Bruce Fields	5fa0bbb4ee	nfsd4: simplify distinguishing lock & open stateid's The trick free_stateid is using is a little cheesy, and we'll have more uses for this field later. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:55:59 -04:00
J. Bruce Fields	c2d8eb7ac6	nfsd4: remove typoed replay field Wow, I wonder how long that typo's been there. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:55:58 -04:00
J. Bruce Fields	b7d7ca3580	nfsd4: fix off-by-one-error in SEQUENCE reply The values here represent highest slotid numbers. Since slotid's are numbered starting from zero, the highest should be one less than the number of slots. Reported-by: Rick Macklem <rmacklem@uoguelph.ca> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 17:55:57 -04:00
Jiaying Zhang	8c0bec2151	ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining The i_mutex lock and flush_completed_IO() added by commit `2581fdc810` in ext4_evict_inode() cause lockdep to complain about potential deadlock in several places. In most/all of these LOCKDEP complaints it looks like it's a false positive, since many of the potential circular locking cases can't take place by the time the ext4_evict_inode() is called; but since at the very least it may mask real problems, we need to address this. This change removes the flush_completed_IO() and i_mutex lock in ext4_evict_inode(). Instead, we take a different approach to resolve the software lockup that commit `2581fdc810` intends to fix. Rather than having the ext4-dio-unwritten thread wait to grab the i_mutex lock of an inode, we use mutex_trylock() instead, and simply requeue the work item if we fail to grab the inode's i_mutex lock. This should speed up work queue processing in general and also prevents the following deadlock scenario: During page fault, shrink_icache_memory is called that in turn evicts another inode B. Inode B has some pending io_end work so it calls ext4_ioend_wait() that waits for inode B's i_ioend_count to become zero. However, inode B's ioend work was queued behind some of inode A's ioend work on the same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten thread on that cpu is processing inode A's ioend work, it tries to grab inode A's i_mutex lock. Since the i_mutex lock of inode A is still held from before the page fault happened, we enter a deadlock. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-08-31 11:50:51 -04:00
J. Bruce Fields	c152292f9e	nfsd: remove include/linux/nfsd/syscall.h We don't need this any more. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-31 11:50:11 -04:00
J. Bruce Fields	3cc9fda40a	nfsd4: remove redundant is_open_owner check When called with OPEN_STATE, preprocess_seqid_op only returns an open stateid, hence only an open owner. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:29 -04:00
J. Bruce Fields	b34f27aa5d	nfsd4: get lock checks out of preprocess_seqid_op We've got some lock-specific code here in nfs4_preprocess_seqid_op which is only used by nfsd4_lock(). Move it to the caller. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:28 -04:00
J. Bruce Fields	9afb978400	nfsd4: simplify lock openmode check Note that the special handling for the lock stateid case is already done by nfs4_check_openmode() (as of `0292191417` "nfsd4: fix openmode checking on IO using lock stateid") so we no longer need these two cases in the caller. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:27 -04:00
J. Bruce Fields	a9004abc34	nfsd4: cleanup and consolidate seqid_mutating_err Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:26 -04:00
J. Bruce Fields	28dde241cc	nfsd4: remove HAS_SESSION This flag doesn't really buy us anything. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:25 -04:00
J. Bruce Fields	ff194bd959	nfsd4: cleanup lock/stateowner initialization Share some common code, stop doing silly things like initializing a list head immediately before adding it to a list, etc. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:24 -04:00
J. Bruce Fields	506f275fff	nfsd4: name openowner data structures more clearly These appear to be generic (for both open and lock owners), but they're actually just for open owners. This has confused me more than once. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:23 -04:00
J. Bruce Fields	ddc04c4163	nfsd4: replace some macros by functions For all the usual reasons. (Type safety, readability.) Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:22 -04:00
J. Bruce Fields	3e77246393	nfsd4: stop using nfserr_resource for transitory errors The server is returning nfserr_resource for both permanent errors and for errors (like allocation failures) that might be resolved by retrying later. Save nfserr_resource for the former and use delay/jukebox for the latter. Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:21 -04:00
Boaz Harrosh	6577aac01f	nfsd4: fix failure to end nfsd4 grace period Even if we fail to write a recovery record, we should still mark the client as having acquired its first state. Otherwise we leave 4.1 clients with indefinite ERR_GRACE returns. However, an inability to write stable storage records may cause failures of reboot recovery, and the problem should still be brought to the server administrator's attention. So, make sure the error is logged. These errors shouldn't normally be triggered on a correctly functioning server--this isn't a case where a misconfigured client could spam the logs. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:21 -04:00
J. Bruce Fields	48483bf23a	nfsd4: simplify recovery dir setting Move around some of this code, simplify a bit. Reviewed-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:21:18 -04:00
J. Bruce Fields	8e82fa8fdc	nfsd: prettify NFSD_MAY_* flag definitions Acked-by: Jim Rees <rees@umich.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:20:21 -04:00
J. Bruce Fields	a043226bc1	nfsd4: permit read opens of executable-only files A client that wants to execute a file must be able to read it. Read opens over nfs are therefore implicitly allowed for executable files even when those files are not readable. NFSv2/v3 get this right by using a passed-in NFSD_MAY_OWNER_OVERRIDE on read requests, but NFSv4 has gotten this wrong ever since `dc730e1737` "nfsd4: fix owner-override on open", when we realized that the file owner shouldn't override permissions on non-reclaim NFSv4 opens. So we can't use NFSD_MAY_OWNER_OVERRIDE to tell nfsd_permission to allow reads of executable files. So, do the same thing we do whenever we encounter another weird NFS permission nit: define yet another NFSD_MAY_* flag. The industry's future standardization on 128-bit processors will be motivated primarily by the need for integers with enough bits for all the NFSD_MAY_* flags. Reported-by: Leonardo Borda <leonardoborda@gmail.com> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-27 14:20:20 -04:00
J. Bruce Fields	c10bd39d80	Remove include/linux/nfsd/const.h Userspace shouldn't have a use for these constants. Nothing here is used outside fs/nfsd. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:52 -04:00
J. Bruce Fields	75c096f753	nfsd4: it's OK to return nfserr_symlink The nfsd4 code has a bunch of special exceptions for error returns which map nfserr_symlink to other errors. In fact, the spec makes it clear that nfserr_symlink is to be preferred over less specific errors where possible. The patch that introduced it back in 2.6.4 is "kNFSd: correct symlink related error returns.", which claims that these special exceptions represent an NFSv4 break from v2/v3 tradition--when in fact the symlink error was introduced with v4. I suspect what happened was pynfs tests were written that were overly faithful to the (known-incomplete) rfc3530 error return lists, and then code was fixed up mindlessly to make the tests pass. Delete these unnecessary exceptions. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:50 -04:00
J. Bruce Fields	e281d81009	nfsd4: fix incorrect comment in nfsd4_set_nfs4_acl Zero means "I don't care what kind of file this is". And that's probably what we want--acls are also settable at least on directories, and if the filesystem doesn't want them on other objects, leave it to it to complain. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:49 -04:00
J. Bruce Fields	e10f9e1413	nfsd: clean up nfsd_mode_check() Add some more comments, simplify logic, do & S_IFMT just once, name "type" more helpfully. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:48 -04:00
J. Bruce Fields	7d818a7b8f	nfsd: open-code special directory-hardlink check We allow the fh_verify caller to specify that any object except those of a given type is allowed, by passing a negative type. But only one caller actually uses it. Open-code that check in the one caller. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:47 -04:00
J. Bruce Fields	3d2544b1e4	nfsd4: clean up S_IS -> NF4 file type mapping A slightly unconventional approach to make the code more compact I could live with, but let's give the poor reader some chance. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-26 18:22:47 -04:00
NeilBrown	f5b9409973	All Arch: remove linkage for sys_nfsservctl system call The nfsservctl system call is now gone, so we should remove all linkage for it. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-26 15:09:58 -07:00
Josh Boyer	e096d0c7e2	lockdep: Add helper function for dir vs file i_mutex annotation Purely in-memory filesystems do not use the inode hash as the dcache tells us if an entry already exists. As a result, they do not call unlock_new_inode, and thus directory inodes do not get put into a different lockdep class for i_sem. We need the different lockdep classes, because the locking order for i_mutex is different for directory inodes and regular inodes. Directory inodes can do "readdir()", which takes i_mutex before possibly taking mm->mmap_sem (due to a page fault while copying the directory entry to user space). In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem before accessing i_mutex. The two cases can never happen for the same inode, so no real deadlock can occur, but without the different lockdep classes, lockdep cannot understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this can lead to false positives from lockdep like below: find/645 is trying to acquire lock: (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac but task is already holding lock: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>] vfs_readdir+0x5b/0xb4 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}: [<ffffffff8108ac26>] lock_acquire+0xbf/0x103 [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361 [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45 [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110 [<ffffffff81111557>] mmap_region+0x258/0x432 [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306 [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a [<ffffffff8100c858>] sys_mmap+0x22/0x24 [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b -> #0 (&mm->mmap_sem){++++++}: [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7 [<ffffffff8108ac26>] lock_acquire+0xbf/0x103 [<ffffffff81109541>] might_fault+0x89/0xac [<ffffffff81149cff>] filldir+0x6f/0xc7 [<ffffffff811586ea>] dcache_readdir+0x67/0x205 [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4 [<ffffffff8114a073>] sys_getdents+0x7e/0xd1 [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b This patch moves the directory vs file lockdep annotation into a helper function that can be called by in-memory filesystems and has hugetlbfs call it. Signed-off-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-25 10:50:18 -07:00
Christoph Hellwig	242d621964	xfs: deprecate the nodelaylog mount option Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-25 10:30:05 -05:00
Trond Myklebust	042b60beb4	NFSv4: renewd needs to be able to handle the NFS4ERR_CB_PATH_DOWN error The NFSv4 spec does not specify that the server must repeat that error, so in order to avoid having the delegations revoked, we should handle it immediately. Also note that NFS4ERR_CB_PATH_DOWN does in fact renew the lease... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-24 15:07:37 -04:00
Trond Myklebust	2f60ea6b8c	NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation RFC3530 states that if the client holds a delegation, then it is obliged to continue to send RENEW calls once every lease period in order to allow the server to return NFS4ERR_CB_PATH_DOWN if the callback path is unreachable. This is not required for NFSv4.1, since the server can at any time set the SEQ4_STATUS_CB_PATH_DOWN_SESSION in any SEQUENCE operation. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-24 15:07:37 -04:00
Trond Myklebust	8534d4ec05	NFSv4: nfs4_proc_renew should be declared static Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-24 15:07:37 -04:00
Trond Myklebust	b569ad3492	NFSv4: nfs4_proc_async_renew should use a GFP_NOFS allocation We shouldn't allow the renew daemon to do direct reclaim on the NFS partition. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-08-24 15:07:35 -04:00
Linus Torvalds	051732bcbe	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message fuse: mark pages accessed when written to fuse: delete dead .write_begin and .write_end aops fuse: fix flock fuse: fix non-ANSI void function notation	2011-08-24 09:14:42 -07:00
Miklos Szeredi	c2183d1e9b	fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the message processing could overrun and result in a "kernel BUG at fs/fuse/dev.c:629!" Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@kernel.org	2011-08-24 10:20:17 +02:00
Linus Torvalds	35a177a08d	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix tracing builds inside the source tree xfs: remove subdirectories xfs: don't expect xfs headers to be in subdirectories	2011-08-23 11:41:44 -07:00
Christoph Hellwig	65299a3b78	block: separate priority boosting from REQ_META Add a new REQ_PRIO to let requests preempt others in the cfq I/O scheduler, and leave REQ_META purely for marking requests as metadata in blktrace. All existing callers of REQ_META except for XFS are updated to also set REQ_PRIO for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-23 14:50:29 +02:00
Christoph Hellwig	5dc06c5a70	block: remove READ_META and WRITE_META Replace all occurrences of the undocumented READ_META with READ | REQ_META and remove the unused WRITE_META define. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-23 14:49:55 +02:00
Mikulas Patocka	a406f75840	sysfs: use rb-tree for inode number lookup This patch makes sysfs use red-black tree for inode number lookup. Together with a previous patch to use red-black tree for name lookup, this patch makes all sysfs lookups have O(log n) complexity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 17:43:53 -07:00
Mikulas Patocka	58f2a4c793	sysfs: remove s_sibling hacks s_sibling was used for three different purposes: 1) as a linked list of entries in the directory 2) as a linked list of entries to be deleted 3) as a pointer to "struct completion" This patch removes the hack and introduces new union u which holds pointers for cases 2) and 3). This change is needed for the following patch that removes s_sibling altogether and replaces it with a rb tree. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 17:43:52 -07:00
Mikulas Patocka	4f72c0cab4	sysfs: use rb-tree for name lookups Use red-black tree for name lookups. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 17:43:52 -07:00
Mikulas Patocka	7f9838fd01	sysfs: count subdirectories This patch introduces a subdirectory counter for each sysfs directory. Without the patch, sysfs_refresh_inode would walk all entries of the directory to calculate the number of subdirectories. This patch improves time of "ls -la /sys/block" when there are 10000 block devices from 9 seconds to 0.19 seconds. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 17:43:30 -07:00
Harry Wei	bd33d12fba	debugfs: Fix a comment mistake The file is fs/debugfs/inode.c but the comment says it is file.c. This patch can fix this little mistake. Signed-off-by: Harry Wei <harryxiyou@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-08-22 17:41:48 -07:00
Christoph Hellwig	b6bede3b4c	xfs: fix tracing builds inside the source tree The code really requires the current source directory to be in the header search path. We already do this if building with an object tree separate from the source, but it needs to be added manually if building inside the source. The cflags addition for it accidentally got removed when collapsing the xfs directory structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-08-22 16:37:24 -05:00
Noah Watkins	259a187ade	ceph: fix memory leak kfree does not clean up indirect allocations in ceph_fs_client and ceph_options (e.g. snapdir_name). Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-08-22 13:06:59 -07:00
Josef Bacik	6719db6a23	Btrfs: fix 64 bit divide problem This fixes a regression introduced by commit `cdcb725c05` ("Btrfs: check if there is enough space for balancing smarter"). We can't do 64-bit divides on 32-bit architectures. In cases where we need to divide/multiply by 2 we should just right/left shift respectively, and in cases where there are N devices use do_div. Also make the counters u64 to match up with rw_devices. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Acked-and-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-21 07:02:00 -07:00
Linus Torvalds	c063d8a60f	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: flush any pending end_io requests before DIO reads w/dioread_nolock ext4: fix nomblk_io_submit option so it correctly converts uninit blocks ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN. ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode ext4: Fix ext4_should_writeback_data() for no-journal mode	2011-08-21 06:59:41 -07:00
Jiaying Zhang	dccaf33fa3	ext4: flush any pending end_io requests before DIO reads w/dioread_nolock There is a race between ext4 buffer write and direct_IO read with dioread_nolock mount option enabled. The problem is that we clear PageWriteback flag during end_io time but will do uninitialized-to-initialized extent conversion later with dioread_nolock. If an O_direct read request comes in during this period, ext4 will return zero instead of the recently written data. This patch checks whether there are any pending uninitialized-to-initialized extent conversion requests before doing O_direct read to close the race. Note that this is just a bandaid fix. The fundamental issue is that we clear PageWriteback flag before we really complete an IO, which is problem-prone. To fix the fundamental issue, we may need to implement an extent tree cache that we can use to look up pending to-be-converted extents. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-08-19 19:13:32 -04:00
Eric Dumazet	11fd165c68	sunrpc: use better NUMA affinities Use NUMA aware allocations to reduce latencies and increase throughput. sunrpc kthreads can use kthread_create_on_node() if pool_mode is "percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can also take into account NUMA node affinity for memory allocations. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: "J. Bruce Fields" <bfields@fieldses.org> CC: Neil Brown <neilb@suse.de> CC: David Miller <davem@davemloft.net> Reviewed-by: Greg Banks <gnb@fastmail.fm> [bfields@redhat.com: fix up caller nfs41_callback_up] Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-19 13:25:36 -04:00
J. Bruce Fields	c1f24ef4ed	locks: setlease cleanup There's an incorrect comment here. Also clean up the logic: the "rdlease" and "wrlease" locals are confusingly named, and don't really add anything since we can make a decision as soon as we hit one of these cases. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-19 13:25:35 -04:00
J. Bruce Fields	778fc546f7	locks: fix tracking of inprogress lease breaks We currently use a bit in fl_flags to record whether a lease is being broken, and set fl_type to the type (RDLCK or UNLCK) that it will eventually have. This means that once the lease break starts, we forget what the lease's type used to be. Breaking a read lease will then result in blocking read opens, even though there's no conflict--because the lease type is now F_UNLCK and we can no longer tell whether it was previously a read or write lease. So, instead keep fl_type as the original type (the type which we enforce), and keep track of whether we're unlocking or merely downgrading by replacing the single FL_INPROGRESS flag by FL_UNLOCK_PENDING and FL_DOWNGRADE_PENDING flags. To get this right we also need to track separate downgrade and break times, to handle the case where a write-leased file gets conflicting opens first for read, then later for write. (I first considered just eliminating the downgrade behavior completely--nfsv4 doesn't need it, and nobody as far as I can tell actually uses it currently--but Jeremy Allison tells me that Windows oplocks do behave this way, so Samba will probably use this some day.) Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-19 13:25:34 -04:00
J. Bruce Fields	710b721696	locks: move F_INPROGRESS from fl_type to fl_flags field F_INPROGRESS isn't exposed to userspace. To me it makes more sense in fl_flags.... Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-19 13:25:34 -04:00
J. Bruce Fields	ab83fa4b49	locks: minor lease cleanup Use a helper function, to simplify upcoming changes. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-08-19 13:25:33 -04:00

24450 Commits