OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	f7b0069317	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has our collection of bug fixes. I missed the last rc because I thought our patches were making NFS crash during my xfs test runs. Turns out it was an NFS client bug fixed by someone else while I tried to bisect it. All of these fixes are small, but some are fairly high impact. The biggest are fixes for our mount -o remount handling, a deadlock due to GFP_KERNEL allocations in readdir, and a RAID10 error handling bug. This was tested against both 3.3 and Linus' master as of this morning." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits) Btrfs: reduce lock contention during extent insertion Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir Btrfs: Fix space checking during fs resize Btrfs: fix block_rsv and space_info lock ordering Btrfs: Prevent root_list corruption Btrfs: fix repair code for RAID10 Btrfs: do not start delalloc inodes during sync Btrfs: fix that check_int_data mount option was ignored Btrfs: don't count CRC or header errors twice while scrubbing Btrfs: fix btrfs_ioctl_dev_info() crash on missing device btrfs: don't return EINTR Btrfs: double unlock bug in error handling Btrfs: always store the mirror we read the eb from fs/btrfs/volumes.c: add missing free_fs_devices btrfs: fix early abort in 'remount' Btrfs: fix max chunk size check in chunk allocator Btrfs: add missing read locks in backref.c Btrfs: don't call free_extent_buffer twice in iterate_irefs Btrfs: Make free_ipath() deal gracefully with NULL pointers Btrfs: avoid possible use-after-free in clear_extent_bit() ...	2012-04-28 09:30:07 -07:00
Stefan Behrens	25cd999e1a	Btrfs: fix that check_int_data mount option was ignored The bitfield member mount_opt was too small by one bit to hold the mount option that enabled to include data extents in the integrity checker. Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR option was added (git rebase silently merges so that the increase of the size of the bitfield member is lost), the bit limit was removed entirely. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-04-18 19:22:38 +02:00
Al Viro	6ed3cf2cdf	btrfs: btrfs_root_readonly() broken on big-endian ->root_flags is __le64 and all accesses to it go through the helpers that do proper conversions. Except for btrfs_root_readonly(), which checks bit 0 as in host-endian... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-04-13 11:54:32 -04:00
Chris Mason	1c691b330a	Merge branch 'for-chris' of git://github.com/idryomov/btrfs-unstable into for-linus	2012-03-28 20:32:46 -04:00
Chris Mason	1d4284bd6e	Merge branch 'error-handling' into for-linus Conflicts: fs/btrfs/ctree.c fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/inode.c fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-28 20:31:37 -04:00
Stefan Behrens	94598ba8d8	Btrfs: introduce common define for max number of mirrors Readahead already has a define for the max number of mirrors. Scrub needs such a define now, the rest of the code will need something like this soon. Therefore the define was added to ctree.h and removed from the readahead code. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-27 14:21:26 -04:00
Ilya Dryomov	0c460c0d70	Btrfs: move alloc_profile_is_valid() to volumes.c Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	e8920a640b	Btrfs: make profile_is_valid() check more strict "0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	899c81eac8	Btrfs: add wrappers for working with alloc profiles Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:16 +03:00
Chris Mason	cfed81a04e	Btrfs: add the ability to cache a pointer into the eb This cuts down on the CPU time used by map_private_extent_buffer Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-26 17:04:23 -04:00
Chris Mason	727011e07c	Btrfs: allow metadata blocks larger than the page size A few years ago the btrfs code to support blocks lager than the page size was disabled to fix a few corner cases in the page cache handling. This fixes the code to properly support large metadata blocks again. Since current kernels will crash early and often with larger metadata blocks, this adds an incompat bit so that older kernels can't mount it. This also does away with different blocksizes for nodes and leaves. You get a single block size for all tree blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-26 16:50:37 -04:00
Josef Bacik	81c9ad237c	Btrfs: remove search_start and search_end from find_free_extent and callers We have been passing nothing but (u64)-1 to find_free_extent for search_end in all of the callers, so it's completely useless, and we've always been passing 0 in as search_start, so just remove them as function arguments and move search_start into find_free_extent. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-03-26 14:42:51 -04:00
Jeff Mahoney	49b25e0540	btrfs: enhance transaction abort infrastructure Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:40 +01:00
Jeff Mahoney	4da3511342	btrfs: add varargs to btrfs_error btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:40 +01:00
Jeff Mahoney	2c536799f1	btrfs: btrfs_drop_snapshot should return int Commit `cb1b69f4` (Btrfs: forced readonly when btrfs_drop_snapshot() fails) made btrfs_drop_snapshot return void because there were no callers checking the return value. That is the wrong order to handle error propogation since the caller will have no idea that an error has occured and continue on as if nothing went wrong. Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:36 +01:00
Jeff Mahoney	143bede527	btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:34 +01:00
Jeff Mahoney	b45a9d8b48	btrfs: btrfs_update_root error push-up btrfs_update_root BUG's when it can't alloc a path, yet it can recover from a search error. This patch returns -ENOMEM instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:33 +01:00
Jeff Mahoney	8c34293001	btrfs: Add btrfs_panic() As part of the effort to eliminate BUG_ON as an error handling technique, we need to determine which errors are actual logic errors, which are on-disk corruption, and which are normal runtime errors e.g. -ENOMEM. Annotating these error cases is helpful to understand and report them. This patch adds a btrfs_panic() routine that will either panic or BUG depending on the new -ofatal_errors={panic,bug} mount option. Since there are still so many BUG_ONs, it defaults to BUG for now but I expect that to change once the error handling effort has made significant progress. Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:29 +01:00
Linus Torvalds	855a85f704	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Quoth Chris: "This is later than I wanted because I got backed up running through btrfs bugs from the Oracle QA teams. But they are all bug fixes that we've queued and tested since rc1. Nothing in particular stands out, this just reflects bug fixing and QA done in parallel by all the btrfs developers. The most user visible of these is: Btrfs: clear the extent uptodate bits during parent transid failures Because that helps deal with out of date drives (say an iscsi disk that has gone away and come back). The old code wasn't always properly retrying the other mirror for this type of failure." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits) Btrfs: fix compiler warnings on 32 bit systems Btrfs: increase the global block reserve estimates Btrfs: clear the extent uptodate bits during parent transid failures Btrfs: add extra sanity checks on the path names in btrfs_mksubvol Btrfs: make sure we update latest_bdev Btrfs: improve error handling for btrfs_insert_dir_item callers Btrfs: be less strict on finding next node in clear_extent_bit Btrfs: fix a bug on overcommit stuff Btrfs: kick out redundant stuff in convert_extent_bit Btrfs: skip states when they does not contain bits to clear Btrfs: check return value of lookup_extent_mapping() correctly Btrfs: fix deadlock on page lock when doing auto-defragment Btrfs: fix return value check of extent_io_ops btrfs: honor umask when creating subvol root btrfs: silence warning in raid array setup btrfs: fix structs where bitfields and spinlock/atomic share 8B word btrfs: delalloc for page dirtied out-of-band in fixup worker Btrfs: fix memory leak in load_free_space_cache() btrfs: don't check DUP chunks twice Btrfs: fix trim 0 bytes after a device delete ...	2012-02-24 09:02:53 -08:00
David Sterba	c08782dacd	btrfs: fix structs where bitfields and spinlock/atomic share 8B word On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has disaterous effect. https://lkml.org/lkml/2012/2/1/220 Signed-off-by: David Sterba <dsterba@suse.cz>	2012-02-15 16:40:25 +01:00
Linus Torvalds	d65773b22b	Merge branch 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: btrfs: take allocation of ->tree_root into open_ctree() btrfs: let ->s_fs_info point to fs_info, not root... btrfs: consolidate failure exits in btrfs_mount() a bit btrfs: make free_fs_info() call ->kill_sb() unconditional btrfs: merge free_fs_info() calls on fill_super failures btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super() btrfs: make open_ctree() return int btrfs: sanitizing ->fs_info, part 5 btrfs: sanitizing ->fs_info, part 4 btrfs: sanitizing ->fs_info, part 3 btrfs: sanitizing ->fs_info, part 2 btrfs: sanitizing ->fs_info, part 1 btrfs: fix a deadlock in btrfs_scan_one_device() btrfs: fix mount/umount race btrfs: get ->kill_sb() of its own btrfs: preparation to fixing mount/umount race	2012-01-17 15:52:51 -08:00
Chris Mason	c126dea771	Merge branch 'integrity-check-patch-v2' of git://btrfs.giantdisaster.de/git/btrfs into integration Conflicts: fs/btrfs/ctree.h fs/btrfs/super.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-16 15:27:58 -05:00
Chris Mason	9785dbdf26	Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration	2012-01-16 15:26:31 -05:00
Ilya Dryomov	a7e99c691a	Btrfs: allow for canceling restriper Implement an ioctl for canceling restriper. Currently we wait until relocation of the current block group is finished, in future this can be done by triggering a commit. Balance item is deleted and no memory about the interrupted balance is kept. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:49 +02:00
Ilya Dryomov	837d5b6e46	Btrfs: allow for pausing restriper Implement an ioctl for pausing restriper. This pauses the relocation, but balance is still considered to be "in progress": balance item is not deleted, other volume operations cannot be started, etc. If paused in the middle of profile changing operation we will continue making allocations with the target profile. Add a hook to close_ctree() to pause restriper and free its data structures on unmount. (It's safe to unmount when restriper is in "paused" state, we will resume with the same parameters on the next mount) Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:49 +02:00
Ilya Dryomov	9555c6c180	Btrfs: add skip_balance mount option Since restriper kthread starts involuntarily on mount and can suck cpu and memory bandwidth add a mount option to forcefully skip it. The restriper in that case hangs around in paused state and can be resumed from userspace when it's convenient. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:48 +02:00
Ilya Dryomov	0940ebf6b9	Btrfs: save balance parameters to disk Introduce a new btree objectid for storing balance item. The reason is to be able to resume restriper after a crash with the same parameters. Balance item has a very high objectid and goes into tree of tree roots. The key for the new item is as follows: [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ] Older kernels simply ignore it so it's safe to mount with an older kernel and then go back to the newer one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:48 +02:00
Ilya Dryomov	70922617b0	Btrfs: do not reduce profile in do_chunk_alloc() Every caller of do_chunk_alloc() feeds it the reduced allocation profile, so stop trying to reduce it one more time. Instead check the validity of the passed profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:48 +02:00
Ilya Dryomov	c9e9f97bdf	Btrfs: add basic restriper infrastructure Add basic restriper infrastructure: extended balancing ioctl and all related ioctl data structures, add data structure for tracking restriper's state to fs_info, etc. The semantics of the old balancing ioctl are fully preserved. Explicitly disallow any volume operations when balance is in progress. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	a46d11a8b0	Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for avail_{data,metadata,system}_alloc_bits fields, which gather info about available allocation profiles in the FS. When chunk is created or read from disk, its profile is OR'ed with the corresponding avail_alloc_bits field. Since SINGLE is denoted by 0 in the on-disk format, currently there is no way to tell when such chunks become avaialble. Restriper needs that information, so add a separate bit for SINGLE profile. This bit is going to be in-memory only, it should never be written out to disk, so it's not a disk format change. However to avoid remappings in future, reserve corresponding on-disk bit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	52ba692972	Btrfs: introduce masks for chunk type and profile Chunk's type and profile are encoded in u64 flags field. Introduce masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_* constants, it should be ULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	6fef8df1dc	Btrfs: get rid of *_alloc_profile fields {data,metadata,system}_alloc_profile fields have been unused for a long time now. Get rid of them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Al Viro	815745cf3e	btrfs: let ->s_fs_info point to fs_info, not root... the latter can be obtained from the former (by looking as ->tree_root) just as cheaply as we currently are doing the other way round. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-08 19:35:37 -05:00
Arne Jansen	66d7e7f09f	Btrfs: mark delayed refs as for cow Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2011-12-22 16:22:27 +01:00
Jan Schmidt	c7d22a3c3c	Btrfs: added helper btrfs_next_item() btrfs_next_item() makes the btrfs path point to the next item, crossing leaf boundaries if needed. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2011-12-22 16:22:26 +01:00
Stefan Behrens	21adbd5cbb	Btrfs: integrate integrity check module into btrfs This is the last part of the patch series. It modifies the btrfs code to use the integrity check module if configured to do so with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set, the only effective change is that code is added that handles the mount option to activate the integrity check. If the mount option is set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code complains in the log and the mount fails with EINVAL. Add the mount option to activate the usage of the integrity check code. Add invocation of btrfs integrity check code init and cleanup function on mount and umount, respectively. Add hook to call btrfs integrity check code version of submit_bh/submit_bio. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2011-12-21 19:14:17 +01:00
Josef Bacik	22c44fe65a	Btrfs: deal with enospc from dirtying inodes properly Now that we're properly keeping track of delayed inode space we've been getting a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is because a bunch of people call mark_inode_dirty, which is void so we can't return ENOSPC. This needs to be fixed in a few areas 1) file_update_time - this updates the mtime and such when writing to a file, which will call mark_inode_dirty. So copy file_update_time into btrfs so we can call btrfs_dirty_inode directly and return an error if we get one appropriately. 2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't setting ->setattr for symlinks, even though we should have been. This catches one of the cases where we were getting errors in mark_inode_dirty. 3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly instead of mark_inode_dirty. This lets us return errors properly for truncate and chown/anything related to setattr. 4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and print an error if we have one. The only remaining user we can't control for this is touch_atime(), but we don't really want to keep people from walking down the tree if we don't have space to save the atime update, so just complain but don't worry about it. With this patch xfstests 83 complains a handful of times instead of hundreds of times. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-12-15 11:04:21 -05:00
Miao Xie	aa38a711a8	Btrfs: fix deadlock on metadata reservation when evicting a inode When I ran the xfstests, I found the test tasks was blocked on meta-data reservation. By debugging, I found the reason of this bug: start transaction \| v reserve meta-data space \| v flush delay allocation -> iput inode -> evict inode ^ \| \| v wait for delay allocation flush <- reserve meta-data space And besides that, the flush on evicting inode will block the thread, which is reclaiming the memory, and make oom happen easily. Fix this bug by skipping the flush step when evicting inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2011-11-30 18:46:03 +01:00
Josef Bacik	291c7d2f57	Btrfs: wait on caching if we're loading the free space cache We've been hitting panics when running xfstest 13 in a loop for long periods of time. And actually this problem has always existed so we've been hitting these things randomly for a while. Basically what happens is we get a thread coming into the allocator and reading the space cache off of disk and adding the entries to the free space cache as we go. Then we get another thread that comes in and tries to allocate from that block group. Since block_group->cached != BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because if we're doing the old slow way of caching we don't want to hold people up and wait for everything to finish. The problem with this is we could end up discarding the space cache at some arbitrary point in the future, which means we could very well end up allocating space that is either bad, or when the real caching happens it could end up thinking the space isn't in use when it really is and cause all sorts of other problems. The solution is to add a new flag to indicate we are loading the free space cache from disk, and always try to cache the block group if cache->cached != BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else who tries to allocate from the block group will have to wait until it's finished to make sure it completes successfully. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-11-20 07:42:16 -05:00
Liu Bo	f1ebcc74d5	Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush The btrfs snapshotting code requires that once a root has been snapshotted, we don't change it during a commit. But there are two cases to lead to tree corruptions: 1) multi-thread snapshots can commit serveral snapshots in a transaction, and this may change the src root when processing the following pending snapshots, which lead to the former snapshots corruptions; 2) the free inode cache was changing the roots when it root the cache, which lead to corruptions. This fixes things by making sure we force COW the block after we create a snapshot during commiting a transaction, then any changes to the roots will result in COW, and we get all the fs roots and snapshot roots to be consistent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-15 09:53:28 -05:00
Chris Mason	531f4b1ae5	Merge branch 'for-chris' of git://github.com/sensille/linux into integration Conflicts: fs/btrfs/ctree.h Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:05:08 -05:00
Josef Bacik	c06a0e120a	Btrfs: fix delayed insertion reservation We all keep getting those stupid warnings from use_block_rsv when running stress.sh, and it's because the delayed insertion stuff is being stupid. It's not the delayed insertion stuffs fault, it's all just stupid. When marking an inode dirty for oh say updating the time on it, we just do a btrfs_join_transaction, which doesn't reserve any space. This is stupid because we're going to have to have space reserve to make this change, but we do it because it's fast because chances are we're going to call it over and over again and it doesn't matter. Well thanks to the delayed insertion stuff this is mostly the case, so we do actually need to make this reservation. So if trans->bytes_reserved is 0 then try to do a normal reservation. If not return ENOSPC which will make the btrfs_dirty_inode start a proper transaction which will let it do the whole ENOSPC dance and reserve enough space for the delayed insertion to steal the reservation from the transaction. The other stupid thing we do is not reserve space for the inode when writing to the thing. Usually this is ok since we have to update the time so we'd have already done all this work before we get to the endio stuff, so it doesn't matter. But this is stupid because we could write the data after the transaction commits where we changed the mtime of the inode so we have to cow all the way down to the inode anyway. This used to be masked by the delalloc reservation stuff, but because we delay the update it doesn't get masked in this case. So again the delayed insertion stuff bites us in the ass. So if our trans->block_rsv is delalloc, just steal the reservation from the delalloc reserve. Hopefully this won't bite us in the ass, but I've said that before. With this patch stress.sh no longer spits out those stupid warnings (famous last words). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:20 -05:00
Josef Bacik	6d668dda0c	Btrfs: make a delayed_block_rsv for the delayed item insertion I've been hitting warnings in use_block_rsv when running the delayed insertion stuff. It's because we will readjust global block rsv based on what is in use, which means we could end up discarding reservations that are for the delayed insertion stuff. So instead create a seperate block rsv for the delayed insertion stuff. This will also make it easier to debug problems with the delayed insertion reservations since we will know that only the delayed insertion code touches this block_rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:18 -05:00
Chris Mason	af31f5e5b8	Btrfs: add a log of past tree roots This takes some of the free space in the btrfs super block to record information about most of the roots in the last four commits. It also adds a -o recovery to use the root history log when we're not able to read the tree of tree roots, the extent tree root, the device tree root or the csum root. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:15 -05:00
David Sterba	6c41761fc6	btrfs: separate superblock items out of fs_info fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-11-06 03:04:01 -05:00
Chris Mason	e688b7252f	Btrfs: fix extent pinning bugs in the tree log The tree log had two important bugs that could cause corruptions after a crash. Sometimes we were allowing tree log blocks to be reused after the tree log was committed but before the transaction commit was done. This allowed a future metadata write to overwrite the tree log data. It is fixed by adding a new variant of freeing reserved extents that always pins them. Credit goes to Stefan Behrens and Arne Jansen for many many hours spent tracking this bug down. During tree log replay, we do a pass through the tree log and pin all the extents we find. This makes sure the replay code won't go in and use any of those blocks for new allocations during replay. The problem is the free space cache isn't honoring these pinned extents. So the allocator can end up handing them out, leading to all kinds of problems during replay. The fix here is to force any free space cache to load while we pin the extents, and then to make sure we remove the pinned extents from the free space rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>	2011-11-06 03:03:48 -05:00
Josef Bacik	36ba022ac0	Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:59 -04:00
Josef Bacik	5b0e95bf60	Btrfs: inline checksums into the disk free space cache Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of it's checksums and re-write it which will write new checksums which could dirty a blockg roup that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take it's sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:54 -04:00
Josef Bacik	2bf64758fd	Btrfs: allow us to overcommit our enospc reservations One of the things that kills us is the fact that our ENOSPC reservations are horribly over the top in most normal cases. There isn't too much that can be done about this because when we are completely full we really need them to work like this so we don't under reserve. However if there is plenty of unallocated chunks on the disk we can use that to gauge how much we can overcommit. So this patch adds chunk free space accounting so we always know how much unallocated space we have. Then if we fail to make a reservation within our allocated space, check to see if we can overcommit. In the normal flushing case (like with delalloc metadata reservations) we'll take the free space and divide it by 2 if our metadata profile is setup for DUP or any of those, and then divide it by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing case (we really need this reservation now!) we only limit ourselves to half of the free space. This makes this fio test [torrent] filename=torrent-test rw=randwrite size=4g ioengine=sync directory=/mnt/btrfs-test go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB file system. This doesn't seem to break my other enospc tests, but could really use some more testing as this is a super scary change. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:50 -04:00
Josef Bacik	3b16a4e3c3	Btrfs: use the inode's mapping mask for allocating pages Johannes pointed out we were allocating only kernel pages for doing writes, which is kind of a big deal if you are on 32bit and have more than a gig of ram. So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we don't re-enter. Thanks, Reported-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:45 -04:00
Josef Bacik	4a92b1b8d2	Btrfs: stop passing a trans handle all around the reservation code The only thing that we need to have a trans handle for is in reserve_metadata_bytes and thats to know how much flushing we can do. So instead of passing it around, just check current->journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:44 -04:00
Josef Bacik	482e6dc526	Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check If you run xfstest 224 it you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in truncate and evict case we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:38 -04:00
Josef Bacik	07127184ef	Btrfs: reduce the amount of space needed for truncates With btrfs_truncate_inode_items we always return if we have to go to another leaf, which makes us do our reservation again. This means we will only ever modify one leaf at a time, so we only need 1 items worth of slack space. Also, since we are deleting we will not be creating nodes as we go down, if anything we'll be free'ing them as we merge them together, so make a different calculation for truncate which will only have the worst case useage of COW'ing the entire path down to the leaf. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:37 -04:00
Josef Bacik	5e962c7850	Btrfs: kill btrfs_truncate_reserve_metadata Since we've optimized the truncate path, we no longer require this function. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:36 -04:00
Josef Bacik	dabdb6408c	Btrfs: kill unused parts of block_rsv The priority and refill_used flags are not used anymore, and neither is the usage counter, so just remove them from btrfs_block_rsv. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:34 -04:00
Josef Bacik	37be25bcb6	Btrfs: kill the durable block rsv stuff This is confusing code and isn't used by anything anymore, so delete it. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:32 -04:00
Josef Bacik	dba68306f3	Btrfs: kill the orphan space calculation for snapshots This patch kills off the calculation for the amount of space needed for the orphan operations during a snapshot. The thing is we only do snapshots on commit, so any space that is in the block_rsv->freed[] isn't going to be in the new snapshot anyway, so there isn't any reason to require that space to be reserved for the snapshot to occur. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:32 -04:00
Josef Bacik	fb25e9141a	Btrfs: use bytes_may_use for all ENOSPC reservations We have been using bytes_reserved for metadata reservations, which is wrong since we use that to keep track of outstanding reservations from the allocator. This resulted in us doing a lot of silly things to make sure we don't allocate a bunch of metadata chunks since we never had a real view of how much space was actually in use by metadata. This passes Arne's enospc test and xfstests as well as my own enospc tests. Hopefully this will get us moving in the right direction. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:30 -04:00
Arne Jansen	7414a03fbf	btrfs: initial readahead code and prototypes This is the implementation for the generic read ahead framework. To trigger a readahead, btrfs_reada_add must be called. It will start a read ahead for the given range [start, end) on tree root. The returned handle can either be used to wait on the readahead to finish (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach). The read ahead works as follows: On btrfs_reada_add, the root of the tree is inserted into a radix_tree. reada_start_machine will then search for extents to prefetch and trigger some reads. When a read finishes for a node, all contained node/leaf pointers that lie in the given range will also be enqueued. The reads will be triggered in sequential order, thus giving a big win over a naive enumeration. It will also make use of multi-device layouts. Each disk will have its on read pointer and all disks will by utilized in parallel. Also will no two disks read both sides of a mirror simultaneously, as this would waste seeking capacity. Instead both disks will read different parts of the filesystem. Any number of readaheads can be started in parallel. The read order will be determined globally, i.e. 2 parallel readaheads will normally finish faster than the 2 started one after another. Changes v2: - protect root->node by transaction instead of node_lock - fix missed branches: The readahead had a too simple check to determine if a branch from a node should be checked or not. It now also records the upper bound of each node to see if the requested RA range lies within. - use KERN_CONT to debug output, to avoid line breaks - defer reada_start_machine to worker to avoid deadlock Changes v3: - protect root->node by rcu Changes v5: - changed EIO-semantics of reada_tree_block_flagged - remove spin_lock from reada_control and make elems an atomic_t - remove unused read_total from reada_control - kill reada_key_cmp, use btrfs_comp_cpu_keys instead - use kref-style release functions where possible - return struct reada_control * instead of void * from btrfs_reada_add Signed-off-by: Arne Jansen <sensille@gmx.net>	2011-10-02 08:48:44 +02:00
Arne Jansen	90519d66ab	btrfs: state information for readahead Add state information for readahead to btrfs_fs_info and btrfs_device Changes v2: - don't wait in radix_trees - add own set of workers for readahead Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Arne Jansen <sensille@gmx.net>	2011-10-02 08:48:30 +02:00
Chris Mason	81d86e1b70	Merge branch 'btrfs-3.0' into for-linus	2011-08-18 10:38:03 -04:00
Li Zefan	c97c2916e2	Btrfs: use plain page_address() in header fields setget functions We've stopped using highmem for extent buffers. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-16 21:09:15 -04:00
Tsutomu Itoh	cb1b69f450	Btrfs: forced readonly when btrfs_drop_snapshot() fails The filesystem turns readonly instead of returning the error to the caller when detected error in btrfs_drop_snapshot(). and, because the caller doesn't check the error, the function type is changed to 'void'. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-16 21:09:15 -04:00
Linus Torvalds	ed8f37370d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits) Btrfs: don't call writepages from within write_full_page Btrfs: Remove unused variable 'last_index' in file.c Btrfs: clean up for find_first_extent_bit() Btrfs: clean up for wait_extent_bit() Btrfs: clean up for insert_state() Btrfs: remove unused members from struct extent_state Btrfs: clean up code for merging extent maps Btrfs: clean up code for extent_map lookup Btrfs: clean up search_extent_mapping() Btrfs: remove redundant code for dir item lookup Btrfs: make acl functions really no-op if acl is not enabled Btrfs: remove remaining ref-cache code Btrfs: remove a BUG_ON() in btrfs_commit_transaction() Btrfs: use wait_event() Btrfs: check the nodatasum flag when writing compressed files Btrfs: copy string correctly in INO_LOOKUP ioctl Btrfs: don't print the leaf if we had an error btrfs: make btrfs_set_root_node void Btrfs: fix oops while writing data to SSD partitions Btrfs: Protect the readonly flag of block group ... Fix up trivial conflicts (due to acl and writeback cleanups) in - fs/btrfs/acl.c - fs/btrfs/ctree.h - fs/btrfs/extent_io.c	2011-08-02 21:14:05 -10:00
Li Zefan	9b89d95a14	Btrfs: make acl functions really no-op if acl is not enabled So there's no overhead for something we don't use. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:48 -04:00
Mark Fasheh	bf5f32ecb6	btrfs: make btrfs_set_root_node void This is fairly trivial - btrfs_set_root_node() - always returns zero so we can just make it void. All callers ignore the return code now anyway. I also made sure to check that none of the functions that btrfs_set_root_node() calls returns an error that we might have needed to catch and pass back. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:44 -04:00
Li Zefan	b6973aa622	Btrfs: fix readahead in file defrag We passed the wrong value to btrfs_force_ra(). Fix this by changing the argument of btrfs_force_ra() from last_index to nr_page. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-08-01 14:30:42 -04:00
Linus Torvalds	22712200e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors Btrfs: use the commit_root for reading free_space_inode crcs Btrfs: reduce extent_state lock contention for metadata Btrfs: remove lockdep magic from btrfs_next_leaf Btrfs: make a lockdep class for each root Btrfs: switch the btrfs tree locks to reader/writer Btrfs: fix deadlock when throttling transactions Btrfs: stop using highmem for extent_buffers Btrfs: fix BUG_ON() caused by ENOSPC when relocating space Btrfs: tag pages for writeback in sync Btrfs: fix enospc problems with delalloc Btrfs: don't flush delalloc arbitrarily Btrfs: use find_or_create_page instead of grab_cache_page Btrfs: use a worker thread to do caching Btrfs: fix how we merge extent states and deal with cached states Btrfs: use the normal checksumming infrastructure for free space cache Btrfs: serialize flushers in reserve_metadata_bytes Btrfs: do transaction space reservation before joining the transaction Btrfs: try to only do one btrfs_search_slot in do_setxattr	2011-07-27 16:43:52 -07:00
Chris Mason	ff95acb673	Merge branch 'integration' into for-linus	2011-07-27 16:18:13 -04:00
Chris Mason	bd681513fa	Btrfs: switch the btrfs tree locks to reader/writer The btrfs metadata btree is the source of significant lock contention, especially in the root node. This commit changes our locking to use a reader/writer lock. The lock is built on top of rw spinlocks, and it extends the lock tracking to remember if we have a read lock or a write lock when we go to blocking. Atomics count the number of blocking readers or writers at any given time. It removes all of the adaptive spinning from the old code and uses only the spinning/blocking hints inside of btrfs to decide when it should continue spinning. In read heavy workloads this is dramatically faster. In write heavy workloads we're still faster because of less contention on the root node lock. We suffer slightly in dbench because we schedule more often during write locks, but all other benchmarks so far are improved. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:46 -04:00
Josef Bacik	9e0baf60de	Btrfs: fix enospc problems with delalloc So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-27 12:46:44 -04:00
Josef Bacik	bab39bf998	Btrfs: use a worker thread to do caching A user reported a deadlock when copying a bunch of files. This is because they were low on memory and kthreadd got hung up trying to migrate pages for an allocation when starting the caching kthread. The page was locked by the person starting the caching kthread. To fix this we just need to use the async thread stuff so that the threads are already created and we don't have to worry about deadlocks. Thanks, Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-27 12:46:25 -04:00
Christoph Hellwig	4e34e719e4	fs: take the ACL checks to common code Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-25 14:30:23 -04:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Josef Bacik	b26751575a	Btrfs: implement our own ->llseek In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek. Basically for the normal SEEK_*'s we will just defer to the generic helper, and for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest hole or data. Currently this helper doesn't check for delalloc bytes for prealloc space, so for now treat prealloc as data until that is fixed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:56 -04:00
Al Viro	0ee5dc676a	btrfs: kill magical embedded struct superblock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:20 -04:00
Al Viro	7e40145eb1	->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:21 -04:00
Josef Bacik	fdb5effd5c	Btrfs: serialize flushers in reserve_metadata_bytes We keep having problems with early enospc, and that's because our method of making space is inherently racy. The problem is we can have one guy trying to make space for himself, and in the meantime people come in and steal his reservation. In order to stop this we make a waitqueue and put anybody who comes into reserve_metadata_bytes on that waitqueue if somebody is trying to make more space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:48 -04:00
Josef Bacik	b5009945be	Btrfs: do transaction space reservation before joining the transaction We have to do weird things when handling enospc in the transaction joining code. Because we've already joined the transaction we cannot commit the transaction within the reservation code since it will deadlock, so we have to return EAGAIN and then make sure we don't retry too many times. Instead of doing this, just do the reservation the normal way before we join the transaction, that way we can do whatever we want to try and reclaim space, and then if it fails we know for sure we are out of space and we can return ENOSPC. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-07-11 09:58:47 -04:00
Linus Torvalds	1acc9309eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: btrfs: fix oops when doing space balance Btrfs: don't panic if we get an error while balancing V2 btrfs: add missing options displayed in mount output	2011-07-08 23:25:45 -07:00
David Sterba	0942caa373	btrfs: add missing options displayed in mount output There are three missed mount options settable by user which are not currently displayed in mount output. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-07-06 18:46:43 -04:00
Jesper Juhl	46e4edbf7e	Remove unneeded version.h includes from fs/ It was pointed out by 'make versioncheck' that some includes of linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and fs/omfs/file.c). This patch removes them. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-24 08:34:22 -07:00
Linus Torvalds	90a800de0a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: avoid delayed metadata items during commits btrfs: fix uninitialized return value btrfs: fix wrong reservation when doing delayed inode operations btrfs: Remove unused sysfs code btrfs: fix dereference of ERR_PTR value Btrfs: fix relocation races Btrfs: set no_trans_join after trying to expand the transaction Btrfs: protect the pending_snapshots list with trans_lock Btrfs: fix path leakage on subvol deletion Btrfs: drop the delalloc_bytes check in shrink_delalloc Btrfs: check the return value from set_anon_super	2011-06-20 08:58:53 -07:00
Maarten Lankhorst	9fe6a50fb7	btrfs: Remove unused sysfs code Removes code no longer used. The sysfs file itself is kept, because the btrfs developers expressed interest in putting new entries to sysfs. Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 14:54:18 -04:00
Chris Mason	7585717f30	Btrfs: fix relocation races The recent commit to get rid of our trans_mutex introduced some races with block group relocation. The problem is that relocation needs to do some record keeping about each root, and it was relying on the transaction mutex to coordinate things in subtle ways. This fix adds a mutex just for the relocation code and makes sure it doesn't have a big impact on normal operations. The race is really fixed in btrfs_record_root_in_trans, which is where we step back and wait for the relocation code to finish accounting setup. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-17 13:36:58 -04:00
Linus Torvalds	e6ece70732	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits) btrfs: fix uninitialized variable warning btrfs: add helper for fs_info->closing Btrfs: add mount -o inode_cache btrfs: scrub: add explicit plugging btrfs: use btrfs_ino to access inode number Btrfs: don't save the inode cache if we are deleting this root btrfs: false BUG_ON when degraded Btrfs: don't save the inode cache in non-FS roots Btrfs: make sure we don't overflow the free space cache crc page Btrfs: fix uninit variable in the delayed inode code btrfs: scrub: don't reuse bios and pages Btrfs: leave spinning on lookup and map the leaf Btrfs: check for duplicate entries in the free space cache Btrfs: don't try to allocate from a block group that doesn't have enough space Btrfs: don't always do readahead Btrfs: try not to sleep as much when doing slow caching Btrfs: kill BTRFS_I(inode)->block_group Btrfs: don't look at the extent buffer level 3 times in a row Btrfs: map the node block when looking for readahead targets Btrfs: set range_start to the right start in count_range_bits ...	2011-06-05 06:17:23 +09:00
David Sterba	7841cb2898	btrfs: add helper for fs_info->closing wrap checking of filesystem 'closing' flag and fix a few missing memory barriers. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-06-04 08:11:22 -04:00
Chris Mason	4b9465cb9e	Btrfs: add mount -o inode_cache This makes the inode map cache default to off until we fix the overflow problem when the free space crcs don't fit inside a single page. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-06-04 08:03:47 -04:00
Linus Torvalds	36947a7682	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits) Cache xattr security drop check for write v2 fs: block_page_mkwrite should wait for writeback to finish mm: Wait for writeback when grabbing pages to begin a write configfs: remove unnecessary dentry_unhash on rmdir, dir rename fat: remove unnecessary dentry_unhash on rmdir, dir rename hpfs: remove unnecessary dentry_unhash on rmdir, dir rename minix: remove unnecessary dentry_unhash on rmdir, dir rename fuse: remove unnecessary dentry_unhash on rmdir, dir rename coda: remove unnecessary dentry_unhash on rmdir, dir rename afs: remove unnecessary dentry_unhash on rmdir, dir rename affs: remove unnecessary dentry_unhash on rmdir, dir rename 9p: remove unnecessary dentry_unhash on rmdir, dir rename ncpfs: fix rename over directory with dangling references ncpfs: document dentry_unhash usage ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename hostfs: remove unnecessary dentry_unhash on rmdir, dir rename hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename hfs: remove unnecessary dentry_unhash on rmdir, dir rename omfs: remove unnecessary dentry_unhash on rmdir, dir rneame udf: remove unnecessary dentry_unhash from rmdir, dir rename ...	2011-05-28 13:03:41 -07:00
Chris Mason	ff5714cca9	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus Conflicts: fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/free-space-cache.c fs/btrfs/inode.c fs/btrfs/transaction.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-28 07:00:39 -04:00
Christoph Hellwig	aa38572954	fs: pass exact type of data dirties to ->dirty_inode Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not. This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies. Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 07:04:40 -04:00
Chris Mason	4cb5300bc8	Btrfs: add mount -o auto_defrag This will detect small random writes into files and queue the up for an auto defrag process. It isn't well suited to database workloads yet, but works for smaller files such as rpm, sqlite or bdb databases. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-26 17:52:15 -04:00
Chris Mason	d6c0cb379c	Merge branch 'cleanups_and_fixes' into inode_numbers Conflicts: fs/btrfs/tree-log.c fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-23 14:37:47 -04:00
Andi Kleen	0956c798ef	BTRFS: Remove unused node_lock `240f62c875` replaced the node_lock with rcu_read_lock, but forgot to remove the actual lock in the data structure. Remove it here. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-23 13:05:39 -04:00
Josef Bacik	d82a6f1d7e	Btrfs: kill BTRFS_I(inode)->block_group Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-05-23 13:03:12 -04:00
Josef Bacik	fcb80c2aff	Btrfs: fix how we do space reservation for truncate The ceph guys keep running into problems where we have space reserved in our orphan block rsv when freeing it up. This is because they tend to do snapshots alot, so their truncates tend to use a bunch of space, so when we go to do things like update the inode we have to steal reservation space in order to make the reservation happen. This happens because truncate can use as much space as it freaking feels like, but we still have to hold space for removing the orphan item and updating the inode, which will definitely always happen. So in order to fix this we need to split all of the reservation stuf up. So with this patch we have 1) The orphan block reserve which only holds the space for deleting our orphan item when everything is over. 2) The truncate block reserve which gets allocated and used specifically for the space that the truncate will use on a per truncate basis. 3) The transaction will always have 1 item's worth of data reserved so we can update the inode normally. Hopefully this will make the ceph problem go away. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-05-23 13:03:08 -04:00
Josef Bacik	a4abeea41a	Btrfs: kill trans_mutex We use trans_mutex for lots of things, here's a basic list 1) To serialize trans_handles joining the currently running transaction 2) To make sure that no new trans handles are started while we are committing 3) To protect the dead_roots list and the transaction lists Really the serializing trans_handles joining is not too hard, and can really get bogged down in acquiring a reference to the transaction. So replace the trans_mutex with a trans_lock spinlock and use it to do the following 1) Protect fs_info->running_transaction. All trans handles have to do is check this, and then take a reference of the transaction and keep on going. 2) Protect the fs_info->trans_list. This doesn't get used too much, basically it just holds the current transactions, which will usually just be the currently committing transaction and the currently running transaction at most. 3) Protect the dead roots list. This is only ever processed by splicing the list so this is relatively simple. 4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using the trans_mutex before, so this is a pretty straightforward change. 5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over the entirety of the commit we need to have a way to block new people from creating a new transaction while we're doing our work. So we set no_trans_join and in join_transaction we test to see if that is set, and if it is we do a wait_on_commit. 6) Make the transaction use count atomic so we don't need to take locks to modify it when we're dropping references. 7) Add a commit_lock to the transaction to make sure multiple people trying to commit the same transaction don't race and commit at the same time. 8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl trans. I have tested this with xfstests, but obviously it is a pretty hairy change so lots of testing is greatly appreciated. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-05-23 13:00:57 -04:00
Chris Mason	712673339a	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers Conflicts: fs/btrfs/Makefile fs/btrfs/ctree.h fs/btrfs/volumes.h Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-23 06:30:52 -04:00
Chris Mason	945d8962ce	Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers Conflicts: fs/btrfs/extent-tree.c fs/btrfs/free-space-cache.c fs/btrfs/inode.c fs/btrfs/tree-log.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-22 12:33:42 -04:00
Chris Mason	dcc6d07322	Merge branch 'delayed_inode' into inode_numbers Conflicts: fs/btrfs/inode.c fs/btrfs/ioctl.c fs/btrfs/transaction.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-05-22 07:07:01 -04:00

1 2 3 4 5 ...

589 Commits