OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Nikolay Borisov	c31683a6ef	btrfs: Remove fs_info argument from __remove_from_free_space_tree This function takes a transaction handle which holds a reference to fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	e581168d1f	btrfs: Remove fs_info argument from remove_free_space_extent This function takes a transaction handle which already has a reference to the fs_info. Use it and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	5cb1782213	btrfs: Remove fs_info argument from add_free_space_extent This function always takes a transaction handle which references the fs_info structure. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	85a7ef130c	btrfs: Remove fs_info argument from modify_free_space_bitmap This function already takes a transaction which has a reference to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	690d76828a	btrfs: Remove fs_info argument from update_free_space_extent_count This function already takes a transaction handle which has a reference to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	5296c2bf51	btrfs: Remove fs_info parameter from convert_free_space_to_extents This function always takes a transaction handle which contains a reference to fs_info. So use that and kill the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	719fb4de55	btrfs: Remove fs_info argument from convert_free_space_to_bitmaps This function already takes a transaction handle which contains a reference to fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	f3f7277995	btrfs: Remove fs_info parameter from remove_block_group_free_space This function always takes a trans handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	4457c1c702	btrfs: Remove fs_info argument from add_new_free_space This function also takes a btrfs_block_group_cache which contains a referene to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	66afee1848	btrfs: Remove fs_info parameter from add_new_free_space_info This function already takes trans handle from where fs_info can be referenced. Remove the redundant parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	2d5cffa1b0	btrfs: Remove fs_info argument from __add_to_free_space_tree This function already takes a transaction handle which contains a reference to fs_info. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	9a7e0f9284	btrfs: Remove fs_info argument from __add_block_group_free_space This function already takes a transaction handle which has a reference to the fs_info. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	e4e0711cd9	btrfs: Remove fs_info argument from add_block_group_free_space We also pass in a transaction handle which has a reference to the fs_info. Just remove the extraneous argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	483bce068e	btrfs: Make btrfs_init_dummy_trans initialize trans' fs_info field This will be necessary for future cleanups which remove the fs_info argument from some freespace tree functions. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	7c8a0d363a	btrfs: Add assert in __btrfs_del_delalloc_inode The invariant is that when nr_delalloc_inodes is 0 then the root mustn't have any inodes on its delalloc inodes list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Robbie Ko	0f96f517dc	btrfs: incremental send, improve rmdir performance for large directory Currently when checking if a directory can be deleted, we always check if all its children have been processed. Example: A directory with 2,000,000 files was deleted original: 1994m57.071s patch: 1m38.554s [FIX] Instead of checking all children on all calls to can_rmdir(), we keep track of the directory index offset of the child last checked in the last call to can_rmdir(), and then use it as the starting point for future calls to can_rmdir(). Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Robbie Ko	35c8eda12f	btrfs: incremental send, move allocation until it's needed in orphan_dir_info Move the allocation after the search when it's clear that the new entry will be added. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	2335efafa6	btrfs: split delayed ref head initialization and addition add_delayed_ref_head really performed 2 independent operations - initialisting the ref head and adding it to a list. Now that the init part is in a separate function let's complete the separation between both operations. This results in a lot simpler interface for add_delayed_ref_head since the function now deals solely with either adding the newly initialised delayed ref head or merging it into an existing delayed ref head. This results in vastly simplified function signature since 5 arguments are dropped. The only other thing worth mentioning is that due to this split the WARN_ON catching reinit of existing. In this patch the condition is extended such that: qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved is added. This is done because the two qgroup_* prefixed member are set only if both ref_root and reserved are passed. So functionally it's equivalent to the old WARN_ON and allows to remove the two args from add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	eb86ec73b9	btrfs: Use init_delayed_ref_head in add_delayed_ref_head Use the newly introduced function when initialising the head_ref in add_delayed_ref_head. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	a2e569b3f2	btrfs: Introduce init_delayed_ref_head add_delayed_ref_head implements the logic to both initialize a head_ref structure as well as perform the necessary operations to add it to the delayed ref machinery. This has resulted in a very cumebrsome interface with loads of parameters and code, which at first glance, looks very unwieldy. Begin untangling it by first extracting the initialization only code in its own function. It's more or less verbatim copy of the first part of add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	cd7f9699b1	btrfs: Open-code add_delayed_data_ref Now that the initialization part and the critical section code have been split it's a lot easier to open code add_delayed_data_ref. Do so in the following manner: 1. The common init function is put immediately after memory-to-be-initialized is allocated, followed by the specific data ref initialization. 2. The only piece of code that remains in the critical section is insert_delayed_ref call. 3. Tracing and memory freeing code is moved outside of the critical section. No functional changes, just an overall shorter critical section. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	70d640004a	btrfs: Open-code add_delayed_tree_ref Now that the initialization part and the critical section code have been split it's a lot easier to open code add_delayed_tree_ref. Do so in the following manner: 1. The comming init code is put immediately after memory-to-be-initialized is allocated, followed by the ref-specific member initialization. 2. The only piece of code that remains in the critical section is insert_delayed_ref call. 3. Tracing and memory freeing code is put outside of the critical section as well. The only real change here is an overall shorter critical section when dealing with delayed tree refs. From functional point of view - the code is unchanged. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	c812c8a857	btrfs: Use init_delayed_ref_common in add_delayed_data_ref Use the newly introduced helper and remove the duplicate code. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	646f4dd76f	btrfs: Use init_delayed_ref_common in add_delayed_tree_ref Use the newly introduced common helper. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Nikolay Borisov	cb49a87b2a	btrfs: Factor out common delayed refs init code THe majority of the init code for struct btrfs_delayed_ref_node is duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out the common bits in init_delayed_ref_common. This function is going to be used in future patches to clean that up. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Chengguang Xu	891f41cb27	btrfs: return original error code when failing from option parsing It's not good to overwrite -ENOMEM using -EINVAL when failing from mount option parsing, so just return original error code. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
David Sterba	6fcf6e2bff	btrfs: remove redundant btrfs_balance_control::fs_info The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Qu Wenruo	c9f6f3cd1c	btrfs: qgroup: Allow trace_btrfs_qgroup_account_extent() to record its transid When debugging quota rescan race, some times btrfs rescan could account some old (committed) leaf and then re-account newly committed leaf in next generation. This race needs extra transid to locate, so add @transid for trace_btrfs_qgroup_account_extent() for such debug. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Colin Ian King	f5686e3acd	btrfs: send: fix spelling mistake: "send_in_progres" -> "send_in_progress" Trivial fix to spelling mistake of function name in btrfs_err message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	63a9c7b9ce	btrfs: Remove devid parameter from btrfs_rmap_block This function is used in only one place and devid argument is always passed 0. So just remove it, similarly to how it was removed in the userspace code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Qu Wenruo	8b317901da	btrfs: trace: Allow trace_qgroup_update_counters() to record old rfer/excl value Origin trace_qgroup_update_counters() only records qgroup id and its reference count change. It's good enough to debug qgroup accounting change, but when rescan race is involved, it's pretty hard to distinguish which modification belongs to which rescan. So add old_rfer and old_excl trace output to help distinguishing different rescan instance. (Different rescan instance should reset its qgroup->rfer to 0) For trace event parameter, it just changes from u64 qgroup_id to struct btrfs_qgroup *qgroup, so number of parameters is not changed at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	3a2f8c07e1	btrfs: Unexport btrfs_alloc_delalloc_work It's used only in inode.c so makes no sense to have it exported. Also move the definition of btrfs_delalloc_work to inode.c since it's used only this file. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	076da91cd9	btrfs: Remove delayed_iput member from btrfs_delalloc_work When allocating a delalloc work we are always setting the delayed_iput to 0. So remove the delay_iput member of btrfs_delalloc_work, as a result also remove it as a parameter from btrfs_alloc_delalloc_work since it's not used anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	4fbb514785	btrfs: Remove delay_iput parameter from __start_delalloc_inodes It's always set to 0 so remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ rename to start_delalloc_inodes ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Nikolay Borisov	76f32e240e	btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes It's always set to 0, so just remove it and collapse the constant value to the only function we are passing it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Nikolay Borisov	82b3e53b8d	btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots This parameter was introduced alongside the function in `eb73c1b7ce` ("Btrfs: introduce per-subvolume delalloc inode list") to avoid deadlocks since this function was used in the transaction commit path. However, commit `8d875f95da` ("btrfs: disable strict file flushes for renames and truncates") removed that usage, rendering the parameter obsolete. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Gu Jinxiang	0338dff6e0	btrfs: do reverse path readahead in btrfs_shrink_device In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to READA_FORWARD. But I think READA_BACK is correct. Since: 1. key.offset is set to (u64)-1 2. after btrfs_search_slot, btrfs_previous_item is called So, for readahead previous items, READA_BACK is the correct one. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Qu Wenruo	4ed0a7a3b7	btrfs: trace: Add trace points for unused block groups This patch will add the following trace events: 1) btrfs_remove_block_group For btrfs_remove_block_group() function. Triggered when a block group is really removed. 2) btrfs_add_unused_block_group Triggered which block group is added to unused_bgs list. 3) btrfs_skip_unused_block_group Triggered which unused block group is not deleted. These trace events is pretty handy to debug case related to block group auto remove. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Qu Wenruo	3dca5c942d	btrfs: trace: Remove unnecessary fs_info parameter for btrfs__reserve_extent event class fs_info can be extracted from btrfs_block_group_cache, and all btrfs_block_group_cache is created by btrfs_create_block_group_cache() with fs_info initialized, no need to worry about NULL pointer dereference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Gu Jinxiang	9113493e3a	btrfs: remove unused fs_info parameter Since the commit `c6100a4b4e` ("Btrfs: replace tree->mapping with tree->private_data"), parameter fs_info in alloc_reloc_control is not used. So remove it. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	f9fbcaa2a3	btrfs: move btrfs_raid_mindev_errorvalues to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::mindev_error so that btrfs_raid_array can maintain the error code to return if the minimum number of devices condition is not met while trying to delete a device in the given raid. And so we can drop btrfs_raid_mindev_error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	41a6e8913c	btrfs: move btrfs_raid_group values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	ed23467b18	btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::raid_name so that btrfs_raid_array can maintain the name of the raid type, and so we can drop btrfs_raid_type_names. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Qu Wenruo	b545993694	btrfs: print-tree: Add eb locking status output for debug build It's pretty handy if we can get the debug output for locking status of an extent buffer, specially for race condition related debugging. So add the following output for btrfs_print_tree() and btrfs_print_leaf(): - refs - write_locks (as w:%d) - read_locks (as r:%d) - blocking_writers (as bw:%d) - blocking_readers (as br:%d) - spinning_writers (as sw:%d) - spinning_readers (as sr:%d) - lock_owner - current->pid Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	833aae18fc	btrfs: open code set_balance_control The helper is quite simple and I'd like to see the locking in the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	1354e1a13e	btrfs: use mutex in btrfs_resume_balance_async While the spinlock does not cause problems, using the mutex is more correct and consistent with others. The global status of balance is eg. checked from btrfs_pause_balance or btrfs_cancel_balance with mutex. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists, in the latter, balance cannot run until the ro/rw transition is finished. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	008ef0969d	btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	cf7d20f447	btrfs: move and comment read-only check in btrfs_cancel_balance Balance cannot be started on a read-only filesystem and will have to finish/exit before eg. going to read-only via remount. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway and if the cancel call is too fast it will just wait for that to happen. The last case is when the balance is paused after mount but it's read-only and cancelling would want to delete the item. The test is moved after the check if balance is running at all, as it looks more logical to report "no balance running" instead of "read-only filesystem". Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	3009a62f3b	btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	dccdb07bc9	btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit `5ac00addc7` ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	a0fecc2371	btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	149196a2ae	btrfs: cleanup helpers that reset balance state The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	eee95e3fb0	btrfs: add sanity check when resuming balance after mount Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	010a47bde9	btrfs: add proper safety check before resuming dev-replace The device replace is paused by unmount or read only remount, and resumed on next mount or write remount. The exclusive status should be checked properly as it's a global invariant and we must not allow 2 operations run. In this case, the balance can be also paused and resumed under same conditions. It's always checked first so dev-replace could see the EXCL_OP already taken, BUT, the ioctl would never let start both at the same time. Replace the WARN_ON with message and return 0, indicating no error as this is purely theoretical and the user will be informed. Resolving that manually should be possible by waiting for the other operation to finish or cancel the paused state. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	a17c95df4c	btrfs: move clearing of EXCL_OP out of __cancel_balance Make the clearning visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	72b81abf95	btrfs: move volume_mutex to callers of btrfs_rm_device Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	d48f39d5a5	btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static The function logically belongs there and there's only a single caller, no need to export it. No code changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	a425f9d475	btrfs: export and rename free_device The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
David Sterba	6fc4749d25	btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
David Sterba	00251a527a	btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller The function is called once and is fairly small, we can merge it with the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	b518519713	btrfs: cleanup btrfs_rm_device() promote fs_devices pointer This function uses fs_info::fs_devices number of time, however we declare and use it only at the end, instead do it in the beginning of the function and use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	636d2c9d63	btrfs: cleanup find_device() drop list_head pointer find_device() declares struct list_head *head pointer and used only once, instead just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	897fb5734a	btrfs: rename __btrfs_open_devices to open_fs_devices __btrfs_open_devices() is un-exported drop __ prefix and rename it to open_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	0226e0eb65	btrfs: rename __btrfs_close_devices to close_fs_devices __btrfs_close_devices() is un-exported, drop the __ prefix and rename it to close_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	f117e290e8	btrfs: cleanup __btrfs_open_devices() drop head pointer __btrfs_open_devices() declares struct list_head *head, however head is used only once, instead use btrfs_fs_devices::devices directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	c4babc5e38	btrfs: rename struct btrfs_fs_devices::list btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic name 'list' makes it's search very difficult, rename it to fs_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Nikolay Borisov	be97f133b3	btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	f033798d12	btrfs: Drop fs_info parameter from add_delayed_data_ref It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	1acda0c289	btrfs: Drop add_delayed_ref_head fs_info parameter It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	40012f96b6	btrfs: Remove btrfs_wait_and_free_delalloc_work This function is called from only 1 place and is effectively a wrapper over wait_completion/kfree. It doesn't really bring any value having those two calls in a separate function. Just open code it and remove it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	8ae225a8a4	btrfs: Remove tree argument from extent_writepages It can be directly referenced from the passed address_space so do that. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	81f1d39035	btrfs: Use list_empty instead of list_empty_careful list_empty_careful usually is a signal of something tricky going on. Its usage in btrfs is actually not needed since both lists it's used on are local to a function and cannot be modified concurrently. So switch to plain list_empty. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	2a3ff0adc9	btrfs: Remove redundant tree argument from extent_readpages This function is called only from btrfs_readpage and is already passed the mapping. Simplify its signature by moving the code obtaining reference to the extent tree in the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	29c68b2de9	btrfs: Remove map argument from try_release_extent_state It's not used in the function so just remove it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	477a30ba5f	btrfs: Sink extent_tree arguments in try_release_extent_mapping This function already gets the page from which the two extent trees are referenced. Simplify its signature by moving the code getting the trees inside the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Misono Tomohiro	a79a464d56	btrfs: Allow rmdir(2) to delete an empty subvolume Change the behavior of rmdir(2) and allow it to delete an empty subvolume by using btrfs_delete_subvolume() which is used by btrfs_ioctl_snap_destroy(). This is a change in behaviour and has been requested by users. Deleting the subvolume by ioctl requires root permissions while the rmdir way does works with standard tools and syscalls for all users that can access the subvolume. The main usecase is to allow 'rm -rf /path/with/subvols' to simply work. We were not able to find any nasty usability surprises, the intention is to do the destructive rm. Without allowing rmdir, this would have to be followed by the ioctl subvolume deletion, which is more of an annoyance. Implementation details: The required lock for @dir and inode of @dentry is already acquired in vfs layer. We need some check before deleting a subvolume. Permission check is done in vfs layer, emptiness check is in btrfs_rmdir() and additional check (i.e. neither the subvolume is a default subvolume nor send is in progress) is in btrfs_delete_subvolume(). Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs layer later. Tested-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Misono Tomohiro	f60a2364a4	btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() Factor out the second half of btrfs_ioctl_snap_destroy() as btrfs_delete_subvolume(), which performs some subvolume specific checks before deletion: 1. send is not in progress 2. the subvolume is not the default subvolume 3. the subvolume does not contain other subvolumes and actual deletion process. btrfs_delete_subvolume() requires inode_lock for both @dir and inode of @dentry. The remaining part of btrfs_ioctl_snap_destroy() is mainly permission checks. Note that call of d_delete() is not included in btrfs_delete_subvolume() as this function will also be used by btrfs_rmdir() to delete an empty subvolume and in that case d_delete() is called in VFS layer. As a result, btrfs_unlink_subvol() and may_destroy_subvol() become static functions. No functional changes. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Misono Tomohiro	ec42f16734	btrfs: Move may_destroy_subvol() from ioctl.c to inode.c This is a preparation work to refactor btrfs_ioctl_snap_destroy() and to allow rmdir(2) to delete an empty subvolume. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of the function comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	3b079a919a	btrfs: remove unused le_test_bit() With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap conversion"), there are no more callers to le_test_bit(). This patch removes le_test_bit(). Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	a565971ff3	btrfs: optimize free space tree bitmap conversion Presently, convert_free_space_to_extents() does a linear scan of the bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). Testing shows a 20-33% decrease in execution time for convert_free_space_to_extents(). Since we change bitmap to be unsigned long, we have to do some casting for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing bit operations. Everywhere else, we're just using it for pointer arithmetic and not directly accessing it, so char seems more appropriate. Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	6faa8f475e	btrfs: clean up le_bitmap_{set, clear}() le_bitmap_set() is only used by free-space-tree, so move it there and make it static. le_bitmap_clear() is not used, so remove it. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
David Sterba	f46b24c945	btrfs: use fs_info for btrfs_handle_em_exist tracepoint We really want to know to which filesystem the extent map events belong, but as it cannot be reached from the extent_map pointers, we need to pass it down the callchain. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:17 +02:00
David Sterba	0e08eb9b1c	btrfs: tests: pass fs_info to extent_map tests Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can get a better tracepoint message. Extent maps do not need fs_info for anything so we only add a dummy one without any other initialization. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:17 +02:00
Nikolay Borisov	57f1642ec3	btrfs: Consolidate error checking for btrfs_alloc_chunk The second if is really a subcase of ret being less than 0. So introduce a generic if (ret < 0) check, and inside have another if which explicitly handles the -ENOSPC and any other errors. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:16 +02:00
Nikolay Borisov	1e7a14211b	btrfs: Fix lock release order Locks should generally be released in the oppposite order they are acquired. Generally lock acquisiton ordering is used to ensure deadlocks don't happen. However, as becomes more complicated it's best to also maintain proper unlock order so as to avoid possible dead locks. This was found by code inspection and doesn't necessarily lead to a deadlock scenario. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:16 +02:00
Nikolay Borisov	b25f0d0012	btrfs: Use while loop instead of labels in __endio_write_update_ordered Currently __endio_write_update_ordered uses labels to implement what is essentially a simple while loop. This makes the code more cumbersome to follow than it actually has to be. No functional changes. No xfstest regressions were found during testing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:15 +02:00
Anand Jain	89595e80de	btrfs: add comment about BTRFS_FS_EXCL_OP Adds comments about BTRFS_FS_EXCL_OP to existing comments about the device locks. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:15 +02:00
Nikolay Borisov	41d0bd3b5e	btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq It's used to print its pointer in a debug statement but doesn't really bring any useful information to the error message. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Su Yue	c065f5b1cf	btrfs: rename btrfs_get_block_group_info and make it static The function btrfs_get_block_group_info() was introduced by the commit `5af3e8cce8` ("Btrfs: make filesystem read-only when submitting barrier fails") which used it in disk-io.c. However, the function is only called in ioctl.c now. Its parameter type btrfs_ioctl_space_info* is only for ioctl. So, make it static and rename it to be original name get_block_group_info. No functional change. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Nikolay Borisov	29d2b84cf9	btrfs: Replace owner argument in add_pinned_bytes with a boolean add_pinned_bytes really cares whether the bytes being pinned are either data or metadata. To that effect it checks whether the 'owner' argument is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because owner can really have 2 types of values: a) For metadata extents it holds the level at which the parent is in the btree. This amounts to owner having the values 0-7 b) In case of modifying data extents, owner is the inode number to which those extents belongs. Let's make this more explicit byt converting the owner parameter to a boolean value and either pass it directly when we know the type of extents we are working with (i.e. in btrfs_free_tree_block). In cases when the parent function can be called on both metadata/data extents perform the check in the caller. This hopefully makes the interface of add_pinned_bytes more intuitive. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Linus Torvalds	d7b66b4ab0	for-4.17-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsG/ecACgkQxWXV+ddt WDvr3w/8D12pwR9sPcEwxD4pvoLv7LP1VRQy2u+ivSifdBD7MueKh3y0igUMyARR LERsK0zUsTQGkkC6c7ZYd4cT9PikPpXtO1P9iATFAKqR/YMDIV/haSqT8DwbI/qb 7F+ZMeTy1LzL01YlYBrGVDxP8AWVO2Dml6JolYxzplILSLvdPH6G8xOSjei/p9sm RK5ERHJENEI0l/cThpiLoAEWjzciPtR39T5Hq45onHyCs3bjJCcx51/QE8sBsl8x +BKvCmL40UKd30YKudJZYDM6NgMgWENhfTtIZQIInv99sMNCxIgTEUdX8ExdyjRZ 24rst/BuQz4d8r/8zqE/hdFsHRGWwnEiYmGWylanPY5KdQ41ULfXC06xuoNOLoW8 KQwD8SWv+W5vEJW0UQz5cb3vUgv5RnUzPvcmMfSztLeo2K4zj6zCK5L6XJwIJNbM 1AJR7R4TRkQdf5QEeziFl738Yv1AgsPQuKSiiFa9YwXMLU8dYXlx14ioUzBL8MLe 1wZPJ03x/N7eKJ0g6OIAAVfUTFFejv4Z2B2IDoObuLLsPwTdK6tS+9tJ5mos7ngG Vf1ZVmhmeJdw1qwK8ROzAJHkK807KgGO7LWmA7tIVLwWuZX14F7xLQIg3Ux3MhIh NhoBTFy2AGmdE0hFYv/4FA5dnUOU4VTVYVw3QUV4DMc0XIodZrE= =iYyx -----END PGP SIGNATURE----- Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A one-liner that prevents leaking an internal error value 1 out of the ftruncate syscall. This has been observed in practice. The steps to reproduce make a common pattern (open/write/fync/ftruncate) but also need the application to not check only for negative values and happens only for compressed inlined files. The conditions are narrow but as this could break userspace I think it's better to merge it now and not wait for the merge window" * tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix error handling in btrfs_truncate()	2018-05-24 11:47:43 -07:00
Omar Sandoval	d50147381a	Btrfs: fix error handling in btrfs_truncate() Jun Wu at Facebook reported that an internal service was seeing a return value of 1 from ftruncate() on Btrfs in some cases. This is coming from the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items(). btrfs_truncate() uses two variables for error handling, ret and err. When btrfs_truncate_inode_items() returns non-zero, we set err to the return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we only set err if ret is an error (i.e., negative). To reproduce the issue: mount a filesystem with -o compress-force=zstd and the following program will encounter return value of 1 from ftruncate: int main(void) { char buf[256] = { 0 }; int ret; int fd; fd = open("test", O_CREAT \| O_WRONLY \| O_TRUNC, 0666); if (fd == -1) { perror("open"); return EXIT_FAILURE; } if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); close(fd); return EXIT_FAILURE; } if (fsync(fd) == -1) { perror("fsync"); close(fd); return EXIT_FAILURE; } ret = ftruncate(fd, 128); if (ret) { printf("ftruncate() returned %d\n", ret); close(fd); return EXIT_FAILURE; } close(fd); return EXIT_SUCCESS; } Fixes: `ddfae63cc8` ("btrfs: move btrfs_truncate_block out of trans handle") CC: stable@vger.kernel.org # 4.15+ Reported-by: Jun Wu <quark@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-24 11:56:57 +02:00
Linus Torvalds	5997aab0a1	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes all over the place" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: fix io_destroy(2) vs. lookup_ioctx() race ext2: fix a block leak nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed unfuck sysfs_mount() kernfs: deal with kernfs_fill_super() failures cramfs: Fix IS_ENABLED typo befs_lookup(): use d_splice_alias() affs_lookup: switch to d_splice_alias() affs_lookup(): close a race with affs_remove_link() fix breakage caused by d_find_alias() semantics change fs: don't scan the inode cache before SB_BORN is set do d_instantiate/unlock_new_inode combinations safely iov_iter: fix memory leak in pipe_get_pages_alloc() iov_iter: fix return type of __pipe_get_pages()	2018-05-21 11:54:57 -07:00
Linus Torvalds	e5e03ad9e0	for-4.17-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsBhWUACgkQxWXV+ddt WDsl5g/7BwC4g1BICBG5SXKG9e/s0gjf/3xh7XI8g9kYYu3NktH4fWqDNNncgKtQ LL4WTcFhYJ+Cx/wkgPoYHfR9CKN2dR038S1OneKz+nhP/dTXw1MnSmfNP4kECqSQ vwdeDKwlO0Qsy2PdSLjLk8/Yn43wNleBI0swEF+5q7AQgv7XW9hk1oCpwJ7gjDw2 Ymb3WVlj/V0QhnZRnQEgnRwK4xLiOBszb6C+fxQDjtismWDz12dY3udl6Co18YTW DnmH3x9qBXpL7D2S/6AtZcafbrfgSeL6PXTlcb1fLHK1HwZbdUAerUrVVlV2aitC rHbg+pD0X1uzDCt2iZ7MROnZv/gLbU9OSz1foE9pw8xU9J5zbsvLlBSK4P0mdEzI MaZzqB3H31cUSZJq/BUdGnFAOIykcOEvscn000p/cy7szv+GpWb08rTqvVgZvSM2 ai1qADU7ACaWdFjJUqbOi3zWyT6AcGONwjfSIaa/y3DyGzVX3UyJxeIuvznPS2Yt 17B4GRIbF1xPbNRRBw7N60E8o4p8t+BMftStMCBSl8zxnjd6RPOCluOH/az6tL+H hmY/nGvJZCj3Y6SeLGiKXdNH9MFkhcvEIvePFkUt3AEEHcCtdG5RebrAvVSpO+N4 1SUAE1y8Cbco/KYMjlERpIzZIKOBkD/EnSBIXTI9mIVC0op6mII= =bhlK -----END PGP SIGNATURE----- Merge tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've accumulated some fixes during the last week, some of them were in the works for a longer time but there are some newer ones too. Most of the fixes have a reproducer and fix user visible problems, also candidates for stable kernels. They IMHO qualify for a late rc, though I did not expect that many" * tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix crash when trying to resume balance without the resume flag btrfs: Fix delalloc inodes invalidation during transaction abort btrfs: Split btrfs_del_delalloc_inode into 2 functions btrfs: fix reading stale metadata blocks after degraded raid1 mounts btrfs: property: Set incompat flag if lzo/zstd compression is set Btrfs: fix duplicate extents after fsync of file with prealloc extents Btrfs: fix xattr loss after power failure Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting	2018-05-20 12:04:27 -07:00
Anand Jain	02ee654d3a	btrfs: fix crash when trying to resume balance without the resume flag We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:38:24 +02:00
Nikolay Borisov	fe816d0f1d	btrfs: Fix delalloc inodes invalidation during transaction abort When a transaction is aborted btrfs_cleanup_transaction is called to cleanup all the various in-flight bits and pieces which migth be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boiled down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially lead inodes/their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered. This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason. This patch aims to rectify the sitaution by doing the following: 1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode. 2. Since calling btrfs_releasepage might queue delayed iputs move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run. Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: stable@vger.kernel.org # 4.14.x Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment to igrab ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:38:18 +02:00
Nikolay Borisov	2b87733134	btrfs: Split btrfs_del_delalloc_inode into 2 functions This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:18:26 +02:00
Liu Bo	02a3307aa9	btrfs: fix reading stale metadata blocks after degraded raid1 mounts If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e. btrfs_search_slot() read_block_for_search() # hold parent and its lock, go to read child btrfs_release_path() read_tree_block() # read child Unfortunately, the parent lock got released before reading child, so commit `5bdd3536cb` ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid is different with the generation id of the child that it reads out from disk. A simple PoC is included in btrfs/124, 0. A two-disk raid1 btrfs, 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. 2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet. 3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list. 4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem. 5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A. 6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk. To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (`581c176041`). The fix is to replace 0 by 'gen'. Fixes: `5bdd3536cb` ("Btrfs: Fix block generation verification race") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:18:25 +02:00
Misono Tomohiro	1a63c198dd	btrfs: property: Set incompat flag if lzo/zstd compression is set Incompat flag of LZO/ZSTD compression should be set at: 1. mount time (-o compress/compress-force) 2. when defrag is done 3. when property is set Currently 3. is missing and this commit adds this. This could lead to a filesystem that uses ZSTD but is not marked as such. If a kernel without a ZSTD support encounteres a ZSTD compressed extent, it will handle that but this could be confusing to the user. Typically the filesystem is mounted with the ZSTD option, but the discrepancy can arise when a filesystem is never mounted with ZSTD and then the property on some file is set (and some new extents are written). A simple mount with -o compress=zstd will fix that up on an unpatched kernel. Same goes for LZO, but this has been around for a very long time (2.6.37) so it's unlikely that a pre-LZO kernel would be used. Fixes: `5c1aab1dd5` ("btrfs: Add zstd support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add user visible impact ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:18:25 +02:00
Filipe Manana	31d11b83b9	Btrfs: fix duplicate extents after fsync of file with prealloc extents In commit `471d557afe` ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), on fsync, we started to always log all prealloc extents beyond an inode's i_size in order to avoid losing them after a power failure. However under some cases this can lead to the log replay code to create duplicate extent items, with different lengths, in the extent tree. That happens because, as of that commit, we can now log extent items based on extent maps that are not on the "modified" list of extent maps of the inode's extent map tree. Logging extent items based on extent maps is used during the fast fsync path to save time and for this to work reliably it requires that the extent maps are not merged with other adjacent extent maps - having the extent maps in the list of modified extents gives such guarantee. Consider the following example, captured during a long run of fsstress, which illustrates this problem. We have inode 271, in the filesystem tree (root 5), for which all of the following operations and discussion apply to. A buffered write starts at offset 312391 with a length of 933471 bytes (end offset at 1245862). At this point we have, for this inode, the following extent maps with the their field values: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 376832, block_start 1106399232, block_len 376832, orig_block_len 376832 em C, start 417792, orig_start 417792, len 782336, block_start 18446744073709551613, block_len 0, orig_block_len 0 em D, start 1200128, orig_start 1200128, len 835584, block_start 1106776064, block_len 835584, orig_block_len 835584 em E, start 2035712, orig_start 2035712, len 245760, block_start 1107611648, block_len 245760, orig_block_len 245760 Extent map A corresponds to a hole and extent maps D and E correspond to preallocated extents. Extent map D ends where extent map E begins (1106776064 + 835584 = 1107611648), but these extent maps were not merged because they are in the inode's list of modified extent maps. An fsync against this inode is made, which triggers the fast path (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback of the data previously written using buffered IO, and when the respective ordered extent finishes, btrfs_drop_extents() is called against the (aligned) range 311296..1249279. This causes a split of extent map D at btrfs_drop_extent_cache(), replacing extent map D with a new extent map D', also added to the list of modified extents, with the following values: em D', start 1249280, orig_start of 1200128, block_start 1106825216 (= 1106776064 + 1249280 - 1200128), orig_block_len 835584, block_len 786432 (835584 - (1249280 - 1200128)) Then, during the fast fsync, btrfs_log_changed_extents() is called and extent maps D' and E are removed from the list of modified extents. The flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged clear_em_logging() is called on each of them, and that makes extent map E to be merged with extent map D' (try_merge_map()), resulting in D' being deleted and E adjusted to: em E, start 1249280, orig_start 1200128, len `1032192`, block_start 1106825216, block_len `1032192`, orig_block_len 245760 A direct IO write at offset 1847296 and length of 360448 bytes (end offset at 2207744) starts, and at that moment the following extent maps exist for our inode: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em E (prealloc), start 1249280, orig_start 1200128, len `1032192`, block_start 1106825216, block_len `1032192`, orig_block_len 245760 The dio write results in drop_extent_cache() being called twice. The first time for a range that starts at offset 1847296 and ends at offset 2035711 (length of 188416), which results in a double split of extent map E, replacing it with two new extent maps: em F, start 1249280, orig_start 1200128, block_start 1106825216, block_len 598016, orig_block_len 598016 em G, start 2035712, orig_start 1200128, block_start 1107611648, block_len 245760, orig_block_len `1032192` It also creates a new extent map that represents a part of the requested IO (through create_io_em()): em H, start 1847296, len 188416, block_start 1107423232, block_len 188416 The second call to drop_extent_cache() has a range with a start offset of 2035712 and end offset of 2207743 (length of 172032). This leads to replacing extent map G with a new extent map I with the following values: em I, start 2207744, orig_start 1200128, block_start 1107783680, block_len 73728, orig_block_len `1032192` It also creates a new extent map that represents the second part of the requested IO (through create_io_em()): em J, start 2035712, len 172032, block_start 1107611648, block_len 172032 The dio write set the inode's i_size to 2207744 bytes. After the dio write the inode has the following extent maps: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em F, start 1249280, orig_start 1200128, len 598016, block_start 1106825216, block_len 598016, orig_block_len 598016 em H, start 1847296, orig_start 1200128, len 188416, block_start 1107423232, block_len 188416, orig_block_len 835584 em J, start 2035712, orig_start 2035712, len 172032, block_start 1107611648, block_len 172032, orig_block_len 245760 em I, start 2207744, orig_start 1200128, len 73728, block_start 1107783680, block_len 73728, orig_block_len `1032192` Now do some change to the file, like adding a xattr for example and then fsync it again. This triggers a fast fsync path, and as of commit `471d557afe` ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), we use the extent map I to log a file extent item because it's a prealloc extent and it starts at an offset matching the inode's i_size. However when we log it, we create a file extent item with a value for the disk byte location that is wrong, as can be seen from the following output of "btrfs inspect-internal dump-tree": item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53 generation 22 type 2 (prealloc) prealloc data disk byte 1106776064 nr `1032192` prealloc data offset 1007616 nr 73728 Here the disk byte value corresponds to calculation based on some fields from the extent map I: 1106776064 = block_start (1107783680) - 1007616 (extent_offset) extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616 The disk byte value of 1106776064 clashes with disk byte values of the file extent items at offsets 1249280 and 1847296 in the fs tree: item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1106776064 nr 835584 prealloc data offset 49152 nr 598016 item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1106776064 nr 835584 extent data offset 647168 nr 188416 ram 835584 extent compression 0 (none) item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1107611648 nr 245760 extent data offset 0 nr 172032 ram 245760 extent compression 0 (none) item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1107611648 nr 245760 prealloc data offset 172032 nr 73728 Instead of the disk byte value of 1106776064, the value of 1107611648 should have been logged. Also the data offset value should have been 172032 and not 1007616. After a log replay we end up getting two extent items in the extent tree with different lengths, one of 835584, which is correct and existed before the log replay, and another one of `1032192` which is wrong and is based on the logged file extent item: item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53 refs 2 gen 15 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 2 item 13 key (1106776064 EXTENT_ITEM `1032192`) itemoff 3353 itemsize 53 refs 1 gen 22 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 1 Obviously this leads to many problems and a filesystem check reports many errors: (...) checking extents Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1 extent item 1106776064 has multiple extent items ref mismatch on [1106776064 835584] extent item 2, found 3 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680 Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70 Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192 backpointer mismatch on [1106776064 835584] checking free space cache block group 1103101952 has wrong amount of free space failed to load free space cache for block group 1103101952 checking fs roots (...) So fix this by logging the prealloc extents beyond the inode's i_size based on searches in the subvolume tree instead of the extent maps. Fixes: `471d557afe` ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-17 14:18:19 +02:00

1 2 3 4 5 ...

7086 Commits