4823 Commits

Author SHA1 Message Date
Qu Wenruo d810ef2be5 btrfs: qgroup: Add function qgroup_update_refcnt().
This function is used to update refcnt for qgroups.
It is one of the two core functions used in the new qgroup implementation.

This is based on the old update_old/new_refcnt, but provides a unified
logic and behavior.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:24 -07:00
Qu Wenruo c682f9b3c2 btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent()
__btrfs_inc_extent_ref() and __btrfs_free_extent() already have too
many parameters, but three of them can be extracted from the
btrfs_delayed_ref_node struct.

So use the btrfs_delayed_ref_node struct as a single parameter to replace
the bytenr/num_byte/no_quota parameters.

The real objective of this patch is to allow btrfs_qgroup_record_ref()
to get the delayed_ref_node in upcoming qgroup patches.

Other functions calling btrfs_qgroup_record_ref() are not affected since
the rest will only add/sub exclusive extents, where node is not used.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:18 -07:00
Qu Wenruo 9c542136fd btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read.
Use inline functions to do such things, to improve readability.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:13 -07:00
Qu Wenruo c43d160fcd btrfs: delayed-ref: Cleanup the unneeded functions.
Clean up the rb_tree merge/insert/update functions, since we now use a
list instead of an rb_tree.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:09 -07:00
Qu Wenruo c6fc245499 btrfs: delayed-ref: Use list to replace the ref_root in ref_head.
This patch replaces the rbtree used in ref_head with a list.
This has the following advantage:
1) Easier merge logic.
With the new list implementation, we only need to care about merging the
tail ref_node with the new ref_node.
This can be done quite easily at insert time, with no need for a
dedicated merge at run_delayed_refs().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:03 -07:00
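The tail-merge idea described in c6fc245499 can be sketched roughly as follows. This is a minimal illustration, not the kernel code: `RefNode`, its `key` and `ref_mod` fields, and `insert_ref()` are hypothetical simplifications, and the real merge conditions compare more attributes than a single key.

```python
from dataclasses import dataclass


@dataclass
class RefNode:
    # Hypothetical, simplified stand-in for a delayed ref node.
    key: tuple      # identifies what the ref points at (e.g. root, parent)
    ref_mod: int    # net reference count change: +n for adds, -n for drops


def insert_ref(ref_list: list, new: RefNode) -> None:
    """Append `new`, merging it into the tail node when both share a key."""
    if ref_list and ref_list[-1].key == new.key:
        tail = ref_list[-1]
        tail.ref_mod += new.ref_mod
        if tail.ref_mod == 0:
            # An add and a drop cancelled out, so nothing is left to run.
            ref_list.pop()
        return
    ref_list.append(new)


refs = []
insert_ref(refs, RefNode(("root5", 0), +1))
insert_ref(refs, RefNode(("root5", 0), +1))  # merged into the tail
insert_ref(refs, RefNode(("root7", 0), +1))  # different key, appended
insert_ref(refs, RefNode(("root7", 0), -1))  # cancels the tail node
```

Only the tail is inspected on insert, so no separate merge pass over the whole collection is needed later.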
Qu Wenruo 00db646d3f btrfs: backref: Don't merge refs which are not for same block.
The old __merge_refs() in backref.c will even merge refs whose root_ids are
different, which makes qgroup give wrong results.

Fix it by checking ref_for_same_block() before any mode-specific work.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:24:59 -07:00
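The guard described in 00db646d3f amounts to refusing to merge two refs unless they describe the same block. A minimal sketch, assuming a simplified `Ref` with only `root_id` and `parent` (hypothetical fields; the real ref_for_same_block() is not shown in the message and likely compares more):

```python
from dataclasses import dataclass


@dataclass
class Ref:
    # Hypothetical, reduced view of a backref entry.
    root_id: int
    parent: int
    count: int


def same_block(a: Ref, b: Ref) -> bool:
    """Assumed stand-in for ref_for_same_block()."""
    return a.root_id == b.root_id and a.parent == b.parent


def merge_refs(refs):
    """Merge counts only between refs that pass the same-block check."""
    merged = []
    for ref in refs:
        for kept in merged:
            if same_block(kept, ref):
                kept.count += ref.count
                break
        else:
            # No existing entry matches, so keep this ref separate.
            merged.append(Ref(ref.root_id, ref.parent, ref.count))
    return merged


merged = merge_refs([Ref(5, 4096, 1), Ref(7, 4096, 1), Ref(5, 4096, 2)])
```

Refs from roots 5 and 7 stay separate even though they share a parent, which is what per-root qgroup accounting needs.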
Zhao Lei 20b2e3029e btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()
lockdep reports the following warning in testing:
 [25176.843958] =================================
 [25176.844519] [ INFO: inconsistent lock state ]
 [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
 [25176.845591] ---------------------------------
 [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
 [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
 [25176.847838] {SOFTIRQ-ON-W} state was registered at:
 [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
 [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
 [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
 [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
 [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
 [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
 [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
 [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
 [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
 [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
 [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
 [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
 [25176.854935] irq event stamp: 51506
 [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
 [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
 [25176.857746]
 other info that might help us debug this:
 [25176.858845]  Possible unsafe locking scenario:
 [25176.859981]        CPU0
 [25176.860537]        ----
 [25176.861059]   lock(&wr_ctx->wr_lock);
 [25176.861705]   <Interrupt>
 [25176.862272]     lock(&wr_ctx->wr_lock);
 [25176.862881]
  *** DEADLOCK ***

Reason:
 Above warning is caused by:
 Interrupt
 -> bio_endio()
 -> ...
 -> scrub_put_ctx()
 -> scrub_free_ctx() *1
 -> ...
 -> mutex_lock(&wr_ctx->wr_lock);

 scrub_put_ctx() is allowed to be called in end_bio interrupt context, but
 by design it should never call scrub_free_ctx(sctx) in interrupt
 context (*1 above), because btrfs_scrub_dev() holds one additional
 reference on sctx->refs, which means scrub_free_ctx() is only called
 within btrfs_scrub_dev().

 However, the code does not behave as intended, because the free sequence
 in scrub_pending_bio_dec() has a gap.

 Current code:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
 -> scrub_free_ctx()                |
 -----------------------------------+-----------------------------------

 We expected:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
                                    | -> scrub_free_ctx()
 -----------------------------------+-----------------------------------

Fix:
 Move scrub_pending_bio_dec() to a workqueue, so that this function does
 not run in interrupt context.
 Tested by checking the trace log in debug.

Changelog v1->v2:
 Use a workqueue instead of adjusting the function call sequence as in v1,
 because v1 would introduce a bug pointed out by:
 Filipe David Manana <fdmanana@gmail.com>

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:04:52 -07:00
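The two interleavings in the tables of 20b2e3029e differ only in who drops the last reference. A small sketch of that reasoning, assuming two references in total (one held by btrfs_scrub_dev(), one by the in-flight bio); the `run()` helper and the context labels are illustrative only:

```python
def run(order):
    """Return the context that ends up freeing the scrub context."""
    refs = 2  # assumed: one ref for btrfs_scrub_dev(), one for the bio
    for context in order:
        refs -= 1  # models atomic_dec_and_test(&sctx->refs)
        if refs == 0:
            return context  # this caller would run scrub_free_ctx()
    return None


# "Current code": btrfs_scrub_dev() wakes up and drops its ref first.
current = run(["btrfs_scrub_dev", "bio end_io (interrupt)"])

# "We expected": the bio completion drops its ref first.
expected = run(["bio end_io (interrupt)", "btrfs_scrub_dev"])
```

In the first ordering the free, and therefore the mutex_lock(), happens in interrupt context, which is what lockdep flags.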
Mark Fasheh e1d227a42e btrfs: Handle unaligned length in extent_same
The extent-same code rejects requests with an unaligned length. This
poses a problem when we want to dedupe the tail extent of files as we
skip cloning the portion between i_size and the extent boundary.

If we don't clone the entire extent, it won't be deleted. So the
combination of these behaviors winds up giving us worst-case dedupe on
many files.

We can fix this by allowing a length that extends to i_size and
internally aligning it to the end of the block. This is what
btrfs_ioctl_clone() does, so we can just copy that check over.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:50 -07:00
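The length handling described in e1d227a42e can be illustrated with a small sketch. The function names and the 4096-byte block size are assumptions for illustration; the actual check lives in kernel C and is not quoted in the message.

```python
def align_up(value, block_size):
    """Round `value` up to the next multiple of `block_size`."""
    return (value + block_size - 1) // block_size * block_size


def dedupe_length(offset, length, i_size, block_size=4096):
    """Return the block-aligned length to dedupe, or None if rejected."""
    if offset + length == i_size:
        # The request ends exactly at EOF: extend it to the block boundary
        # so the whole tail extent is covered.
        length = align_up(i_size, block_size) - offset
    if length % block_size:
        return None  # unaligned length that does not end at i_size
    return length


tail = dedupe_length(0, 10000, i_size=10000)      # ends at EOF, aligned up
middle = dedupe_length(0, 5000, i_size=10000)     # unaligned, not at EOF
aligned = dedupe_length(4096, 4096, i_size=10000) # already aligned
```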
chandan 070034bdf9 Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag.
max_to_defrag represents the number of pages to defrag rather than the last
page of the file range to be defragged.

Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the
defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag
should actually have the value 4. Instead in the current code we end up
setting it to 6.

Now, this does not (yet) cause an issue since the first part of the while loop
condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control
to flow out of the while loop before any buggy behavior is actually caused. So
the patch just makes sure that max_to_defrag ends up having the right value
rather than fixing a bug. I did run the xfstests suite to make sure that the
code does not regress.

Changelog: v1->v2:
Provide a much more descriptive commit message.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:48 -07:00
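The example in 070034bdf9 is an inclusive-range count. A one-line sketch of the arithmetic (the helper name is hypothetical, and the exact expression used by the old code is not shown in the message):

```python
def pages_to_defrag(first_index, last_index):
    """Number of pages in the inclusive range [first_index, last_index]."""
    return last_index - first_index + 1


count = pages_to_defrag(3, 6)  # blocks 3, 4, 5 and 6
```

The value 6 reported for the old code matches the last index of the range rather than the number of pages in it.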
chandan e4826a5b24 Btrfs: btrfs_defrag_file: Fix ra_index computation.
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster -
1]. So the next read-ahead should start from the page at index 'ra_index
+ cluster' (unless we deemed the extent at 'ra_index + cluster' to be
non-defraggable) rather than from the page at index 'ra_index +
max_cluster'. This patch fixes this. I did run the xfstests suite to make sure
that the code does not regress.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:47 -07:00
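The effect of advancing ra_index by the wrong amount, as described in e4826a5b24, can be seen in a simplified model. This sketch ignores the non-defraggable case and how cluster is actually computed; `readahead_windows()` and its `step` parameter are hypothetical.

```python
def readahead_windows(start, last_index, cluster, step):
    """List the inclusive page ranges covered by successive read-aheads."""
    windows = []
    ra_index = start
    while ra_index <= last_index:
        end = min(ra_index + cluster - 1, last_index)
        windows.append((ra_index, end))
        ra_index += step  # the fix advances by `cluster`, not `max_cluster`
    return windows


# Advancing by cluster (4) covers every page of [0, 9].
fixed = readahead_windows(0, 9, cluster=4, step=4)

# Advancing by a larger max_cluster (8 here) leaves pages 4-7 unread.
buggy = readahead_windows(0, 9, cluster=4, step=8)
```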
Filipe Manana 4617ea3a52 Btrfs: fix necessary chunk tree space calculation when allocating a chunk
When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:

  btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
     btrfs_calc_trans_metadata_size(chunk_root, 1)

That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, with each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary: updating the existing device items does not
split any nodes or leaves, because the items keep the same size after
the update.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:46 -07:00
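A worked version of the formula in 4617ea3a52 may help. The helper bodies below are assumptions based on how these kernel helpers were defined around that time (BTRFS_MAX_LEVEL taken as 8, with the "trans" variant doubling the cost to account for splits); the 16 KiB nodesize and 2 devices are example values only.

```python
BTRFS_MAX_LEVEL = 8  # assumed value of the kernel constant


def calc_trunc_metadata_size(nodesize, num_items):
    # Assumed: COW one path per item, no node/leaf splits.
    return nodesize * BTRFS_MAX_LEVEL * num_items


def calc_trans_metadata_size(nodesize, num_items):
    # Assumed: COW one path per item, every node and leaf may split.
    return nodesize * BTRFS_MAX_LEVEL * 2 * num_items


def chunk_tree_space_needed(nodesize, num_devs):
    """Worst case: update num_devs device items, insert/remove 1 chunk item."""
    return (calc_trunc_metadata_size(nodesize, num_devs)
            + calc_trans_metadata_size(nodesize, 1))


def old_extra_requirement(nodesize, num_devs):
    """The additional bytes the message says were unnecessary."""
    return nodesize * BTRFS_MAX_LEVEL * num_devs


needed = chunk_tree_space_needed(16384, 2)
extra = old_extra_requirement(16384, 2)
```

Under these assumptions the requirement for 2 devices is 512 KiB, and the old code asked for 256 KiB more.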
Filipe Manana 7558c8bc17 Btrfs: don't attach unnecessary extents to transaction on fsync
We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delays the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:44 -07:00
Filipe Manana b659ef0277 Btrfs: avoid syncing log in the fast fsync path when not necessary
Commit 3a8b36f378 ("Btrfs: fix data loss in the fast fsync path") introduced
a performance regression that causes an unnecessary sync of the log
trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
against a file, with no writes or metadata updates to the inode in
between them, and a transaction is committed before the second fsync is
called.

Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
after a sysbench test that measured a 62% decrease of file io
requests per second for that test's workload.

The test is:

  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
  mkfs -t btrfs /dev/sda2
  mount -t btrfs /dev/sda2 /fs/sda2
  cd /fs/sda2
  for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
  sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
    --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
    --file-num=1024 run

A test on kvm guest, running a debug kernel gave me the following results:

Without 3a8b36f378060d:             16.01 reqs/sec
With 3a8b36f378060d:                 3.39 reqs/sec
With 3a8b36f378 and this patch:     16.04 reqs/sec

Reported-by: Huang Ying <ying.huang@intel.com>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:43 -07:00
Chris Mason 1ab818b137 Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2
2015-06-10 07:02:41 -07:00
Filipe Manana 6ca0709756 Btrfs: fix hang during inode eviction due to concurrent readahead
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:

[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
[ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b

This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:

1) readpages() is called while a reference on the inode is held, so
   eviction can not be triggered before readpages() returns. It also
   locks one or more ranges in the inode's io_tree (which is done at
   extent_io.c:__do_contiguous_readpages());

2) readpages() submits several read bios, all with an end io callback
   that runs extent_io.c:end_bio_extent_readpage() and that is executed
   by another task when a bio finishes, namely a work queue
   (fs_info->end_io_workers) worker kthread. This callback unlocks
   the ranges in the inode's io_tree that were previously locked in
   step 1;

3) readpages() returns, the reference on the inode is dropped;

4) One or more of the read bios previously submitted are still not
   complete (their end io callback was not yet invoked or has not
   yet finished execution);

5) Inode eviction is triggered (through an unlink call for example).
   The inode reference count was not incremented before submitting
   the read bios, therefore this is possible;

6) The eviction handler starts executing and enters the loop that
   iterates over all extent states in the inode's io_tree;

7) The loop picks one extent state record and uses its ->start and
   ->end fields, after releasing the inode's io_tree spinlock, to
   call lock_extent_bits() and clear_extent_bit(). The call to lock
   the range [state->start, state->end] blocks because the whole
   range or a part of it was locked by the previous call to
   readpages() and the corresponding end io callback, which unlocks
   the range was not yet executed;

8) The end io callback for the read bio is executed and unlocks the
   range [state->start, state->end] (or a superset of that range).
   And at clear_extent_bit() the extent_state record state is used
   as a second argument to split_state(), which sets state->start to
   a larger value;

9) The task executing the eviction handler is woken up by the task
   executing the bio's end io callback (through clear_state_bit) and
   the eviction handler locks the range
   [old value for state->start, state->end]. Shortly after, when
   calling clear_extent_bit(), it unlocks the range
   [new value for state->start, state->end], so it ends up unlocking
   only part of the range that it locked, leaving an extent state
   record in the io_tree that represents the unlocked subrange;

10) The eviction handler loop, in its next iteration, gets the
    extent_state record for the subrange that it did not unlock in the
    previous step and then tries to lock it, resulting in a hang.

So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.

Many thanks to Zygo Blaxell for reporting the issue on the mailing list,
providing a set of scripts to trigger it, and testing this fix.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:09 -07:00
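Steps 7 to 10 of 6ca0709756 can be modelled with a set of locked page indexes. This is a loose simulation under stated assumptions: the range [0, 7], the new start value 4, and the `evict()` helper are all made up for illustration, and the real io_tree is an rbtree of extent states rather than a set.

```python
def evict(use_cached_bounds):
    """Return the page indexes still locked after one eviction loop pass."""
    locked = set(range(0, 8))       # readpages() locked [0, 7]
    state = {"start": 0, "end": 7}  # extent state record seen by eviction

    # Eviction reads the bounds and blocks trying to lock [0, 7].
    start, end = state["start"], state["end"]

    # The end io callback unlocks the range; the record's start moves up.
    locked.difference_update(range(0, 8))
    state["start"] = 4

    # Eviction wakes up and locks the range it originally asked for.
    locked.update(range(start, end + 1))

    if not use_cached_bounds:
        # Buggy variant: re-read the (now changed) record before unlocking.
        start, end = state["start"], state["end"]
    locked.difference_update(range(start, end + 1))
    return sorted(locked)


leftover_buggy = evict(use_cached_bounds=False)
leftover_fixed = evict(use_cached_bounds=True)
```

The buggy variant leaves [0, 3] locked with nobody left to unlock it, so the next loop iteration would block forever on that subrange.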
Liu Bo 64c043de46 Btrfs: fix up read_tree_block to return proper error
The return value of read_tree_block() can confuse callers as it always
returns NULL for either -ENOMEM or -EIO, so it's likely that callers
parse it to a wrong error, for instance, in btrfs_read_tree_root().

This fixes the above issue.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:08 -07:00
Liu Bo 8635eda91e Btrfs: add missing free_extent_buffer
read_tree_block may take a reference on the 'eb', a following
free_extent_buffer is necessary.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:07 -07:00
Liu Bo 0c304304fe Btrfs: remove csum_bytes_left
After commit 8407f55326
("Btrfs: fix data corruption after fast fsync and writeback error"),
during wait_ordered_extents(), we wait for ordered extent setting
BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
already got checksum information, so we don't need to check
(csum_bytes_left == 0) in the whole logging path.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:06 -07:00
Filipe Manana 39c2d7facc Btrfs: fix -ENOSPC on block group removal
Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often led to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.

While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:

[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left

So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error out if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:05 -07:00
Filipe Manana 4fbcdf6694 Btrfs: fix -ENOSPC when finishing block group creation
While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:

[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.

So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.

Fix this by doing proper reservation in the chunk block reserve.

The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:04 -07:00
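The N tasks versus M chunks argument in 4fbcdf6694 is a check-then-act problem. A minimal sequential simulation, assuming each task needs exactly one unit of system space (the function names and the unit-based accounting are hypothetical):

```python
def without_reservation(free_units, tasks):
    """Return how many tasks hit -ENOSPC in phase 2."""
    # Phase 1: every task checks the shared counter before anyone consumes
    # from it, so all of them see enough space.
    passed = [t for t in range(tasks) if free_units >= 1]
    # Phase 2: the tasks actually consume space while updating the tree.
    failures = 0
    for _ in passed:
        if free_units >= 1:
            free_units -= 1
        else:
            failures += 1
    return failures


def with_reservation(free_units, tasks):
    """Return how many tasks learn in phase 1 that more space is needed."""
    reserved = 0
    shortfalls = 0
    for _ in range(tasks):
        if free_units - reserved >= 1:
            reserved += 1    # space is set aside for this task
        else:
            shortfalls += 1  # can allocate a new system chunk right now
    return shortfalls


late_failures = without_reservation(free_units=3, tasks=5)
early_shortfalls = with_reservation(free_units=3, tasks=5)
```

Both runs find the same N - M shortfall; the difference is that with reservation it is detected in phase 1, while something can still be done about it.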
Josef Bacik 0d2b2372e0 Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap
We should be doing this, it's weird we hadn't been doing this.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:03 -07:00
Omar Sandoval c8d3fe028f Btrfs: show subvol= and subvolid= in /proc/mounts
Now that we're guaranteed to have a meaningful root dentry, we can just
export seq_dentry() and use it in btrfs_show_options(). The subvolume ID
is easy to get and can also be useful, so put that in there, too.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:02 -07:00
Omar Sandoval 05dbe6837b Btrfs: unify subvol= and subvolid= mounting
Currently, mounting a subvolume with subvolid= takes a different code
path than mounting with subvol=. This isn't really a big deal except for
the fact that mounts done with subvolid= or the default subvolume don't
have a dentry that's connected to the dentry tree like in the subvol=
case. To unify the code paths, when given subvolid= or using the default
subvolume ID, translate it into a subvolume name by walking
ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:01 -07:00
Omar Sandoval bb289b7be6 Btrfs: fail on mismatched subvol and subvolid mount options
There's nothing to stop a user from passing both subvol= and subvolid=
to mount, but if they don't refer to the same subvolume, someone is
going to be surprised at some point. Error out on this case, but allow
users to pass in both if they do match (which they could, for example,
get out of /proc/mounts).

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:00 -07:00
Omar Sandoval fa33065950 Btrfs: clean up error handling in mount_subvol()
In preparation for new functionality in mount_subvol(), give it
ownership of subvol_name and tidy up the error paths.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:59 -07:00
Omar Sandoval e6e4dbe894 Btrfs: remove all subvol options before mounting top-level
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'.
But, this means that if the user passes both a subvol and subvolid for
some reason, we won't actually mount the top-level when we recursively
mount. For example, consider:

mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
btrfs subvol create /mnt/subvol1 # subvolid=257
btrfs subvol create /mnt/subvol2 # subvolid=258
umount /mnt
mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt

In the final mount, subvol=/subvol1,subvolid=258 becomes
subvolid=0,subvolid=258, and the last option takes precedence, so we
mount subvol2 and try to look up subvol1 inside of it, which fails.

So, instead, do a thorough scan through the argument list and remove any
subvol= and subvolid= options, then append subvolid=0 to the end. This
implicitly makes subvol= take precedence over subvolid=, but we're about
to add a stricter check for that. This also makes setup_root_args() more
generic, which we'll need soon.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:58 -07:00
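The option rewriting described in e6e4dbe894 can be sketched with plain string handling. This is an illustration only: the function names are reused loosely, the kernel version operates on C strings, and subvolume paths containing commas are not handled here.

```python
import re


def old_setup_root_args(options: str) -> str:
    """Models the old substitution: replace the first subvol= option."""
    return re.sub(r"subvol=[^,]*", "subvolid=0", options, count=1)


def setup_root_args(options: str) -> str:
    """Drop every subvol=/subvolid= option, then append subvolid=0."""
    kept = [opt for opt in options.split(",")
            if opt and not opt.startswith(("subvol=", "subvolid="))]
    kept.append("subvolid=0")
    return ",".join(kept)


old = old_setup_root_args("subvol=/subvol1,subvolid=258")
new = setup_root_args("subvol=/subvol1,subvolid=258")
mixed = setup_root_args("ro,subvol=/subvol1,noatime")
```

With the old substitution the trailing subvolid=258 survives and wins, which is the failure from the example mount command; the new version always ends up with a single subvolid=0.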
Omar Sandoval 773cd04ec1 Btrfs: lock superblock before remounting for rw subvol
Since commit 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options"), when mounting a subvolume read/write when
another subvolume has previously been mounted read-only, we first do a
remount. However, this should be done with the superblock locked, as per
sync_filesystem():

	/*
	 * We need to be protected against the filesystem going from
	 * r/o to r/w or vice versa.
	 */
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

This WARN_ON can easily be hit with:

mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt
btrfs subvol create /mnt/vol1
btrfs subvol create /mnt/vol2
umount /mnt
mount -oro,subvol=/vol1 /dev/vdb /mnt
mount -orw,subvol=/vol2 /dev/vdb /mnt2

Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with different ro/rw options")
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:57 -07:00
Filipe Manana 0f31871f44 Btrfs: wake up extent state waiters on unlock through clear_extent_bits
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
through free_io_failure(), we weren't waking up any tasks waiting for the
extent state's EXTENT_LOCKED bit, leading to a hang.

So make sure clear_extent_bits() ends up waking up any waiters if the
bit EXTENT_LOCKED is supplied by its callers.

Zygo Blaxell was experiencing such hangs at inode eviction time after
file unlinks. Thanks to him for a set of scripts to reproduce the issue.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:56 -07:00
Filipe Manana c152b63efc Btrfs: fix chunk allocation regression leading to transaction abort
With commit 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction
in case device tree has hole") introduced in the kernel 4.1 merge window,
we end up using part of a device hole for which there are already pending
chunks or pinned chunks. Before that commit we didn't use the hole and
would just move on to the next hole in the device.

However when we adjust the start offset for the chunk allocation and we
have pinned chunks, we set it blindly to the end offset of the pinned
chunk we are currently processing, which is dangerous because we can
have a pending chunk that has a start offset that matches the end offset
of our pinned chunk - leading us to a case where we end up getting two
pending chunks that start at the same physical device offset, which makes
us later abort the current transaction with -EEXIST when finishing the
chunk allocation at btrfs_create_pending_block_groups():

[194737.659017] ------------[ cut here ]------------
[194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[194737.662209] BTRFS: Transaction aborted (error -17)
[194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
[194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
[194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
[194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
[194737.687509] Call Trace:
[194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
[194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
[194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
[194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
[194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
[194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
[194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists

The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
btrfs_create_pending_block_groups(), when it attempts to insert a
duplicated device extent item via btrfs_alloc_dev_extent().

This issue was reproducible with fstests generic/038 running in a loop for
several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
Applying Jeff's recent patch titled "btrfs: add missing discards when
unpinning extents with -o discard" makes the issue much easier to reproduce
(usually within 4 to 5 hours), since it pins chunks for longer periods of
time when an unused block group is deleted by the cleaner kthread.

Fix this by making sure that we never adjust the start offset to a lower
value than it currently has.
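
As an illustration only (this is not the kernel change itself), the rule
amounts to clamping: the search start may move forward past a pinned
chunk but never backward. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel code: move the search start past a
 * pinned or pending chunk that ends at chunk_end, but never move it to a
 * lower offset than it already has.
 */
static uint64_t advance_search_start(uint64_t start, uint64_t chunk_end)
{
	return chunk_end > start ? chunk_end : start;
}
```

With this shape, processing a pinned chunk that ends below the current
start leaves the start untouched.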

Fixes: 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:55 -07:00
Sasha Levin 2037a0933b btrfs: use after free when closing devices
__btrfs_close_devices() would call_rcu to free the device, which is racy with
list_for_each_entry() accessing the memory to retrieve the next device on the
list.
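
The hazard is generic to linked lists: once a node has been handed off to
be freed, reading its next pointer is no longer safe. A minimal userspace
sketch follows, with plain free() standing in for call_rcu() and a
hypothetical struct dev_node instead of the btrfs structures. It reads
the successor before releasing the node. In kernel code,
list_for_each_entry_safe() exists for this kind of traversal; whether
this patch uses it is not stated in the message above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device list entry. */
struct dev_node {
	int devid;
	struct dev_node *next;
};

/* Push a new node on the front of the list; returns the new head or NULL. */
static struct dev_node *dev_push(struct dev_node *head, int devid)
{
	struct dev_node *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->devid = devid;
	n->next = head;
	return n;
}

/*
 * Free every node. The successor is read before the node is released, so
 * the loop never touches freed memory. Returns the number of nodes freed.
 */
static int dev_close_all(struct dev_node *head)
{
	int freed = 0;

	while (head) {
		struct dev_node *next = head->next; /* read before free */

		free(head);
		head = next;
		freed++;
	}
	return freed;
}
```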

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
David Sterba 01b810b889 btrfs: make root id query unprivileged
The INO_LOOKUP ioctl can look up the path for a given inode number and is
thus restricted. As a side effect it can find the root id of the
containing subvolume and we're using this in the 'btrfs inspect rootid'
command.

The restriction is unnecessary in case we set the ioctl args
 args::treeid    = 0
 args::objectid  = 256 (BTRFS_FIRST_FREE_OBJECTID)

Then the path will be empty and the treeid is filled with the root id of
the inode on which the ioctl is called. This behaviour is unchanged
after the root restriction is removed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
Filipe Manana 2e6e518335 Btrfs: fix block group ->space_info null pointer dereference
When we create a block group we add it to the rbtree of block groups
before setting its ->space_info field (while it's NULL). This is
problematic since other tasks can access the block group from the
rbtree and attempt to use its ->space_info before it is set by
btrfs_make_block_group().

This can happen for example when a concurrent fitrim ioctl operation
is ongoing, which produces a trace like the following when
CONFIG_DEBUG_PAGEALLOC is set.

[11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
[11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
[11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
[11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
[11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
[11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
[11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
[11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
[11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
[11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
[11509.608179] Stack:
[11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
[11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
[11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
[11509.608179] Call Trace:
[11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
[11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
[11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
[11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
[11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
[11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
[11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
[11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
[11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
[11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179]  RSP <ffff8801b3edf9e8>
[11509.608179] CR2: 0000000000000018
[11509.608179] ---[ end trace 570a5c6769f0e49a ]---

Which corresponds to the following access in fs/btrfs/free-space-cache.c:

  static int do_trimming(struct btrfs_block_group_cache *block_group,
                         u64 *total_trimmed, u64 start, u64 bytes,
                         u64 reserved_start, u64 reserved_bytes,
                         struct btrfs_trim_range *trim_entry)
  {
       struct btrfs_space_info *space_info = block_group->space_info;
  (...)
       spin_lock(&space_info->lock);
       ^^^^^ - block_group->space_info is NULL...

Fix this by ensuring the block group's ->space_info is set before adding
the block group to the rbtree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
Anand Jain 33b97e4327 Btrfs: check error before reporting missing device and add uuid
Report the missing device only when the add is successful;
otherwise it would exit with ENOMEM. Also add the uuid
to the report.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Qu Wenruo 1f6e4b3f9f btrfs: Fix superblock csum type check.
The old csum type check is wrong and can't catch csum_type 1 (not supported).

Fix it to avoid a division by zero on hostile input.
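
A sketch of the kind of bound check involved, under stated assumptions:
the table, the function name and the off-by-one shown in the comment are
illustrative, and whether that was the precise defect in the old check is
not stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative table: only type 0 (a 4-byte checksum) is known here. */
static const uint16_t csum_sizes[] = { 4 };

#define N_CSUM_TYPES (sizeof(csum_sizes) / sizeof(csum_sizes[0]))

/*
 * Return how many checksums fit in `bytes`, or -EINVAL for an unknown
 * type. Valid indexes are 0 .. N_CSUM_TYPES - 1, so the test must be
 * ">=". With ">" instead, type 1 would slip through and read past the
 * end of the table, and the value found there would be used as a divisor.
 */
static long csums_per_block(uint16_t csum_type, uint32_t bytes)
{
	if (csum_type >= N_CSUM_TYPES)
		return -EINVAL;
	return bytes / csum_sizes[csum_type];
}
```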

Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Filipe Manana 619d8c4ef7 Btrfs: incremental send, fix clone operations for compressed extents
Marc reported a problem where the receiving end of an incremental send
was performing clone operations that failed with -EINVAL. This happened
because, unlike for uncompressed extents, we were not checking if the
source clone offset and length, after summing the data offset, falls
within the source file's boundaries.

So make sure we do such checks when attempting to issue clone operations
for compressed extents.
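
For illustration, a boundary check of this kind can be written as below.
This is a hypothetical userspace helper, not the send code; `offset` is
assumed to already include the extent's data offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: a clone of `len` bytes starting at `offset` must
 * lie entirely inside a source file of `file_size` bytes. Written so
 * that offset + len is never computed and therefore cannot overflow.
 */
static bool clone_range_in_bounds(uint64_t offset, uint64_t len,
				  uint64_t file_size)
{
	if (offset > file_size)
		return false;
	return len <= file_size - offset;
}
```

With a 192K source file, a 4K clone at offset 176K passes the check,
while a 32K clone at the same offset does not.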

Problem reproducible with the following steps:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount -o compress /dev/sdc /mnt2

  # Create the file with a single extent of 128K. This creates a metadata file
  # extent item with a data start offset of 0 and a logical length of 128K.
  $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo

  # Now rewrite the range 64K to 176K of our file. This will make the inode's
  # metadata continue to point to the 128K extent we created before, but now
  # with an extent item that points to the extent with a data start offset of
  # 112K and a logical length of 16K.
  # That metadata file extent item is associated with the logical file offset
  # at 176K and covers the logical file range 176K to 192K.
  $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo

  # Now rewrite the range 180K to 192K. This will make the inode's metadata
  # continue to point to the 128K extent we created earlier, with a single
  # extent item that points to it with a start offset of 112K and a logical
  # length of 4K.
  # That metadata file extent item is associated with the logical file offset
  # at 176K and covers the logical file range 176K to 180K.
  $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ touch /mnt/bar
  # Calls the btrfs clone ioctl.
  $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
    -l $((4 * 1024)) /mnt/foo /mnt/bar

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  At subvol /mnt/snap1
  At subvol snap1

  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
  At subvol /mnt/snap2
  At snapshot snap2
  ERROR: failed to clone extents to bar
  Invalid argument

A test case for fstests follows soon.

Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.cz>
Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Christian Engelmayer ab3680dd18 btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation()
Commit 9c8b35b1ba ("btrfs: quota: Automatically update related qgroups or
mark INCONSISTENT flags when assigning/deleting a qgroup relations.")
introduced the allocation of a temporary ulist in function
btrfs_add_qgroup_relation() and added the corresponding cleanup to the out
path. However, the allocation was introduced before the src/dst level check
that directly returns. Fix the possible leakage of the ulist by moving the
allocation after the input validation. Detected by Coverity CID 1295988.
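
The pattern, sketched in userspace C with hypothetical names (malloc()
stands in for the ulist allocation, and the level comparison is an
assumption about what the validation looks like):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the pattern: validate inputs first and
 * allocate the temporary buffer afterwards, so that the early return
 * has nothing to free.
 */
static int add_relation(int src_level, int dst_level)
{
	int *tmp;

	/* Input validation comes before any allocation. */
	if (src_level >= dst_level)
		return -EINVAL;

	tmp = malloc(16 * sizeof(*tmp));
	if (!tmp)
		return -ENOMEM;

	/* ... work that uses tmp would go here ... */

	free(tmp);
	return 0;
}
```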

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Filipe Manana 35c766425a Btrfs: fix mutex unlock without prior lock on space cache truncation
If the call to btrfs_truncate_inode_items() failed and we don't have a block
group, we were unlocking the cache_write_mutex without having locked it (we
do it only if we have a block group).

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
Anand Jain 816fcebe8f Btrfs: log when missing device is created
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba 6d13f5497f btrfs: fix warnings after changes in btrfs_abort_transaction
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, tree_root,
   ^
  CC [M]  fs/btrfs/ioctl.o
fs/btrfs/ioctl.c: In function ‘create_subvol’:
fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, root, PTR_ERR(new_root));

PTR_ERR returns long, but we're really using 'int' for the error codes
everywhere so just set and use the local variable.
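
A userspace imitation of the situation, for illustration only: err_ptr()
and ptr_err() below are local stand-ins for the kernel helpers, and
format_abort_msg() is a hypothetical function. Storing the error in a
local int keeps the "%d" conversion and its argument in agreement.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Userspace imitations of the kernel helpers, for illustration only. */
static inline void *err_ptr(long error) { return (void *)error; }
static inline long ptr_err(const void *ptr) { return (long)ptr; }

/*
 * Format an error carried in a pointer. The long is narrowed to int
 * once, in the local variable, and that int is what "%d" receives.
 * Returns the error code.
 */
static int format_abort_msg(char *buf, size_t len, const void *errptr)
{
	int ret = ptr_err(errptr);

	snprintf(buf, len, "Transaction aborted (error %d)", ret);
	return ret;
}
```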

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba c0d19e2b9a btrfs: add 'cold' compiler annotations to all error handling functions
The annotated functions will be placed into the .text.unlikely section. The
annotation also hints the compiler to move the code out of the hot paths,
and may implicitly mark the if-statement leading to that block as unlikely.

This is a heuristic; the impact on the generated code is not
significant.
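
For readers unfamiliar with the annotation, a small userspace example of
the underlying GCC/Clang attribute follows. The function names are
hypothetical and the sketch says nothing about which btrfs functions
were annotated.

```c
#include <assert.h>
#include <stdio.h>

/*
 * GCC and Clang accept __attribute__((cold)) on functions that are
 * expected to run rarely, such as error handlers.
 */
static __attribute__((cold)) int report_error(const char *what, int err)
{
	fprintf(stderr, "error in %s: %d\n", what, err);
	return err;
}

/* Hot path: only the error branch calls into the cold function. */
static int checked_div(int a, int b, int *out)
{
	if (b == 0)
		return report_error("checked_div", -1);
	*out = a / b;
	return 0;
}
```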

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba 1a9a8a71ed btrfs: report exact callsite where transaction abort occurs
WARN is called from a single location and all bugreports say that's in
super.c __btrfs_abort_transaction. This is slightly confusing as we'd
rather want to know the exact callsite. Although this information is
printed in the syslog below the stacktrace, finding it requires a further
look and we usually see only the headline from WARNING.

Moving the WARN into the macro has to inline some code and increases
code by a few kilobytes:

  text    data     bss     dec     hex filename
835481   20305   14120  869906   d4612 btrfs.ko.before
842883   20305   14120  877308   d62fc btrfs.ko.after

The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
The increase is not small and could lead to worse icache use. The code
is on error/exit paths that can be recognized by compiler as cold and
moved out of the way so the impact is speculated to be low, if
measurable at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba 13028901a4 btrfs: let tree defrag work in SSD mode
A long time ago (2008), defrag was automatic for new b-tree writes, but it
was disabled after performance problems. There was a leftover in
tree-defrag.c that effectively stops any defragmentation on b-trees.
This is a bit unexpected and IMHO undesired. The SSD mode is an
optimization and defrag is supposed to work if the user asks for it.

Related commits:

6702ed490c
Btrfs: Add run time btree defrag, and an ioctl to force btree defrag

e18e4809b1
Btrfs: Add mount -o ssd, which includes optimizations for seek free
storage

b3236e68bf
Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there

9afbb0b752
Btrfs: Disable tree defrag in SSD mode

The last three commits switch the defrag+ssd off/on/off and the last one

3f157a2fd2
Btrfs: Online btree defragmentation fixes

misses the bits from tree-defrag.c to revert to the behaviour introduced
in e18e4809b1.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Filipe Manana 53e489bc8c Btrfs: check pending chunks when shrinking fs to avoid corruption
When we shrink the usable size of a device (its total_bytes), we go over
all the device extent items in the device tree and attempt to relocate
the chunk of any device extent that goes beyond the new usable size for
the device. We do that after setting the new usable size (total_bytes) in
the device object, so that all new allocations (and reallocations) don't
use areas of the device that go beyond the new (shorter) size. However we
were not considering that before setting the new size in the device,
pending chunks might have been created that use device extents that go
beyond the new size, and those device extents are not yet in the device
tree after we search the device tree - they are still attached to the
list of new block groups for some ongoing transaction handle, and they are
only added to the device tree when the transaction handle is ended (via
btrfs_create_pending_block_groups()).

So check for pending chunks with device extents that go beyond the new
size and if any exists, commit the current transaction and repeat the
search in the device tree.
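
The check itself is simple to picture. A sketch with hypothetical types
(the real code works on the pending chunk structures of the transaction,
not on a flat array):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a device extent: [start, start + len). */
struct dev_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Return true if any pending extent reaches past new_size, which in the
 * scheme described above calls for a transaction commit followed by
 * another search of the device tree.
 */
static bool pending_beyond_size(const struct dev_extent *pending, size_t n,
				uint64_t new_size)
{
	for (size_t i = 0; i < n; i++) {
		if (pending[i].start + pending[i].len > new_size)
			return true;
	}
	return false;
}
```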

Not doing this would mean we would return success to user space while
still having extents that go beyond the new size, and later user space
could overwrite those locations on the device while the fs still references
them, causing all sorts of corruption and unexpected events.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Omar Sandoval 64ad6c4889 Btrfs: don't invalidate root dentry when subvolume deletion fails
Since commit bafc9b754f ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:

	# btrfs subvol list /
	ID 257 gen 306 top level 5 path rootvol
	ID 267 gen 334 top level 5 path snap1
	# btrfs subvol get-default /
	ID 267 gen 334 top level 5 path snap1
	# btrfs inspect-internal rootid /
	267
	# mount -o subvol=/ /dev/vda1 /mnt
	# btrfs subvol del /mnt/snap1
	Delete subvolume (no-commit): '/mnt/snap1'
	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
	# findmnt /
	findmnt: can't read /proc/mounts: No such file or directory
	# ls /proc
	#

Markus reported that this same scenario simply led to a kernel oops.

This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.

Cc: <stable@vger.kernel.org>
Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler <mschauler@gmail.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Filipe Manana 8b191a6849 Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.

For example, for the following reproducer where this is needed (provided
by Robbie Ko):

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2

  $ mkdir -p /mnt/data/n1/n2
  $ mkdir /mnt/data/n4
  $ mkdir -p /mnt/data/t6/t7
  $ mkdir /mnt/data/t5
  $ mkdir /mnt/data/t7
  $ mkdir /mnt/data/n4/t2
  $ mkdir /mnt/data/t4
  $ mkdir /mnt/data/t3
  $ mv /mnt/data/t7 /mnt/data/n4/t2
  $ mv /mnt/data/t4 /mnt/data/n4/t2/t7
  $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
  $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
  $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
  $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
  $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
  $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
  $ mv /mnt/data/n4/t2 /mnt/data/n4/n1
  $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
  $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
  $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
  $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
  ERROR: send ioctl failed with -12: Cannot allocate memory

Where the parent snapshot directory hierarchy is the following:

  .                                                        (ino 256)
  |-- data/                                                (ino 257)
        |-- n4/                                            (ino 260)
             |-- t2/                                       (ino 265)
                  |-- t7/                                  (ino 264)
                       |-- t4/                             (ino 266)
                            |-- t5/                        (ino 263)
                                 |-- t6/                   (ino 261)
                                      |-- n1/              (ino 258)
                                      |-- n2/              (ino 259)
                                           |-- t7/         (ino 262)
                                                |-- t3/    (ino 267)

And the send snapshot's directory hierarchy is the following:

  .                                                        (ino 256)
  |-- data/                                                (ino 257)
        |-- n4/                                            (ino 260)
             |-- n1/                                       (ino 258)
                  |-- t2/                                  (ino 265)
                       |-- n2/                             (ino 259)
                       |-- t3/                             (ino 267)
                       |    |-- t7                         (ino 264)
                       |
                       |-- t6/                             (ino 261)
                       |    |-- t4/                        (ino 266)
                       |         |-- t5/                   (ino 263)
                       |
                       |-- t7/                             (ino 262)

While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:

  start inode 264, send progress of 265 for example
  parent of 264 -> 267
  parent of 267 -> 262
  parent of 262 -> 259
  parent of 259 -> 261
  parent of 261 -> 263
  parent of 263 -> 266
  parent of 266 -> 264
    |--> back to first iteration while current path string length
         is <= PATH_MAX, and fail with -ENOMEM otherwise
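
The loop above can be modelled in a few lines of userspace C. This is a
sketch with hypothetical types, not the send code: each step up the
parent chain is charged a fixed number of bytes, and the walk gives up
with -ENOMEM once the limit is exceeded, which is how a parent cycle
surfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical child -> parent edge, keyed by inode number. */
struct parent_edge {
	unsigned long ino;
	unsigned long parent;
};

/* Look up the parent of `ino`, or return -ENOENT if it has no edge. */
static long parent_of(const struct parent_edge *edges, size_t n,
		      unsigned long ino)
{
	for (size_t i = 0; i < n; i++)
		if (edges[i].ino == ino)
			return (long)edges[i].parent;
	return -ENOENT;
}

/*
 * Walk from `ino` up to `root`, counting path components. Each component
 * is charged `comp_len` bytes; once the total exceeds `max_len` the walk
 * fails with -ENOMEM. Returns the depth on success.
 */
static long path_depth(const struct parent_edge *edges, size_t n,
		       unsigned long ino, unsigned long root,
		       size_t comp_len, size_t max_len)
{
	size_t used = 0;
	long depth = 0;

	while (ino != root) {
		long p = parent_of(edges, n, ino);

		if (p < 0)
			return p;
		used += comp_len;
		if (used > max_len)
			return -ENOMEM;
		ino = (unsigned long)p;
		depth++;
	}
	return depth;
}
```

Feeding it the seven parent links listed above, starting at inode 264
with root 256, never reaches the root and ends in -ENOMEM.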

So fix this by checking whether a directory rename needs to be delayed
regardless of whether the current inode has been orphanized.

A test case for fstests follows soon.

Thanks to Robbie Ko for providing a reproducer for this problem.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03 03:10:40 +01:00
Filipe Manana 80aa602756 Btrfs: incremental send, don't delay directory renames unnecessarily
Even though we delay the rename of directories when they become
descendents of other directories that were also renamed in the send
root to prevent infinite path build loops, we were doing it in cases
where this was not needed and was actually harmful resulting in
infinite path build loops as we ended up with a circular dependency
of delayed directory renames.

Consider the following reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2

  $ mkdir /mnt/data
  $ mkdir /mnt/data/n1
  $ mkdir /mnt/data/n1/n2
  $ mkdir /mnt/data/n4
  $ mkdir /mnt/data/n1/n2/p1
  $ mkdir /mnt/data/n1/n2/p1/p2
  $ mkdir /mnt/data/t6
  $ mkdir /mnt/data/t7
  $ mkdir -p /mnt/data/t5/t7
  $ mkdir /mnt/data/t2
  $ mkdir /mnt/data/t4
  $ mkdir -p /mnt/data/t1/t3
  $ mkdir /mnt/data/p1
  $ mv /mnt/data/t1 /mnt/data/p1
  $ mkdir -p /mnt/data/p1/p2
  $ mv /mnt/data/t4 /mnt/data/p1/p2/t1
  $ mv /mnt/data/t5 /mnt/data/n4/t5
  $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2
  $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7
  $ mv /mnt/data/t2 /mnt/data/n4/t1
  $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1
  $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2
  $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1
  $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7
  $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1
  $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5
  $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1
  $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/
  $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2
  $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1
  $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1
  $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3
  $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2
  $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7
  $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1
  $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5
  $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5
  $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2
  $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 | btrfs receive /mnt2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2
  ERROR: send ioctl failed with -12: Cannot allocate memory

This reproducer resulted in an infinite path build loop when building the
path for inode 266 because the following circular dependency of delayed
directory renames was created:

   ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261

Where the notation "X <- Y" means the rename of inode X is delayed by the
rename of inode Y (X will be renamed after Y is renamed). This resulted
in an infinite path build loop of inode 266 because that inode has inode
261 as an ancestor in the send root and inode 261 is in the circular
dependency of delayed renames listed above.
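
To make the "X <- Y" notation concrete, here is a small userspace sketch
that detects such a circular wait. The types and names are hypothetical,
and it assumes each inode waits on at most one other inode; it is not
how the send code tracks delayed renames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical edge: the rename of `ino` waits for `waits_on`. */
struct delayed_rename {
	unsigned long ino;
	unsigned long waits_on;
};

/*
 * Follow the chain of waits starting at `start`. A chain without a cycle
 * uses each edge at most once, so it ends within n hops; if n + 1 hops
 * succeed, some inode was visited twice and the waits are circular.
 */
static bool has_circular_wait(const struct delayed_rename *d, size_t n,
			      unsigned long start)
{
	unsigned long cur = start;

	for (size_t hops = 0; hops <= n; hops++) {
		size_t i;

		for (i = 0; i < n; i++)
			if (d[i].ino == cur)
				break;
		if (i == n)
			return false; /* cur is not delayed: chain ends */
		cur = d[i].waits_on;
	}
	return true;
}
```

Given the five edges from the dependency chain above (272 waits on 261,
261 on 259, 259 on 268, 268 on 267, 267 on 261), a walk starting at 272
reports a circular wait.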

Fix this by not delaying the rename of a directory inode if an ancestor of
the inode in the send root, which has a delayed rename operation, is not
also a descendent of the inode in the parent root.

Thanks to Robbie Ko for sending the reproducer example.
A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03 03:10:20 +01:00
Anand Jain 2421a8cd5f Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain 24bd69cb0f Btrfs: sysfs: add support to add parent for fsid
To support seed sysfs layout and represent seed fsid under
the sprout we need the facility to create fsid under the
specified parent.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain b7c35e81ad Btrfs: sysfs: separate kobject and attribute creation
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain 1d1c1be372 Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain ef1a0daadf Btrfs: sysfs: make btrfs_sysfs_add_device() non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain 0c10e2d482 Btrfs: sysfs: make btrfs_sysfs_add_fsid() non static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain 6c14a1641b Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info
btrfs_kobj_rm_device() does nothing with fs_info, so pass fs_devices instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain 1ba43816af Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info
btrfs_kobj_add_device() does not need fs_info any more.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain 2e3e12815a Btrfs: sysfs: provide framework to remove all fsid sysfs kobject
Just a helper function to clean up the sysfs fsid kobjects.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain 5a13f4308c Btrfs: sysfs: add pointer to access fs_info from fs_devices
Add an fs_info pointer to struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain c73eccf75b Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain 2e7910d6ca Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices
This patch will provide a framework and help to create attributes
from the structure btrfs_fs_devices which are available even before
fs_info is created. So by moving the parent kobject super_kobj from
fs_info to btrfs_fs_devices, it will help to create attributes
from the btrfs_fs_devices as well.

Patches on top of this patch now will be able to create the
sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
when devices are scanned and registered to the kernel.

Just to note, this does not change any of the existing btrfs sysfs
external kobject names or their attributes, nor their life cycle.
Changes are internal only. To ensure this, the patch has been tested
with various device operations, comparing the sysfs kobjects and
attributes against those produced without this patch; they remain
the same.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain 00c921c23f Btrfs: sysfs: separate device kobject and its attribute creation
Separate device kobject and its attribute creation so that device
kobject can be created from the device discovery thread.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain 0dd2906f72 Btrfs: sysfs: let default_attrs be separate from the kset
As of now btrfs_attrs are provided using the default_attrs through
the kset. Separate them and create the default_attrs using the
sysfs_create_files instead. By doing this we will have the
flexibility that device discovery thread could create fsid
kobject.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain 720592157e Btrfs: sysfs: introduce function btrfs_sysfs_add_fsid() to create sysfs fsid
We need it in a separate function so that it can be called from the
device discovery thread as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain 3a08f3b72a Btrfs: sysfs: rename __btrfs_sysfs_remove_one to btrfs_sysfs_remove_fsid
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain aaf1330516 Btrfs: sysfs: reorder the kobject creations
As of now the order in which the kobjects are created
at btrfs_sysfs_add_one() is..
 fsid
 features
 unknown features (dynamic features)
 devices.

Since we would move fsid and device kobject to fs_devices
from fs_info structure, this patch will reorder in which
the kobjects are created as below.
 fsid
 devices
 features
 unknown features (dynamic features)

And hence btrfs_sysfs_remove_one() will follow the same order
in reverse, and the device kobject destroy can now be moved
into the function __btrfs_sysfs_remove_one().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain 4d435731f9 Btrfs: sysfs: fix, check if device_dir_kobj is init before destroy
Since the failure code in the btrfs_sysfs_add_one() can
call btrfs_sysfs_remove_one() even before device_dir_kobj
has been created, we need to check if it's NULL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain 8345ea31dc Btrfs: sysfs: fix, kobject pointer clean up needed after kobject release
The sysfs clean up self test like in the below code fails, since
fs_info->device_dir_kobject still points to its stale kobject.
Resetting this pointer will help to fix this.

open_ctree()
{

ret = btrfs_sysfs_add_one(fs_info);
::
+       btrfs_sysfs_remove_one(fs_info);
+       ret = btrfs_sysfs_add_one(fs_info);
+       if (ret) {
+               pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
+               goto fail_block_groups;
+       }
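
The pointer reset described above can be illustrated with a small userspace
sketch. Every name here (sketch_kobj, sketch_fs_info, sketch_sysfs_add_one,
sketch_sysfs_remove_one) is hypothetical and only stands in for the kernel
structures; it is not the patch itself. It shows why a remove/add cycle only
works if the remove path clears the pointer it released.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's kobject or btrfs_fs_info. */
struct sketch_kobj { int initialized; };
struct sketch_fs_info { struct sketch_kobj *device_dir_kobj; };

/* Create the device directory object; refuses to run over a live pointer. */
int sketch_sysfs_add_one(struct sketch_fs_info *fs)
{
	assert(fs != NULL);
	if (fs->device_dir_kobj)
		return -1;	/* stale or live pointer: would double-init */
	fs->device_dir_kobj = calloc(1, sizeof(*fs->device_dir_kobj));
	if (!fs->device_dir_kobj)
		return -1;
	fs->device_dir_kobj->initialized = 1;
	return 0;
}

/* Tear down; a no-op when nothing was created, and safe to repeat. */
void sketch_sysfs_remove_one(struct sketch_fs_info *fs)
{
	assert(fs != NULL);
	if (!fs->device_dir_kobj)
		return;
	free(fs->device_dir_kobj);
	fs->device_dir_kobj = NULL;	/* without this, the next add sees a stale pointer */
}
```

With the reset in place, calling remove before any add, or add again after
remove, both behave; dropping the NULL assignment makes the second add fail
in this sketch.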

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain e7e1aa9c91 Btrfs: sysfs: fix, undo sysfs device links
Theoretically we need to remove the device link attributes. Since the entire device
kobject was removed anyway, this never caused a problem, but do it properly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain 4e51f005a2 Btrfs: sysfs: fix, fs_info kobject_unregister has init_completion() twice
kobject_unregister is to handle the release of the kobject,
its completion init is being called in btrfs_sysfs_add_one(),
so we don't have to do the same in the open_ctree() again.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:19 +02:00
Anand Jain 248d200df3 Btrfs: sysfs: fix, btrfs_release_super_kobj() should to clean up the kobject data
The following test case fails, indicating that a thread tried to init an initialized object.

kernel: [232104.016513] kobject (ffff880006c1c980): tried to init an initialized object, something is seriously wrong.

btrfs_sysfs_remove_one() self test code:

open_ctree()
{
 ::
        ret = btrfs_sysfs_add_one(fs_info);
        if (ret) {
                pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
                goto fail_block_groups;
        }
+       btrfs_sysfs_remove_one(fs_info);
+       ret = btrfs_sysfs_add_one(fs_info);
+       if (ret) {
+               pr_err("BTRFS: failed to init sysfs interface: %d\n", ret);
+               goto fail_block_groups;
+       }

Cleaning up the unregistered kobject fixes this.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:18 +02:00
Linus Torvalds 7ce14f6ff2 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I fixed up a regression from 4.0 where conversion between different
  raid levels would sometimes bail out without converting.

  Filipe tracked down a race where it was possible to double allocate
  chunks on the drive.

  Mark has a fix for fiemap.  All three will get bundled off for stable
  as well"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix regression in raid level conversion
  Btrfs: fix racy system chunk allocation when setting block group ro
  btrfs: clear 'ret' in btrfs_check_shared() loop
2015-05-23 11:14:10 -07:00
Mike Snitzer 326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.
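
The four-step pattern above can be sketched in userspace C. The struct and
function names (sketch_bio, sketch_bio_endio, sketch_hook_bio and so on) are
hypothetical miniatures, not the block layer API. The sketch models the
post-fix expectation only: completion runs whatever callback is currently
installed, with no remaining-counter bookkeeping for this pattern.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_bio;
typedef void (*sketch_end_io_t)(struct sketch_bio *);

/* Hypothetical miniature of a bio: only the fields the pattern touches. */
struct sketch_bio {
	sketch_end_io_t end_io;		/* completion callback currently installed */
	sketch_end_io_t saved_end_io;	/* step 1: the original callback */
	int original_calls;		/* how often the original callback ran */
};

/* Completion runs whatever callback is installed; no counter involved. */
void sketch_bio_endio(struct sketch_bio *bio)
{
	assert(bio != NULL);
	if (bio->end_io)
		bio->end_io(bio);
}

/* Stands in for the submitter's original completion handler. */
void sketch_original_end_io(struct sketch_bio *bio)
{
	bio->original_calls++;
}

/* Steps 3 and 4: restore the original callback, then complete again. */
void sketch_intermediate_end_io(struct sketch_bio *bio)
{
	bio->end_io = bio->saved_end_io;
	sketch_bio_endio(bio);
}

/* Steps 1 and 2: save the original callback and wire up the intermediate. */
void sketch_hook_bio(struct sketch_bio *bio)
{
	assert(bio != NULL);
	bio->saved_end_io = bio->end_io;
	bio->end_io = sketch_intermediate_end_io;
}
```

A caller hooks the bio once and completes it once; the original handler is
then expected to run exactly once, which is the behaviour the regression
broke.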

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Chris Mason 153c35b6cc Btrfs: fix regression in raid level conversion
Commit 2f0810880f changed
btrfs_set_block_group_ro to avoid trying to allocate new chunks with the
new raid profile during conversion.  This fixed failures when there was
no space on the drive to allocate a new chunk, but the metadata
reserves were sufficient to continue the conversion.

But this ended up causing a regression when the drive had plenty of
space to allocate new chunks, mostly because reduce_alloc_profile isn't
using the new raid profile.

Fixing btrfs_reduce_alloc_profile is a bigger patch.  For now, do a
partial revert of 2f0810880, and don't error out if we hit ENOSPC.

Signed-off-by: Chris Mason <clm@fb.com>
Tested-by: Dave Sterba <dsterba@suse.cz>
Reported-by: Holger Hoffstaette <holger.hoffstaette@googlemail.com>
2015-05-20 11:03:38 -07:00
Filipe Manana a96295965b Btrfs: fix racy system chunk allocation when setting block group ro
If while setting a block group read-only we end up allocating a system
chunk, through check_system_chunk(), we were not doing it while holding
the chunk mutex which is a problem if a concurrent chunk allocation is
happening, through do_chunk_alloc(), as it means both block groups can
end up using the same logical addresses and physical regions in the
device(s). So make sure we hold the chunk mutex.

Cc: stable@vger.kernel.org  # 4.0+
Fixes: 2f0810880f ("btrfs: delete chunk allocation attemp when
                      setting block group ro")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19 18:04:17 -07:00
Mark Fasheh 2c2ed5aa01 btrfs: clear 'ret' in btrfs_check_shared() loop
btrfs_check_shared() is leaking a return value of '1' from
find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
are told extents are shared when they are not. This in turn broke fiemap on
btrfs for kernels v3.18 and up.

The fix is simple - we just have to clear 'ret' after we are done processing
the results of find_parent_nodes().
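
The class of bug described above can be shown with a small userspace sketch.
The names and the SKETCH_FOUND_SHARED value are hypothetical and do not
mirror the real btrfs_check_shared() or find_parent_nodes() signatures. The
helper returns a positive status for ordinary progress, and the loop has to
reset 'ret' so that status is not mistaken for "shared" by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status code meaning "this extent is shared". */
enum { SKETCH_FOUND_SHARED = 100 };

/*
 * Hypothetical lookup: returns SKETCH_FOUND_SHARED if entry idx is shared,
 * otherwise 1 to signal ordinary progress (the value that used to leak).
 */
int sketch_find_parents(const int *shared_flags, size_t idx)
{
	if (shared_flags[idx])
		return SKETCH_FOUND_SHARED;
	return 1;
}

/* Returns 1 if any entry is shared, 0 if none, negative on error. */
int sketch_check_shared(const int *shared_flags, size_t n)
{
	int ret = 0;
	size_t i;

	assert(shared_flags != NULL || n == 0);
	for (i = 0; i < n; i++) {
		ret = sketch_find_parents(shared_flags, i);
		if (ret == SKETCH_FOUND_SHARED) {
			ret = 1;
			break;
		}
		if (ret < 0)
			break;
		ret = 0;	/* without this, the helper's 1 leaks to the caller */
	}
	return ret;
}
```

Removing the `ret = 0` line makes the function report 1 for input with no
shared entries, which is the misreport fiemap callers saw.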

It wasn't clear to me at first what was happening with return values in
btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
on irc. I added documentation to both functions to make things more clear
for the next hacker who might come across them.

If we could queue this up for -stable too that would be great.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19 18:04:17 -07:00
Christoph Hellwig b25de9d6da block: remove BIO_EOPNOTSUPP
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:03 -06:00
Linus Torvalds c7309e88a6 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The first commit is a fix from Filipe for a very old extent buffer
  reuse race that triggered a BUG_ON.  It hasn't come up often, I looked
  through old logs at FB and we hit it a handful of times over the last
  year.

  The rest are other corners he hit during testing"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
  Btrfs: fix race between block group creation and their cache writeout
  Btrfs: fix panic when starting bg cache writeout after IO error
  Btrfs: fix crash after inode cache writeback failure
2015-05-16 15:50:58 -07:00
Filipe Manana 062c19e9dd Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us hit the following BUG_ON at
btrfs_release_extent_buffer_page:

    BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

      CPU 0                                                    CPU 1                                                CPU 2

path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X

btrfs_release_path(path)
    unlocks X

                                                      reads eb X
                                                         X->refs incremented to 3
                                                      locks eb X
                                                      btrfs_del_items(X)
                                                         X becomes empty
                                                         clean_tree_block(X)
                                                             clear EXTENT_BUFFER_DIRTY from X
                                                         btrfs_del_leaf(X)
                                                             unlocks X
                                                             extent_buffer_get(X)
                                                                X->refs incremented to 4
                                                             btrfs_free_tree_block(X)
                                                                X's range is not pinned
                                                                X's range added to free
                                                                  space cache
                                                             free_extent_buffer_stale(X)
                                                                lock X->refs_lock
                                                                set EXTENT_BUFFER_STALE on X
                                                                release_extent_buffer(X)
                                                                    X->refs decremented to 3
                                                                    unlocks X->refs_lock
                                                      btrfs_release_path()
                                                         unlocks X
                                                         free_extent_buffer(X)
                                                             X->refs becomes 2

                                                                                                      __btrfs_cow_block(Y)
                                                                                                          btrfs_alloc_tree_block()
                                                                                                              btrfs_reserve_extent()
                                                                                                                  find_free_extent()
                                                                                                                      gets offset == X->start
                                                                                                              btrfs_init_new_buffer(X->start)
                                                                                                                  btrfs_find_create_tree_block(X->start)
                                                                                                                      alloc_extent_buffer(X->start)
                                                                                                                          find_extent_buffer(X->start)
                                                                                                                              finds eb X in radix tree

    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

                                                                                                                              increments X->refs to 3
                                                                                                                              mark_extent_buffer_accessed(X)
                                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                                    --> does nothing,
                                                                                                                                        X->refs >= 2 and
                                                                                                                                        EXTENT_BUFFER_TREE_REF
                                                                                                                                        is set in X
                                                                                                              clear EXTENT_BUFFER_STALE from X
                                                                                                              locks X
                                                                                                          btrfs_mark_buffer_dirty()
                                                                                                              set_extent_buffer_dirty(X)
                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                     --> does nothing, X->refs >= 2 and
                                                                                                                         EXTENT_BUFFER_TREE_REF is set
                                                                                                                  sets EXTENT_BUFFER_DIRTY on X

            test and clear EXTENT_BUFFER_TREE_REF
            decrements X->refs to 2
        release_extent_buffer(X)
            decrements X->refs to 1
            unlock X->refs_lock

                                                                                                      unlock X
                                                                                                      free_extent_buffer(X)
                                                                                                          lock X->refs_lock
                                                                                                          release_extent_buffer(X)
                                                                                                              decrements X->refs to 0
                                                                                                              btrfs_release_extent_buffer_page(X)
                                                                                                                   BUG_ON(extent_buffer_under_io(X))
                                                                                                                       --> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent_buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A cleaner alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock, but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.
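
One way to realize the wait described above is to use the refs_lock purely as
a barrier: take the reference locklessly, and only when the stale flag is set
acquire and immediately release the lock so that any task in the middle of
freeing has finished. The following is a userspace sketch under that
assumption, using C11 atomics and a toy spinlock; the type and function names
are hypothetical and this is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of an extent buffer. */
struct sketch_eb {
	atomic_int refs;
	atomic_bool stale;
	atomic_flag refs_lock;	/* toy spinlock */
};

void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Lockless lookup: take a reference only if the count is not already zero. */
struct sketch_eb *sketch_find_eb(struct sketch_eb *eb)
{
	int old;

	assert(eb != NULL);
	old = atomic_load(&eb->refs);
	do {
		if (old == 0)
			return NULL;	/* already on its way to being freed */
	} while (!atomic_compare_exchange_weak(&eb->refs, &old, old + 1));

	if (atomic_load(&eb->stale)) {
		/* Barrier only: wait out a concurrent free, hold nothing. */
		sketch_lock(&eb->refs_lock);
		sketch_unlock(&eb->refs_lock);
	}
	return eb;
}
```

The common case (buffer not stale) never touches the lock, which keeps the
lookup cheap; only lookups that race with a stale buffer pay for the
lock/unlock pair.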

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

  while true; do
     mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
     mount /dev/sdd /mnt
     snapshot_cmd="btrfs subvolume snapshot -r /mnt"
     snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
     fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
     umount /mnt
  done

Which usually triggers the BUG_ON within less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:11 -07:00
Filipe Manana ff1f8250a9 Btrfs: fix race between block group creation and their cache writeout
So creating a block group has 2 distinct phases:

Phase 1 - creates the btrfs_block_group_cache item and adds it to the
rbtree fs_info->block_group_cache_tree and to the corresponding list
space_info->block_groups[];

Phase 2 - adds the block group item to the extent tree and corresponding
items to the chunk tree.

The first phase adds the block_group_cache_item to a list of pending block
groups in the transaction handle, and phase 2 happens when
btrfs_end_transaction() is called against the transaction handle.

It happens that once phase 1 completes, other concurrent tasks that use
their own transaction handle, but point to the same running transaction
(struct btrfs_trans_handle->transaction), can use this block group for
space allocations and therefore mark it dirty. Dirty block groups are
tracked in a list belonging to the currently running transaction (struct
btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).

This is a problem because once a task calls btrfs_commit_transaction(),
it calls btrfs_start_dirty_block_groups() which will see all dirty block
groups and attempt to start their writeout, including those that are
still attached to the transaction handle of some concurrent task that
hasn't called btrfs_end_transaction() yet - which means those block
groups haven't gone through phase 2 yet and therefore when
write_one_cache_group() is called, it won't find the block group items
in the extent tree and abort the current transaction with -ENOENT,
turning the fs into readonly mode and requiring a remount.

Fix this by ignoring -ENOENT when looking for block group items in the
extent tree when we attempt to start the writeout of the block group
caches outside the critical section of the transaction commit. We will
try again later during the critical section and if there we still don't
find the block group item in the extent tree, we then abort the current
transaction.
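
The decision described above can be written down as a small function. This
is a sketch with hypothetical names (sketch_action, sketch_write_cache_group),
not the signature of write_one_cache_group(): a missing block group item is
tolerated outside the critical section and treated as fatal inside it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one writeout attempt. */
enum sketch_action { SKETCH_WRITTEN, SKETCH_RETRY_LATER, SKETCH_ABORT };

/*
 * lookup_err: 0 if the block group item was found, or a negative errno.
 * in_critical_section: true during the transaction commit's critical section.
 */
enum sketch_action sketch_write_cache_group(int lookup_err,
					    bool in_critical_section)
{
	assert(lookup_err <= 0);
	if (lookup_err == 0)
		return SKETCH_WRITTEN;
	if (lookup_err == -ENOENT && !in_critical_section)
		return SKETCH_RETRY_LATER;	/* phase 2 may not have run yet */
	return SKETCH_ABORT;
}
```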

This issue happened twice, once while running fstests btrfs/067 and once
for btrfs/078, which produced the following trace:

[ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[ 3278.707329] BTRFS: Transaction aborted (error -2)
(...)
[ 3278.731555] Call Trace:
[ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
[ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
[ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
[ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Filipe Manana 28aeeac1dd Btrfs: fix panic when starting bg cache writeout after IO error
When waiting for the writeback of block group cache we returned
immediately if there was an error during writeback without waiting
for the ordered extent to complete. This left a short time window
where if some other task attempts to start the writeout for the same
block group cache it can attempt to add a new ordered extent, starting
at the same offset (0) before the previous one is removed from the
ordered tree, causing an ordered tree panic (calls BUG()).

This normally doesn't happen in other write paths, such as buffered
writes or direct IO writes for regular files, since before marking
page ranges dirty we lock the ranges and wait for any ordered extents
within the range to complete first.

Fix this by making btrfs_wait_ordered_range() not return immediately
if it gets an error from the writeback, waiting for all ordered extents
to complete first.
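
The change in control flow can be sketched as follows. The names
(sketch_ordered, sketch_wait_ordered_range) are hypothetical and the "wait"
is only modelled by setting a flag; the point is that an error is remembered
rather than returned early, so every ordered extent in the range is still
waited on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ordered extent: its completion status and a "waited" marker. */
struct sketch_ordered {
	int err;	/* 0 or a negative errno reported on completion */
	int waited;	/* set once we have waited for this extent */
};

/*
 * Wait for every ordered extent even if writeback already failed.
 * Returns the first error seen (writeback error first), or 0.
 */
int sketch_wait_ordered_range(struct sketch_ordered *oe, size_t n,
			      int writeback_err)
{
	int ret = writeback_err;
	size_t i;

	assert(oe != NULL || n == 0);
	for (i = 0; i < n; i++) {
		oe[i].waited = 1;		/* stands in for blocking until completion */
		if (oe[i].err && !ret)
			ret = oe[i].err;	/* remember the first error, keep going */
	}
	return ret;
}
```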

This issue happened often when running the fstest btrfs/088 and it's
easy to trigger it by running in a loop until the panic happens:

  for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done

[17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
[17156.864052] ------------[ cut here ]------------
[17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
(...)
[17156.864052] Call Trace:
[17156.864052]  [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
[17156.864052]  [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs]
[17156.864052]  [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs]
[17156.864052]  [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
[17156.864052]  [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs]
[17156.864052]  [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
[17156.864052]  [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59
[17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs]
[17156.864052]  [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce
[17156.864052]  [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
[17156.864052]  [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs]
[17156.864052]  [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs]
[17156.864052]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
[17156.864052]  [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61
[17156.864052]  [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15
[17156.864052]  [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
[17156.864052]  [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
[17156.864052]  [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[17156.864052]  [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs]
[17156.864052]  [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
[17156.864052]  [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
[17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[17156.864052]  [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs]

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Filipe Manana e43699d4b4 Btrfs: fix crash after inode cache writeback failure
If the writeback of an inode cache failed we were unnecessarily
attempting to release again the delalloc metadata that we previously
reserved. However attempting to do this a second time triggers an
assertion at drop_outstanding_extent() because we have no more
outstanding extents for our inode cache's inode. If we were able
to start writeback of the cache the reserved metadata space is
released at btrfs_finished_ordered_io(), even if an error happens
during writeback.

So make sure we don't repeat the metadata space release if writeback
started for our inode cache.

This issue was trivial to reproduce by running the fstest btrfs/088
with "-o inode_cache", which triggered the assertion leading to a
BUG() call and requiring a reboot in order to run the remaining
fstests. Trace produced by btrfs/088:

[255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
[255289.388094] ------------[ cut here ]------------
[255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
[255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[255289.392068] Call Trace:
[255289.392068]  [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
[255289.392068]  [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
[255289.392068]  [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
[255289.392068]  [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
[255289.392068]  [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
[255289.392068]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[255289.392068]  [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
[255289.392068]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[255289.392068]  [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
(...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:10 -07:00
Linus Torvalds af6472881a Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "When an arm user reported crashes near page_address(page) in my new
  code, it became clear that I can't be trusted with GFP masks.  Filipe
  beat me to the patch, and I'll just be in the corner with my dunce cap
  on"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong mapping flags for free space inode
2015-05-08 20:59:02 -07:00
Filipe Manana 1d3c61c2eb Btrfs: fix wrong mapping flags for free space inode
We were passing a flags value that differed from the intention in commit
2b10826800 ("Btrfs: don't use highmem for free space cache pages").

This caused problems on an ARM machine, leaving btrfs unusable there.

Reported-by: Merlijn Wajer <merlijn@wizzup.org>
Tested-by: Merlijn Wajer <merlijn@wizzup.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-06 17:06:13 -07:00
Jens Axboe dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
The most common use case is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having a valid count, and we must properly honor the reference
count when it's being put.
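
A minimal single-threaded sketch of this idea follows. It is an assumption-laden illustration, not the block layer's code: struct obj, obj_init(), obj_get() and obj_put() are hypothetical names, and a plain int stands in for the atomic counter.

```c
#include <stdbool.h>

struct obj {
	int refs;        /* plain int here; an atomic counter in real code */
	bool refcounted; /* set once somebody takes an extra reference */
	bool freed;      /* stands in for actually freeing the object */
};

/* A freshly allocated object has one implicit reference. */
static void obj_init(struct obj *o)
{
	o->refs = 1;
	o->refcounted = false;
	o->freed = false;
}

/* Taking an extra reference marks the count as one that must be honored. */
static void obj_get(struct obj *o)
{
	o->refcounted = true;
	o->refs++;
}

static void obj_put(struct obj *o)
{
	if (!o->refcounted) {
		/* Fast path: nobody else ever took a reference, so the
		 * counter is not touched at all. */
		o->freed = true;
		return;
	}
	/* Slow path: honor the count. */
	if (--o->refs == 0)
		o->freed = true;
}
```

In the common allocate, submit, complete sequence only the fast path runs, which is where the saving comes from.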

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Linus Torvalds 64887b6882 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "A few more btrfs fixes.

  These range from corners Filipe found in the new free space cache
  writeback to a grab bag of fixes from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
  Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
  btrfs: unlock i_mutex after attempting to delete subvolume during send
  btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
  btrfs: fix race on ENOMEM in alloc_extent_buffer
  btrfs: handle ENOMEM in btrfs_alloc_tree_block
  Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
  Btrfs: don't check for delalloc_bytes in cache_save_setup
  Btrfs: fix deadlock when starting writeback of bg caches
  Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-05-01 07:46:21 -07:00
Forrest Liu 5d2361db48 Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
btrfs_release_extent_buffer_page() can't properly handle a dummy extent
buffer allocated by btrfs_clone_extent_buffer(). That is because the
reference count of pages allocated by btrfs_clone_extent_buffer()
is 2: one from alloc_page() and another from attach_extent_buffer_page().

Running the following command repeatedly reproduces the memory leak:

    btrfs inspect-internal inode-resolve 256 /mnt/btrfs

Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-29 13:22:09 -07:00
Linus Torvalds f583381f50 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe hit two problems in my block group cache patches.  We finalized
  the fixes last week and ran through more tests"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: prevent list corruption during free space cache processing
  Btrfs: fix inode cache writeout
2015-04-26 17:40:30 -07:00
Linus Torvalds 9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Yang Dongsheng 6e17d30bfa Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
We need to fill the inode when we find a node for it in delayed_nodes_tree.
However, we currently do not fill ->last_trans, which causes the test
xfstests generic/311 to fail. The scenario in 311 is as follows:

Problem:
	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
	(2). pwrite(test_fd, buf, 4096, 0)
	(3). close(test_fd)
	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
	(6). fsync(test_fd);
				<-------- we did not get the correct log entry for the file
Reason:
	When we re-open this file in (5), we find a node
in delayed_nodes_tree and fill the inode we are looking up with its
information. But ->last_trans is not filled, so fsync() checks
->last_trans, finds it is 0, and concludes that this inode is
already in our committed tree, so it does not record the extents
for it.

Fix:
	This patch fills ->last_trans properly and sets the
runtime_flags if needed in this situation. Then we get the
log entries we expect after (6) and generic/311 passes.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:03 -07:00
Omar Sandoval 909e26dce3 btrfs: unlock i_mutex after attempting to delete subvolume during send
Whenever the check for a send in progress introduced in commit
521e0546c9 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.000059] ================================================
[  +0.000028] [ BUG: lock held when returning to user space! ]
[  +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[  +0.000026] ------------------------------------------------
[  +0.000029] btrfs/211 is leaving the kernel with locks still held!
[  +0.000029] 1 lock held by btrfs/211:
[  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:02 -07:00
Omar Sandoval b86054540e btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.

Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:01 -07:00
Omar Sandoval 5ca64f45e9 btrfs: fix race on ENOMEM in alloc_extent_buffer
Consider the following interleaving of overlapping calls to
alloc_extent_buffer:

Call 1:

- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
  mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.

The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:00 -07:00
Omar Sandoval 67b7859e9b btrfs: handle ENOMEM in btrfs_alloc_tree_block
This is one of the first places to give out when memory is tight. Handle
it properly rather than with a BUG_ON.

Also fix the comment about the return value, which is an ERR_PTR, not
NULL, on error.
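
For readers unfamiliar with the ERR_PTR convention mentioned above, here is a user-space sketch of it. The three helpers mirror the kernel's names but are re-implemented here for illustration; alloc_block() and its simulate_enomem parameter are hypothetical and are not the btrfs function.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in the top, never-valid range of pointer values. */
static inline void *ERR_PTR(intptr_t error)
{
	return (void *)error;
}

static inline intptr_t PTR_ERR(const void *ptr)
{
	return (intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator: on failure it returns an encoded errno rather
 * than NULL, and it never aborts the program. */
static void *alloc_block(size_t size, int simulate_enomem)
{
	void *p;

	if (simulate_enomem)
		return ERR_PTR(-ENOMEM);
	p = malloc(size);
	if (!p)
		return ERR_PTR(-ENOMEM);
	return p;
}
```

Callers then test the result with IS_ERR() and propagate PTR_ERR() upward; a NULL check alone would miss the failure.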

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:27:00 -07:00
Forrest Liu 1b98450816 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
If the device tree has a hole, find_free_dev_extent() cannot find an
available address properly.

The problem can be reproduced with the following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait for the cleaner thread to remove unused block groups
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

The above script creates a device tree with one big hole. Only one chunk
can then be allocated in a transaction, so allocating a new chunk for
$mntpath/bbb fails.

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:59 -07:00
Chris Mason e4c88f007b Btrfs: don't check for delalloc_bytes in cache_save_setup
Now that we're doing free space cache writeback outside the critical
section in the commit, there is a bigger window for delalloc_bytes to
be added after a cache has been written.  find_free_extent may do this
without putting the block group back into the dirty list, and also
without a transaction running.

Checking for delalloc_bytes in cache_save_setup means we might leave the
cache marked as written without invalidating it.  Consistency checks
during mount will toss the cache, but it's better to get rid of the
check in cache_save_setup and let it get invalidated by the checks
already done during cache write out.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:58 -07:00
Filipe Manana 24b89d08ef Btrfs: fix deadlock when starting writeback of bg caches
While starting the writes of the dirty block group caches, if we didn't
find a block group item in the extent tree we were leaving without
releasing our path, running delayed references and then looping again to
process any new dirty block groups. However, this second iteration of the
loop could cause a deadlock: it tries to lock some other extent tree
node/leaf that another task has already locked, while that task is itself
blocked waiting for a lock on a node/leaf in our path, which we never
released.
We could also deadlock when running the delayed references, as we could
end up trying to lock the same nodes/leaves that we have in our local path
(with a different lock type).

I got into such a case when running xfstests:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
(...)
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly
[20892.298222] ------------[ cut here ]------------
[20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
(...)
[20892.326253] Call Trace:
[20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
[20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
[20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
[20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
[20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.361015] ---[ end trace 597f77e664245374 ]---
[21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
[21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.703696] Call Trace:
[21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
[21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
[21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
[21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
[21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
[21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
[21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
[21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
[21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
(...)
[21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
[21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.768457] Call Trace:
[21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
[21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
[21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
[21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
[21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
[21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
[21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
[21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
[21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
(...)
[21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
[21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.899960] Call Trace:
[21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
[21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
[21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
[21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
[21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
[21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
[21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
[21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
[21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
[21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
[21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
(...)

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:37 -07:00
Filipe Manana b58d1a9ef9 Btrfs: fix race between start dirty bg cache writeout and bg deletion
While running xfstests I ran into the following:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
[20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
[20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
[20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
[20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
[20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly

This happens because in btrfs_start_dirty_block_groups() we splice the
transaction's list of dirty block groups into a local list and then we
keep extracting the first element of the list without holding the
cache_write_mutex mutex. This means that before we acquire that mutex
the first block group on the list might be removed by a concurrent task
running btrfs_remove_block_group(). So make sure we extract the first
element (and test the list emptiness) while holding that mutex.
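
The rule described above, testing for emptiness and taking the first element inside one critical section, can be sketched with a pthread mutex. This is a user-space illustration only: struct node, struct work_list and pop_first() are hypothetical names, and the real code uses the kernel's list and mutex primitives.

```c
#include <pthread.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

struct work_list {
	pthread_mutex_t lock;
	struct node *head;
};

/* Check for emptiness and detach the first element under the same lock,
 * so a concurrent remover cannot take the element away between the check
 * and the extraction. Returns NULL when the list is empty. */
static struct node *pop_first(struct work_list *wl)
{
	struct node *n;

	pthread_mutex_lock(&wl->lock);
	n = wl->head;
	if (n) {
		wl->head = n->next;
		n->next = NULL;
	}
	pthread_mutex_unlock(&wl->lock);
	return n;
}
```

Any task that removes elements from the same list has to take the same lock for this to help.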

Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
                      outside critical section in commit")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:37 -07:00
Jens Axboe fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
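
The effect of such a flag can be sketched as follows. Everything here is hypothetical and simplified: SKIP_COUNT_FLAG, struct fake_inode and do_direct_io() are illustrative names, the flag value is arbitrary, and a plain int replaces the shared atomic counter.

```c
#define SKIP_COUNT_FLAG 0x08 /* illustrative value for this sketch */

struct fake_inode {
	int dio_count; /* in-flight direct IO count; atomic in real code */
	int peak;      /* highest dio_count observed, for demonstration */
};

/* Bump the in-flight counter around the IO unless the caller says it
 * does not need that protection (for example, it cannot be truncated). */
static void do_direct_io(struct fake_inode *inode, int flags)
{
	if (!(flags & SKIP_COUNT_FLAG))
		inode->dio_count++;

	if (inode->dio_count > inode->peak)
		inode->peak = inode->dio_count;
	/* ... the actual IO would be issued here ... */

	if (!(flags & SKIP_COUNT_FLAG))
		inode->dio_count--;
}
```

A caller passing the flag never writes to the shared counter, so it avoids the contention described in this message.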

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Chris Mason a3bdccc4e6 Btrfs: prevent list corruption during free space cache processing
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
a list of bitmaps to record in the free space cache.  It was dropping
the lock while it worked on other components, which made a window for
free_bitmap() to free the bitmap struct without removing it from the
list.

This changes things to hold the lock the whole time, and also makes sure
we hold the lock during enospc cleanup.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-24 11:52:25 -07:00
Linus Torvalds ba0e4ae88f Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "I've been running these through a longer set of load tests because my
  commits change the free space cache writeout.  It fixes commit stalls
  on large filesystems (~20T space used and up) that we have been
  triggering here.  We were seeing new writers blocked for 10 seconds or
  more during commits, which is far from good.

  Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
  more), that are triggered because our metadata reservations were not
  properly accounting for crcs and were not replenishing during the
  truncate.

  Also in this series, a number of qgroup fixes from Fujitsu and Dave
  Sterba collected most of the pending cleanups from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
  btrfs: quota: Update quota tree after qgroup relationship change.
  btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
  btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
  btrfs: Update btrfs qgroup status item when rescan is done.
  btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
  btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
  btrfs: Check qgroup level in kernel qgroup assign.
  btrfs: qgroup: allow to remove qgroup which has parent but no child.
  btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
  btrfs: qgroup: do a reservation in a higher level.
  Btrfs: qgroup, Account data space in more proper timings.
  Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
  Btrfs: qgroup: free reserved in exceeding quota.
  Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
  btrfs: qgroup: fix limit args override whole limit struct
  btrfs: qgroup: update limit info in function btrfs_run_qgroups().
  btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
  btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
  btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
  btrfs: Support busy loop of write and delete
  ...
2015-04-24 07:40:02 -07:00
Chris Mason 85db36cfb3 Btrfs: fix inode cache writeout
The code to fix stalls during free space cache IO wasn't using
the correct root when waiting on the IO for inode caches.  This
is only a problem when the inode cache is enabled with

mount -o inode_cache

This fixes the inode cache writeout to preserve any error values and
makes sure not to override the root when inode cache writeout is done.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-23 17:47:34 -07:00
David Howells 2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00