Commit Graph

523 Commits

Author SHA1 Message Date
Chris Mason f82c458a2c btrfs: fix lockups from btrfs_clear_path_blocking
The fair reader/writer locks mean that btrfs_clear_path_blocking needs
to strictly follow lock ordering rules even when we already have
blocking locks on a given path.

Before we can clear a blocking lock on the path, we need to make sure
all of the locks have been converted to blocking.  This will remove lock
inversions against anyone spinning in write_lock() against the buffers
we're trying to get read locks on.  These inversions didn't exist before
the fair read/writer locks, but now we need to be more careful.

We papered over this deadlock in the past by changing
btrfs_try_read_lock() to be a true trylock against both the spinlock and
the blocking lock.  This was slower, and not sufficient to fix all the
deadlocks.  This patch adds a btrfs_tree_read_lock_atomic(), which
basically means get the spinlock but trylock on the blocking lock.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Patrick Schmid <schmid@phys.ethz.ch>
cc: stable@vger.kernel.org #v3.15+
2014-11-19 10:34:35 -08:00
Chris Mason 27b19cc886 Merge branch 'cleanup/blocksize-diet-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:14 -07:00
Chris Mason bbf65cf0b5 Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/extent_io.c
2014-10-04 09:56:45 -07:00
Fabian Frederick b99d9a6a4a btrfs: fix shadow warning on cmp
cmp was declared twice in btrfs_compare_trees resulting in a shadow
warning. This patch renames second internal variable.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
David Sterba fccb84c94a btrfs: move checks for DUMMY_ROOT into a helper
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba 7ec20afbcb btrfs: new define for the inline extent data start
Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like "Btrfs: fix wrong max inline data
size limit" fixed.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba 4d75f8a9c8 btrfs: remove blocksize from btrfs_alloc_free_block and rename
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:54 +02:00
David Sterba 0308af4465 btrfs: remove unused parameter blocksize from btrfs_find_tree_block
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:52 +02:00
David Sterba ce86cd5917 btrfs: remove parameter blocksize from read_tree_block
We know the tree block size, no need to pass it around.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:50 +02:00
David Sterba 58dc4ce432 btrfs: remove unused parameter from readahead_tree_block
The parent_transid parameter has been unused since its introduction in
ca7a79ad8d ("Pass down the expected generation number when reading
tree blocks").  In reada_tree_block, it was even wrongly set to leafsize.
Transid check is done in the proper read and readahead ignores errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:30:18 +02:00
Filipe Manana f98de9b9c0 Btrfs: make btrfs_search_forward return with nodes unlocked
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:02 -07:00
Filipe Manana 160f4089c8 Btrfs: avoid unnecessary switch of path locks to blocking mode
If we need to cow a node, increase the write lock level and retry the
tree search, there's no point of changing the node locks in our path
to blocking mode, as we only waste time and unnecessarily wake up other
tasks waiting on the spinning locks (just to block them again shortly
after) because we release our path before repeating the tree search.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:56 -07:00
Filipe Manana 24cdc847d9 Btrfs: unlock nodes earlier when inserting items in a btree
In ctree.c:setup_items_for_insert(), we can unlock all nodes in our
path before we process the leaf (shift items and data, adjust data
offsets, etc). This allows for better btree concurrency, as we're
often holding a write lock on at least the node at level 1.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:55 -07:00
David Sterba 707e8a0715 btrfs: use nodesize everywhere, kill leafsize
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

  text    data     bss     dec     hex filename
852418   24560   23112  900090   dbbfa btrfs.ko.before
851074   24584   23112  898770   db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:14 -07:00
Josef Bacik e339a6b097 Btrfs: __btrfs_mod_ref should always use no_quota
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:11 -07:00
Liu Bo 0b43e04f70 Btrfs: fix leaf corruption after __btrfs_drop_extents
Several reports about leaf corruption has been floating on the list, one of them
points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
after __btrfs_drop_extents(), it's really a rare case but it does exist.

The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().

So in btrfs_next_leaf(), we release the current path to re-search the last key of
the leaf for locating next leaf, and we've taken it into account that there might
be balance operations between leafs during this 'unlock and re-lock' dance, so
we check the path again and advance it if there are now more items available.
But things are a bit different if that last key happens to be removed and balance
gets a bigger key as the last one, and btrfs_search_slot will return it with
ret > 0, IOW, nothing change in this leaf except the new last key, then we think
we're okay because there is no more item balanced in, fine, we thinks we can
go to the next leaf.

However, we should return that bigger key, otherwise we deserve leaf corruption,
for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extent matched the required range and finish_ordered_io can
safely insert a new extent, but it actually doesn't and ends up a leaf
corruption.

One may be asking that why our locking on extent io tree doesn't work as
expected, ie. it should avoid this kind of race situation.  But in
__btrfs_drop_extents(), we don't always find extents which are included within
our locking range, IOW, extents can start before our searching start, in this
case locking on extent io tree doesn't protect us from the race.

This takes the special case into account.

Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:10 -07:00
Filipe Manana 337c6f6830 Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
We might have had an item with the previous key in the tree right
before we released our path. And after we released our path, that
item might have been pushed to the first slot (0) of the leaf we
were holding due to a tree balance. Alternatively, an item with the
previous key can exist as the only element of a leaf (big fat item).
Therefore account for these 2 cases, so that our callers (like
btrfs_previous_item) don't miss an existing item with a key matching
the previous key we computed above.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:09 -07:00
Josef Bacik faa2dbf004 Btrfs: add sanity tests for new qgroup accounting code
This exercises the various parts of the new qgroup accounting code.  We do some
basic stuff and do some things with the shared refs to make sure all that code
works.  I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:49 -07:00
Josef Bacik fcebe4562d Btrfs: rework qgroup accounting
Currently qgroups account for space by intercepting delayed ref updates to fs
trees.  It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly.  The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together.  Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc.  This patch
accomplishes this by only adding qgroup operations for real ref changes.  We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time.  This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq:  we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence.  This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point.  This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:48 -07:00
Miao Xie 27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Josef Bacik 3f8a18cc53 Btrfs: hold the commit_root_sem when getting the commit root during send
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root.  We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07 09:08:39 -07:00
Josef Bacik 9e351cc862 Btrfs: remove transaction from send
Lets try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:39:30 -07:00
Filipe Manana 6baa4293af Btrfs: correctly determine if blocks are shared in btrfs_compare_trees
Just comparing the pointers (logical disk addresses) of the btree nodes is
not completely bullet proof, we have to check if their generation numbers
match too.

It is guaranteed that a COW operation will result in a block with a different
logical disk address than the original block's address, but over time we can
reuse that former logical disk address.

For example, creating a 2Gb filesystem on a loop device, and having a script
running in a loop always updating the access timestamp of a file, resulted in
the same logical disk address being reused for the same fs btree block in about
only 4 minutes.

This could make us skip entire subtrees when doing an incremental send (which
is currently the only user of btrfs_compare_trees). However the odds of getting
2 blocks at the same tree level, with the same logical disk address, equal first
slot keys and different generations, should hopefully be very low.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:50 -04:00
Filipe David Borba Manana 23c6bf6a91 Btrfs: fix btrfs_search_slot_for_read backwards iteration
If the current path's leaf slot is 0, we do search for the previous
leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a
value corresponding to the number of items - 1 of the former leaf.
Fix this by using the slot set by btrfs_prev_leaf, decrementing it
by 1 if it's equal to the leaf's number of items.

Use of btrfs_search_slot_for_read() for backward iteration is used in
particular by the send feature, which could miss items when the input
leaf has less items than its previous leaf.

This could be reproduced by running btrfs/007 from xfstests in a loop.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-29 07:06:28 -08:00
Wang Shilong ade2e0b3ee Btrfs: fix to search previous metadata extent item since skinny metadata
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.

To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:33 -08:00
Filipe David Borba Manana eb653de159 Btrfs: reduce btree node locking duration on item update
If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).

Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).

This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.

The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).

  sysbench --test=fileio --file-num=131072 --file-total-size=8G \
    --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Before this change:  49.85Mb/s (average of 5 runs)
After this change:   50.38Mb/s (average of 5 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:11 -08:00
Frank Holton efe120a067 Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:05 -08:00
Filipe David Borba Manana 5de865eebb Btrfs: fix tree mod logging
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:

  btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
      --- tests/btrfs/004.out	2013-11-26 18:25:29.263333714 +0000
      +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad	2013-12-10 15:25:10.327518516 +0000
      @@ -1,3 +1,8 @@
       QA output created by 004
       *** test backref walking
      -*** done
      +unexpected output from
      +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
      +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
      +
       ...
       (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
  Ran: btrfs/004
  Failures: btrfs/004
  Failed 1 of 1 tests

But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:

  $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
  inode 405 offset 454656 root 258
  inode 405 offset 454656 root 5

It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:

[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G        W  O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979]  000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982]  0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984]  ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991]  [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994]  [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997]  [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003]  [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005]  [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010]  [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019]  [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027]  [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034]  [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056]  [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058]  [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065]  [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071]  [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078]  [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080]  [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083]  [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085]  [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088]  [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090]  [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093]  [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096]  [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098]  [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100]  [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102]  [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0

These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:

  c8cc634165
  (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)

That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.

This issue has been experienced by several users recently, such as for example:

  http://www.spinics.net/lists/linux-btrfs/msg28574.html

After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:20:04 -08:00
Filipe David Borba Manana 7835776635 Btrfs: return immediately if tree log mod is not necessary
In ctree.c:tree_mod_log_set_node_key() we were calling
__tree_mod_log_insert_key() even when the modification doesn't need
to be logged. This would allocate a tree_mod_elem structure, fill it
and pass it to  __tree_mod_log_insert(), which would just acquire
the tree mod log write lock and then free the tree_mod_elem structure
and return (that is, a no-op).

Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
which just returns immediately if the modification doesn't need to be
logged (without allocating the structure, fill it, acquire write lock,
free structure).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:56 -08:00
Filipe David Borba Manana 2ef1fed285 Btrfs: more efficient push_leaf_right
Currently when finding the leaf to insert a key into a btree, if the
leaf doesn't have enough space to store the item we attempt to move
off some items from our leaf to its right neighbor leaf, and if this
fails to create enough free space in our leaf, we try to move off more
items to the left neighbor leaf as well.

When trying to move off items to the right neighbor leaf, if it has
enough room to store the new key but not not enough room to move off
at least one item from our target leaf, __push_leaf_right returns 1 and
we have to attempt to move items to the left neighbor (push_leaf_left
function) without touching the right neighbor leaf.
For the case where the right leaf has enough room to store at least 1
item from our leaf, we end up modifying (and dirtying) both our leaf
and the right leaf. This is non-optimal for the case where the new key
is greater than any key in our target leaf because it can be inserted at
slot 0 of the right neighbor leaf and we don't need to touch our leaf
at all nor to attempt to move off items to the left neighbor leaf.

Therefore this change just selects the right neighbor leaf as our new
target leaf if it has enough room for the new key without modifying our
initial target leaf - we do this only if the new key is higher than any
key in the initial target leaf.

While running the following test, push_leaf_right was called by split_leaf
4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
special case (right leaf has enough room and new key is higher than any key
in the initial target leaf).

Test:

  sysbench --test=fileio --file-num=512 --file-total-size=5G \
    --file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
    --max-requests=100000 --file-io-mode=sync [prepare|run]

Results:

sequential writes

Throughput before this change: 65.71Mb/sec (average of 10 runs)
Throughput after this change:  66.58Mb/sec (average of 10 runs)

random writes

Throughput before this change: 10.75Mb/sec (average of 10 runs)
Throughput after this change:  11.56Mb/sec (average of 10 runs)

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:53 -08:00
Filipe David Borba Manana 5a4267ca20 Btrfs: try harder to avoid btree node splits
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.

After 5 runs of the following test, I got a smaller number of btree
node splits overall:

sysbench --test=fileio --file-num=512 --file-total-size=5G \
  --file-test-mode=seqwr --num-threads=512 \
   --file-block-size=8192 --max-requests=100000 --file-io-mode=sync

Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)

After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)

An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:46 -08:00
Kelley Nielsen 3f870c2899 btrfs: expand btrfs_find_item() to include find_orphan_item functionality
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.

Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:37 -08:00
Kelley Nielsen 75ac2dd907 btrfs: expand btrfs_find_item() to include find_root_ref functionality
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.

Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Kelley Nielsen e33d5c3d6d btrfs: bootstrap generic btrfs_find_item interface
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.

This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Josef Bacik 16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made somethings very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Dulshani Gunawardhana 678712545b btrfs: Fix checkpatch.pl warning of spacing issues
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:31 -05:00
Dulshani Gunawardhana fae7f21cec btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:53 -05:00
Filipe David Borba Manana e8b0d724d5 Btrfs: fix btrfs_prev_leaf() previous key computation
If we decrement the key type, we must reset its offset to the largest
possible offset (u64)-1. If we decrement the key's objectid, then we
must reset the key's type and offset to their largest possible values,
(u8)-1 and (u64)-1 respectively. Not doing so can make us miss an
items in the tree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:26 -05:00
Liu Bo 498456d33e Btrfs: kill unused code in btrfs_search_forward
After commit de78b51a28
(btrfs: remove cache only arguments from defrag path), @blockptr is no more
used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:56 -05:00
Filipe David Borba Manana 6174d3cb43 Btrfs: remove unused max_key arg from btrfs_search_forward
It is not used for anything.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:57 -05:00
Ross Kirk 0a4e558609 btrfs: remove unused parameter from btrfs_header_fsid
Remove unused parameter, 'eb'. Unused since introduction in
5f39d397df

Updated to be rebased against current upstream and correct diff supplied this time!

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:16 -05:00
Josef Bacik d4b4087c43 Btrfs: do a full search everytime in btrfs_search_old_slot
While running some snashot aware defrag tests I noticed I was panicing every
once and a while in key_search.  This is because of the optimization that says
if we find a key at slot 0 it will be at slot 0 all the way down the rest of the
tree.  This isn't the case for btrfs_search_old_slot since it will likely replay
changes to a buffer if something has changed since we took our sequence number.
So short circuit this optimization by setting prev_cmp to -1 every time we call
key_search so we will do our normal binary search.  With this patch I am no
longer seeing the panics I was seeing before.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:51:17 -05:00
Ross Kirk dd3cc16b87 btrfs: drop unused parameter from btrfs_item_nr
Remove unused eb parameter from btrfs_item_nr

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:50:48 -05:00
Josef Bacik 83d4cfd4da Btrfs: fixup error handling in btrfs_reloc_cow
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent.  So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:54 -04:00
Filipe David Borba Manana d7396f0735 Btrfs: optimize key searches in btrfs_search_slot
When the binary search returns 0 (exact match), the target key
will necessarily be at slot 0 of all nodes below the current one,
so in this case the binary search is not needed because it will
always return 0, and we waste time doing it, holding node locks
for longer than necessary, etc.

Below follow histograms with the times spent on the current approach of
doing a binary search when the previous binary search returned 0, and
times for the new approach, which directly picks the first item/child
node in the leaf/node.

Current approach:

Count: 6682
Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
Percentiles:  90th: 124.000; 95th: 145.000; 99th: 206.000
  35.000 -   61.080:  1235 ################
  61.080 -  106.053:  4207 #####################################################
 106.053 -  183.606:  1122 ##############
 183.606 -  317.341:   111 #
 317.341 -  547.959:     6 |
 547.959 - 8370.000:     1 |

Approach proposed by this patch:

Count: 6682
Range:  6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev:  7.160
Percentiles:  90th: 23.000; 95th: 27.000; 99th: 40.000
   6.000 -    8.418:    58 #
   8.418 -   11.670:  1149 #########################
  11.670 -   16.046:  2418 #####################################################
  16.046 -   21.934:  2098 ##############################################
  21.934 -   29.854:   744 ################
  29.854 -   40.511:   154 ###
  40.511 -   54.848:    41 #
  54.848 -   74.136:     5 |
  74.136 -  100.087:     9 |
 100.087 -  135.000:     6 |

These samples were captured during a run of the btrfs tests 001, 002 and
004 in the xfstests, with a leaf/node size of 4Kb.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:42 -04:00
Geert Uytterhoeven b308bc2f05 Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long
Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
casts it to a pointer, while all callers cast it to unsigned long again.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:16 -04:00
Geert Uytterhoeven fba6aa7565 Btrfs: Make btrfs_header_fsid() return unsigned long
Internally, btrfs_header_fsid() calculates an unsigned long, but casts
it to a pointer, while all callers cast it to unsigned long again.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:15 -04:00
Geert Uytterhoeven c1c9ff7c94 Btrfs: Remove superfluous casts from u64 to unsigned long long
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:08 -04:00
Stefan Behrens 35a3621beb Btrfs: get rid of sparse warnings
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.

All these changes should be obviously safe.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:15:50 -04:00
Josef Bacik ba5e8f2e2d Btrfs: fix send issues related to inode number reuse
If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only.  The way
we check for differences are if the leaves aren't the same and if the keys are
not the same within the leaves.  So if neither leaf is the same (ie the leaf has
been cow'ed from the parent snapshot) we walk each item in the send root and
check it against the parent root.  If the items match exactly then we don't do
anything.  This doesn't quite work for inode refs, since they will just have the
name and the parent objectid.  If you move the file from a directory and then
remove that directory and re-create a directory with the same inode number as
the old directory and then move that file back into that directory we will
assume that nothing changed and you will get errors when you try to receive.

In order to fix this we need to do extra checking to see if the inode ref really
is the same or not.  So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match.  Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing.  The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume.  If they match then they are the same directory
and we are good to go.  If they don't we have to add them to the changed refs
list.

This means we have to track the generation of the ref we're trying to lookup
when we iterate all the refs for a particular inode.  So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.

There was also the issue of using a ulist to keep track of the directories we
needed to check.  Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value.  So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.

Before we would fail all of the tests in the far-progs that related to moving
directories (test group 32).  With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:15:44 -04:00
Josef Bacik 9ec7267751 Btrfs: stop using GFP_ATOMIC when allocating rewind ebs
There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:50 -04:00
Josef Bacik db7f3436c1 Btrfs: deal with enomem in the rewind path
We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log.  Instead of BUG_ON()'ing in this case pass up ENOMEM.  I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:49 -04:00
Josef Bacik c8cc634165 Btrfs: stop using GFP_ATOMIC for the tree mod log allocations
Previously we held the tree mod lock when adding stuff because we use it to
check and see if we truly do want to track tree modifications.  This is
admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
hard and often is not nice.  So instead do our basic checks to see if we don't
need to track modifications, and if those pass then do our allocation, and then
when we go to insert the new modification check if we still care, and if we
don't just free up our mod and return.  Otherwise we're good to go and we can
carry on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:17 -04:00
Liu Bo b5b9b5b318 Btrfs: fix extent buffer leak after backref walking
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:42 -04:00
Josef Bacik 7fb7d76f96 Btrfs: only do the tree_mod_log_free_eb if this is our last ref
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
I no longer hit the panic in __tree_mod_log_rewind.  Thanks,

Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:20 -04:00
Josef Bacik f1ca7e98a6 Btrfs: hold the tree mod lock in __tree_mod_log_rewind
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thansk,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:18 -04:00
Josef Bacik 0b08851fda Btrfs: optimize reada_for_balance
This patch does two things.  First we no longer explicitly read in the blocks
we're trying to readahead.  For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there.  Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go.  Hopefully this will cut down on
the number of re-searches.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Josef Bacik bdf7c00e8f Btrfs: optimize read_block_for_search
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified.  The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Liu Bo 33157e05db Btrfs: check if leaf's parent exists before pushing items around
During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:58 -04:00
Liu Bo fdd99c7294 Btrfs: dont do log_removal in insert_new_root
As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.

As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.

As above, insert_new_root can get rid of log_removal.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:57 -04:00
Stefan Behrens 8f69dbd236 Btrfs: fix a comment
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:23 -04:00
Josef Bacik b1c79e0947 Btrfs: handle running extent ops with skinny metadata
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata.  We also lose
the level of the metadata we are working on.  So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:15 -04:00
Eric Sandeen 48a3b6366f btrfs: make static code static & remove dead code
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:23 -04:00
Jan Schmidt fc36ed7e0b Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.

However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.

To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.

The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.

Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:17 -04:00
Josef Bacik 416bc6580b Btrfs: fix all callers of read_tree_block
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly.  You need to check
and make sure the extent_buffer is uptodate before you use it.  This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not.  With this we no longer
leak EB's when things go horribly wrong.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Tsutomu Itoh 4b90c68015 Btrfs: remove unused argument of btrfs_extend_item()
Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:53 -04:00
Tsutomu Itoh afe5fea72b Btrfs: cleanup of function where fixup_low_keys() is called
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Tsutomu Itoh d6a0a12684 Btrfs: remove unused argument of fixup_low_keys()
Argument 'trans' is not used in fixup_low_keys(). So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Jan Schmidt 47fb091fb7 Btrfs: fix unlock after free on rewinded tree blocks
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:48 -04:00
Jan Schmidt 30b0463a93 Btrfs: fix accessing the root pointer in tree mod log functions
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemtion was enabled.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00
Jan Schmidt 90f8d62ebb Btrfs: fix tree mod log regression on root split operations
Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.

This commit removes the erroneous logging of removal of all elements.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00
Josef Bacik 09a2a8f96e Btrfs: fix bad extent logging
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery.  I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem.  The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents.  So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space.  If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later.  The data and such will still work,
but everything else is broken.  This patch fixes this by not allowing extents
that are on the modified list to be merged.  This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
longer creates a bogus file system after replaying the log.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:34 -04:00
Josef Bacik 3173a18f70 Btrfs: add a incompatible format change for smaller metadata extent refs
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree.  This takes up quite a bit of space.  Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster.  This is not an automatic format change, you
must enable it at mkfs time or with btrfstune.  This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:18 -04:00
Jan Schmidt d9abbf1c31 Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
Therefore, those operations must be enclosed by tree_mod_log_write_lock()
and tree_mod_log_write_unlock() calls.

Those calls are private to the tree_mod_log_* functions, which means that
removal of the elements of an old root node must be logged from
tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
(Btrfs: fix a tree mod logging issue for root replacement operations).

This fixes the brand-new version of xfstest 276 as of commit cfe73f71.

Cc: stable@vger.kernel.org
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-03-21 19:31:52 -04:00
Eric Sandeen de78b51a28 btrfs: remove cache only arguments from defrag path
The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell.  Chris says they're dead code, so remove them.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:36 -05:00
Eric Sandeen 1c697d4acc btrfs: annotate intentional switch case fallthroughs
This keeps static checkers happy.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:19 -05:00
Arne Jansen 2a745b14bc Btrfs: fix crash in log replay with qgroups enabled
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-14 20:47:41 -05:00
Chris Mason 57ba86c00f Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
This reverts commit 6a7a665d78.

This was bug was fixed differently in 3.6, so this commit
isn't needed.

Conflicts:
	fs/btrfs/ctree.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-18 19:35:32 -05:00
Chris Mason 4c3e696981 Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
This reverts commit 95c80bb1f6.

The bug addressed by this commit was fixed differently back in 3.6

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-18 15:43:18 -05:00
Josef Bacik 5124e00ec5 Btrfs: only unlock and relock if we have to
I noticed while doing fsync tests that we were always dropping the path and
re-searching when we first cow the log root even though we've already gotten
the write lock on the root.  That's because we don't take into account that
there might not be a parent node, so fix the check to make sure there is
actually a parent node before we undo all of this work for nothing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:27 -05:00
Josef Bacik 41be1f3b40 Btrfs: optimize leaf_space_used
This gets called at least 4 times for every level while adding an object,
and it involves 3 kmapping calls, which on my box take about 5us a piece.
So instead use a token, which brings us down to 1 kmap call and makes this
function take 1/3 of the time per call.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:26 -05:00
Josef Bacik 70c8a91ce2 Btrfs: log changed inodes based on the extent map tree
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree.  So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Josef Bacik d6393786cd Btrfs: add path->really_keep_locks
You'd think path->keep_locks would keep all the locks wouldn't you?  You'd
be wrong.  It only keeps them if the slot is pointing to the last item in
the node.  This is for use with btrfs_next_leaf, which needs this sort of
thing.  But the horrible horrible things I'm going to do to the tree log
means I really need everything held from root to leaf so I can add and
delete items in the same search.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Anand Jain 5f3ab90a72 Btrfs: rename root_times_lock to root_item_lock
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:21 -05:00
Julia Lawall 6c1500f22a fs/btrfs: drop if around WARN_ON
Just use WARN_ON rather than an if containing only WARN_ON(1).

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e;
@@
- if (e) WARN_ON(1);
+ WARN_ON(e);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:24 -05:00
Julia Lawall 31b1a2bd75 fs/btrfs: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:23 -05:00
Liu Bo 32adf09013 Btrfs: cleanup unused arguments
'disk_key' is not used at all.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:35 -05:00
Liu Bo 0e411ecec6 Btrfs: kill unnecessary arguments in del_ptr
The argument 'tree_mod_log' is not necessary since all of callers enable it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:35 -05:00
Liu Bo 6a7a665d78 Btrfs: reorder tree mod log operations in deleting a pointer
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:34 -05:00
Liu Bo 95c80bb1f6 Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).

So we don't need to increase node's number of items during rewinding,
otherwise we may get an node larger than leafsize and cause general protection
errors later.

Here is the details,
- If we do memory move for inserting a single pointer, we need to
  add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.

- If we do memory move for deleting a single pointer, we need to
  decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
  deleting.

- If we do memory move for balance left/right, we need to decrease
  node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:33 -05:00
Liu Bo 7bfdcf7fba Btrfs: fix memory leak when cloning root's node
After cloning root's node, we forgot to dec the src's ref
which can lead to a memory leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-25 15:55:21 -04:00
Jan Schmidt 01763a2e37 Btrfs: comment for loop in tree_mod_log_insert_move
Emphasis the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:40 +02:00
Jan Schmidt d638108484 Btrfs: fix extent buffer reference for tree mod log roots
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is changed now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:39 +02:00
Jan Schmidt 5b6602e762 Btrfs: determine level of old roots
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:38 +02:00
Jan Schmidt 834328a849 Btrfs: tree mod log's old roots could still be part of the tree
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:37 +02:00
Jan Schmidt ba1bfbd592 Btrfs: fix a tree mod logging issue for root replacement operations
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in
two places. Where needed, we call tree_mod_log_free_eb explicitly now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23 15:09:14 +02:00
Jan Schmidt 57911b8ba8 Btrfs: don't put removals from push_node_left into tree mod log twice
Independant of the check (push_items < src_items) tree_mod_log_eb_copy did
log the removal of the old data entries from the source buffer. Therefore,
we must not call tree_mod_log_eb_move if the check evaluates to true, as
that would log the removal twice, finally resulting in (rewinded) buffers
with wrong values for header_nritems.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23 15:09:11 +02:00
Robin Dong 8d1a1317af btrfs: remove unused function btrfs_insert_some_items()
The function btrfs_insert_some_items() would not be called by any other functions,
so remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
2012-10-09 09:15:43 -04:00
Chris Mason 74dd17fbe3 Btrfs: fix btrfs send for inline items and compression
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk.  If we're compressed, this isn't
true, and so it was off into extents owned by other files.

It was also improperly handling inline extents.  This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent.  It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-01 15:19:00 -04:00
Chris Mason b12a3b1ea2 Btrfs: don't run __tree_mod_log_free_eb on leaves
When we split a leaf, we may end up inserting a new root on top of that
leaf.  The reflog code was incorrectly assuming the old root was always
a node.  This makes sure we skip over leaves.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:34 -04:00
Arne Jansen 1fa11e265f Btrfs: fix deadlock in wait_for_more_refs
Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-08-28 16:53:32 -04:00
Chris Mason 113c1cb530 Merge branch 'send-v2' of git://github.com/ablock84/linux-btrfs into for-linus
This is the kernel portion of btrfs send/receive

Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/backref.h
	fs/btrfs/ctree.c
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 19:19:10 -04:00
Alexander Block 7069830a9e Btrfs: add btrfs_compare_trees function
This function is used to find the differences between
two trees. The tree compare skips whole subtrees if it
detects shared tree blocks and thus is pretty fast.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
2012-07-25 23:30:14 +02:00
Arne Jansen e679376911 Btrfs: add helper for tree enumeration
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-25 17:33:18 +02:00
Arne Jansen 2f38b3e190 Btrfs: add helper for tree enumeration
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-07-10 15:14:42 +02:00
Jan Schmidt 097b8a7c9e Btrfs: join tree mod log code with the code holding back delayed refs
We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.

Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-10 15:14:41 +02:00
Jan Schmidt cf5388307a Btrfs: fix buffer leak in btrfs_next_old_leaf
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-07-10 15:14:41 +02:00
Jan Schmidt d42244a0d3 Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
With the tree mod log, we may end up with two roots (the current root and a
rewinded version of it) both pointing to two leaves, l1 and l2, of which l2
had already been cow-ed in the current transaction. If we don't rewind any
tree blocks, we cannot have two roots both pointing to an already cowed tree
block.

Now there is btrfs_next_leaf, which has a leaf locked and wants a lock on
the next (right) leaf. And there is push_leaf_left, which has a (cowed!)
leaf locked and wants a lock on the previous (left) leaf.

In order to solve this dead lock situation, we use try_lock in
btrfs_next_leaf (only in case it's called with a tree mod log time_seq
paramter) and if we fail to get a lock on the next leaf, we give up our lock
on the current leaf and retry from the very beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:40 +02:00
Jan Schmidt 19956c7e94 Btrfs: fix tree mod log rewind of ADD operations
When a MOD_LOG_KEY_ADD operation is rewinded, we remove the key from the
tree block. If its not the last key, removal involves a move operation.
This move operation was explicitly done before this commit.

However, at insertion time, there's a move operation before the actual
addition to make room for the new key, which is recorded in the tree mod
log as well. This means, we must drop the move operation when rewinding the
add operation, because the next operation we'll be rewinding will be the
corresponding MOD_LOG_MOVE_KEYS operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:40 +02:00
Jan Schmidt c3e0696523 Btrfs: always put insert_ptr modifications into the tree mod log
Several callers of insert_ptr set the tree_mod_log parameter to 0 to avoid
addition to the tree mod log. In fact, we need all of those operations. This
commit simply removes the additional parameter and makes addition to the
tree mod log unconditional.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:39 +02:00
Jan Schmidt 28da9fb446 Btrfs: fix tree mod log for root replacements at leaf level
For the tree mod log, we don't log any operations at leaf level. If the root
is at the leaf level (i.e. the tree consists only of the root), then
__tree_mod_log_oldest_root will find a ROOT_REPLACE operation in the log
(because we always log that one no matter which level), but no other
operations.

With this patch __tree_mod_log_oldest_root exits cleanly instead of
BUGging in this situation. get_old_root checks if its really a root at leaf
level in case we don't have any operations and WARNs if this assumption
breaks.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-27 16:34:38 +02:00
Chris Mason 4325edd078 Btrfs: init old_generation in get_old_root
gcc was giving an uninit variable warning here.  Strictly
speaking we don't need to init it, but this will make things
much less error prone.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-06-15 20:06:54 -04:00
Jan Schmidt 3310c36eef Btrfs: fix race in tree mod log addition
When adding to the tree modification log, we grab two locks at different
stages. We must not drop the outer lock until we're done with section
protected by the inner lock. This moves the unlock call for the outer lock
to the appropriate position.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-14 18:52:39 +02:00
Jan Schmidt 3d7806eca4 Btrfs: add btrfs_next_old_leaf
To make sense of the tree mod log, the backref walker not only needs
btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was
calling btrfs_search_slot. This obviously didn't give the correct result.

This commit adds btrfs_next_old_leaf, a drop-in replacement for
btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
with this time_seq parameter.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-14 18:52:09 +02:00
Jan Schmidt a95236d99f Btrfs: fix return value for __tree_mod_log_oldest_root
In __tree_mod_log_oldest_root() we must return the found operation even if
it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there
are no operations to be rewinded and returns immediately.

The code in the caller is modified to improve readability.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-14 18:44:22 +02:00
Jan Schmidt 8ba97a15e7 Btrfs: use btrfs_read_lock_root_node in get_old_root
get_old_root could race with root node updates because we weren't locking
the node early enough. Use btrfs_read_lock_root_node to grab the root locked
in the very beginning and release the lock as soon as possible (just like
btrfs_search_slot does).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-14 18:44:21 +02:00
Jan Schmidt 4d5a0565ce Btrfs: remove call to btrfs_header_nritems with no effect
This is a leftover from cleanup patch 559af821. Before the cleanup,
btrfs_header_nritems was called inside an if condition. As it has no side
effects we need to preserve here, it should simply be dropped.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-06-04 14:35:29 +02:00
Chris Mason 1e20932a23 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/ulist.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-31 16:49:53 -04:00
Jan Schmidt c31931088f Btrfs: fix tree mod log rewinded level and rewinding of moved keys
When we rewind REMOVE_WHILE_FREEING operations, there's code that allocates
a fresh buffer instead of cloning the old one. Setting that buffer's level
correctly was missing in this case.

When rewinding a MOVE_KEYS operation, btrfs_node_key_ptr_offset(slot) was
missing for memmove_extent_buffer()'s arguments.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-31 19:56:19 +02:00
Jan Schmidt f395694c2c Btrfs: fix tree mod log del_ptr
Logging for del_ptr when we're not deleting the last pointer was wrong. This
fixes both, duplicate log entries and log sequence.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-31 19:56:19 +02:00
Jan Schmidt e9b7fd4d8b Btrfs: add tree_mod_dont_log helper
Replace duplicate code by small inline helper function.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-31 19:56:18 +02:00
Jan Schmidt 926dd8a640 Btrfs: add missing spin_lock for insertion into tree mod log
tree_mod_alloc calls __get_tree_mod_seq and must acquire a spinlock before
doing so.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-31 19:56:18 +02:00
Tsutomu Itoh 018642a1f1 Btrfs: return value of btrfs_read_buffer is checked correctly
btrfs_read_buffer() has the possibility of returning the error.
Therefore, I add the code in which the return value of btrfs_read_buffer()
is checked.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-05-30 10:23:41 -04:00
Jan Schmidt 5d9e75c41d Btrfs: add btrfs_search_old_slot
The tree modification log together with the current state of the tree gives
a consistent, old version of the tree. btrfs_search_old_slot is used to
search through this old version and return old (dummy!) extent buffers.
Naturally, this function cannot do any tree modifications.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 15:17:33 +02:00
Jan Schmidt f3ea38da3e Btrfs: add del_ptr and insert_ptr modifications to the tree mod log
Record all relevant modifications to block pointers in the tree mod log so
that we can rewind them later on for backref walking.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 15:17:32 +02:00
Jan Schmidt f230475e62 Btrfs: put all block modifications into the tree mod log
When running functions that can make changes to the internal trees
(e.g. btrfs_search_slot), we check if somebody may be interested in the
block we're currently modifying. If so, we record our modification to be
able to rewind it later on.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 15:17:29 +02:00
Jan Schmidt bd989ba359 Btrfs: add tree modification log functions
The tree mod log will log modifications made fs-tree nodes. Most
modifications are done by autobalance of the tree. Such changes are recorded
as long as a block entry exists. When released, the log is cleaned.

With the tree modification log, it's possible to reconstruct a consistent
old state of the tree. This is required to do backref walking on a busy
file system.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-30 15:17:01 +02:00
Jan Schmidt 5581a51a59 Btrfs: don't set for_cow parameter for tree block functions
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.

Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-05-26 12:17:53 +02:00
Wang Sheng-Hui f775738f6f btrfs/ctree.c: remove the unnecessary 'return -1;' at the end of bin_search
The code path should not reach there. Remove it.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
2012-05-11 10:56:37 -04:00
Chris Mason b9fab919b7 Btrfs: avoid sleeping in verify_parent_transid while atomic
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.

But, a few callers are using it with spinlocks held.  Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case.  This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-06 07:23:47 -04:00
Chris Mason e5846fc665 Btrfs: Add properly locking around add_root_to_dirty_list
add_root_to_dirty_list happens once at the very beginning of the
transaction, but it is still racey.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-04 15:14:11 -04:00
Chris Mason 1d4284bd6e Merge branch 'error-handling' into for-linus
Conflicts:
	fs/btrfs/ctree.c
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/inode.c
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:31:37 -04:00
Chris Mason f7c79f30cb Btrfs: adjust the write_lock_level as we unlock
btrfs_search_slot sometimes needs write locks on high levels of
the tree.  It remembers the highest level that needs a write lock
and will use that for all future searches through the tree in a given
call.

But, very often we'll just cow the top level or the level below and we
won't really need write locks on the root again after that.  This patch
changes things to adjust the write lock requirement as it unlocks
levels.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 17:04:24 -04:00
Chris Mason cfed81a04e Btrfs: add the ability to cache a pointer into the eb
This cuts down on the CPU time used by map_private_extent_buffer

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 17:04:23 -04:00
Josef Bacik 3083ee2e18 Btrfs: introduce free_extent_buffer_stale
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory.  So instead of evicting these pages, we
could end up evicting things we actually care about.  Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks.  This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 16:51:08 -04:00
Jeff Mahoney 79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Mark Fasheh 305a26af5b btrfs: Go readonly on tree errors in balance_level
balace_level() seems to deal with missing tree nodes by BUG_ON(). Instead,
we can easily just set the file system readonly and bubble -EROFS back up
the stack.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2012-03-22 01:45:38 +01:00
Mark Fasheh b68dc2a93e btrfs: Don't BUG_ON errors from update_ref_for_cow()
__btrfs_cow_block(), the only caller of update_ref_for_cow() will BUG_ON()
any error return.  Instead, we can go read-only fs as update_ref_for_cow()
manipulates disk data in a way which doesn't look like it's easily rolled
back.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:38 +01:00
Mark Fasheh e5df957328 btrfs: Go readonly on bad extent refs in update_ref_for_cow()
update_ref_for_cow() will BUG_ON() after it's call to
btrfs_lookup_extent_info() if no existing references are found.  Since refs
are computed directly from disk, this should be treated as a corruption
instead of a logic error.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:37 +01:00
Mark Fasheh be1a5564fd btrfs: Don't BUG_ON() errors in update_ref_for_cow()
The only caller of update_ref_for_cow() is __btrfs_cow_block() which was
originally ignoring any return values. update_ref_for_cow() however doesn't
look like a candidate to become a void function - there are a few places
where errors can occur.

So instead I changed update_ref_for_cow() to bubble all errors up (instead
of BUG_ON). __btrfs_cow_block() was then updated to catch and BUG_ON() any
errors from update_ref_for_cow(). The end effect is that we have no change
in behavior, but about 8 different places where a BUG_ON(ret) was removed.

Obviously a future patch will have to address the BUG_ON() in
__btrfs_cow_block().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:36 +01:00
Jeff Mahoney 143bede527 btrfs: return void in functions without error conditions
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Arne Jansen 66d7e7f09f Btrfs: mark delayed refs as for cow
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-12-22 16:22:27 +01:00
Liu Bo f1ebcc74d5 Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases to lead to tree corruptions:

1) multi-thread snapshots can commit serveral snapshots in a transaction,
   and this may change the src root when processing the following pending
   snapshots, which lead to the former snapshots corruptions;

2) the free inode cache was changing the roots when it root the cache,
   which lead to corruptions.

This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-15 09:53:28 -05:00
Li Zefan a05a9bb18a Btrfs: fix array bound checking
Otherwise we can execced the array bound of path->slots[].

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:41 +02:00
Chris Mason 31533fb263 Btrfs: remove lockdep magic from btrfs_next_leaf
Before the reader/writer locks, btrfs_next_leaf needed to keep
the path blocking to avoid making lockdep upset.

Now that btrfs_next_leaf only takes read locks, this isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:47 -04:00
Chris Mason bd681513fa Btrfs: switch the btrfs tree locks to reader/writer
The btrfs metadata btree is the source of significant
lock contention, especially in the root node.   This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking.  Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster.  In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:46 -04:00
Chris Mason a65917156e Btrfs: stop using highmem for extent_buffers
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:45 -04:00
Josef Bacik 25b8b936ed Btrfs: don't map extent buffer if path->skip_locking is set
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search.  He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked.  So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set.  Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore.  Thanks,

Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-06-09 10:12:07 -04:00
Chris Mason ff5714cca9 Merge branch 'for-chris' of
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-28 07:00:39 -04:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Tsutomu Itoh 1cd307990d Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Josef Bacik 026fd31782 Btrfs: don't always do readahead
Our readahead is sort of sloppy, and really isn't always needed.  For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat.  Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes.  This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:14 -04:00
Josef Bacik 7e2355ba1a Btrfs: don't look at the extent buffer level 3 times in a row
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing.  I don't think I've
ever seen this tripped, so kill it.  This saves us 2 kmap's per level in our
search.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:11 -04:00
Josef Bacik cb25c2ea6a Btrfs: map the node block when looking for readahead targets
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_.  So go ahead and map the
extent buffer while we look for readahead targets.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:10 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Miao Xie 16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
David Sterba b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba 62a45b6092 btrfs: make functions static when possible
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba edc95aec57 btrfs: remove nested duplicate variable declarations
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Tsutomu Itoh 97d9a8a420 Btrfs: check return value of read_tree_block()
This patch is checking return value of read_tree_block(),
and if it is NULL, error processing.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:37 -04:00
Tsutomu Itoh db5b493ac7 Btrfs: cleanup some BUG_ON()
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:35 -04:00
liubo 1abe9b8a13 Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
              dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
              dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
        btrfs_ordered_extent_add
        btrfs_ordered_extent_remove
        btrfs_ordered_extent_start
        btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
        btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
        __extent_writepage
        btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
        btrfs_inode_new
        btrfs_inode_request
        btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
        btrfs_sync_file
        btrfs_sync_fs

These show sync arguments.

6) transaction:
        btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:33 -04:00
Chris Mason 240f62c875 Btrfs: use RCU instead of a spinlock to protect the root node
The pointer to the extent buffer for the root of each tree
is protected by a spinlock so that we can safely read the pointer
and take a reference on the extent buffer.

But now that the extent buffers are freed via RCU, we can safely
use rcu_read_lock instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:22 -04:00
Josef Bacik a826d6dcb3 Btrfs: check items for correctness as we search
Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data the are responsible for.  If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either.  This
will catch generic corruptions, it will be up to the individual callers of
btrfs_search_slot to make sure their items are right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:37 -04:00
Tsutomu Itoh 91ca338d77 btrfs: check NULL or not
Should check if functions returns NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Jesper Juhl ff175d57f0 btrfs: Don't pass NULL ptr to func that may deref it.
Hi,

In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:

...
 		if (!path) {
 			err = -ENOMEM;
 			goto out;
 		}
...
 	out:
 		btrfs_free_path(path);
 		return err;

btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.

There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Andi Kleen 559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Chris Mason cb44921a09 Btrfs: don't loop forever on bad btree blocks
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.

This changes things to properly kick out the EIO.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:31:30 -04:00
Chris Mason 99d8f83c98 Btrfs: fix split_leaf double split corner case
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice.  This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.

If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:14:50 -04:00
Yan, Zheng 5bdd3536cb Btrfs: Fix block generation verification race
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:35:33 -04:00
Yan, Zheng 3fd0a5585e Btrfs: Metadata ENOSPC handling for balance
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:

1. Avoid COW tree leave in the phrase of merging tree.

2. Handle interaction with snapshot creation.

3. make the backref cache can live across transactions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:54 -04:00
Yan, Zheng f0486c68e4 Btrfs: Introduce contexts for metadata reservation
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Linus Torvalds 795d580bae Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: add check for changed leaves in setup_leaf_for_split
  Btrfs: create snapshot references in same commit as snapshot
  Btrfs: fix small race with delalloc flushing waitqueue's
  Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
  Btrfs: fix chunk allocate size calculation
  Btrfs: kill max_extent mount option
  Btrfs: fail to mount if we have problems reading the block groups
  Btrfs: check btrfs_get_extent return for IS_ERR()
  Btrfs: handle kmalloc() failure in inode lookup ioctl
  Btrfs: dereferencing freed memory
  Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
  Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
  Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
  Btrfs: add NULL check for do_walk_down()
  Btrfs: remove duplicate include in ioctl.c

Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
2010-04-05 13:21:15 -07:00
Chris Mason 109f6aef5f Btrfs: add check for changed leaves in setup_leaf_for_split
setup_leaf_for_split needs to drop the path and search again, and has
checks to see if the item we want to split changed size.  But, it misses
the case where the leaf changed and now has enough room for the item
we want to insert.

This adds an extra check to make sure the leaf really needs splitting
before we call btrfs_split_leaf(), which keeps us from trying to split
a leaf with a single item.

btrfs_split_leaf() will blindly split the single item leaf, leaving us
with one good leaf and one empty leaf and then a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-04-05 14:42:01 -04:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Yan, Zheng 86b9f2eca5 Btrfs: Fix per root used space accounting
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:35 -05:00
Yan, Zheng ad48fd7546 Btrfs: Add btrfs_duplicate_item
btrfs_duplicate_item duplicates item with new key, guaranteeing
the source item and the new items are in the same tree leaf and
contiguous. It allows us to split file extent in place, without
using lock_extent to prevent bookend extent race.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15 21:24:25 -05:00
Yan, Zheng a571952143 Btrfs: check size of inode backref before adding hardlink
For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in single b-tree item. The size of b-tree item
is limited by the size of b-tree leaf, so we can only create limited
number of hardlinks to a given file in a directory.

The original code lacks of the check, it oops if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
check to btrfs_link and btrfs_rename.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24 09:17:31 -04:00
Yan Zheng d717aa1d31 Btrfs: Avoid delayed reference update looping
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.

There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 12:42:46 -04:00
Yan Zheng 0a4eefbb74 Btrfs: Fix ordering of key field checks in btrfs_previous_item
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Diego Calleja 20736abaa3 Btrfs: Remove code duplication in comp_keys
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:46 -04:00
Yan Zheng 33c66f430b Btrfs: fix locking issue in btrfs_find_next_key
When walking up the tree, btrfs_find_next_key assumes the upper level tree
block is properly locked. This isn't always true even path->keep_locks is 1.
This is because btrfs_find_next_key may advance path->slots[] several times
instead of only once.

When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
we can't guarantee the original value of 'path->slots[level]' is
'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
'level + 1' isn't locked.

This patch fixes the issue by explicitly checking the locking state,
re-searching the tree if it's not locked.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng e457afec60 Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
if 1 is returned by btrfs_search_slot, the path already points to the
first item with 'key > searching key'. So increasing path->slots[0] by
one is superfluous in that case.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Chris Mason cfbb930846 Btrfs: balance btree more often
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done.  This commit
makes us more aggressively balance leaves and nodes as they become
less full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Chris Mason b361242102 Btrfs: stop avoiding balancing at the end of the transaction.
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.

With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore.  So, the balance avoidance is no longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Yan Zheng 5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Chris Mason 76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason 8c594ea81d Btrfs: use the right node in reada_for_balance
reada_for_balance was using the wrong index into the path node array,
so it wasn't reading the right blocks.  We never directly used the
results of the read done by this function because the btree search is
started over at the end.

This fixes reada_for_balance to reada in the correct node and to
avoid searching past the last slot in the node.  It also makes sure to
hold the parent lock while we are finding the nodes to read.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-20 15:53:09 -04:00
Stoyan Gaydarov c293498be6 Btrfs: BUG to BUG_ON changes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 17:05:11 -04:00
Chris Mason 8e73f27501 Btrfs: Optimize locking in btrfs_next_leaf()
btrfs_next_leaf was using blocking locks when it could have been using
faster spinning ones instead.  This adds a few extra checks around
the pieces that block and switches over to spinning locks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:14:18 -04:00
Chris Mason c8c42864f6 Btrfs: break up btrfs_search_slot into smaller pieces
btrfs_search_slot was doing too many things at once.  This breaks
it up into more reasonable units.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-03 10:14:18 -04:00
Chris Mason a4b6e07d1a Btrfs: limit balancing work while flushing delayed refs
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.

This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit.  It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.

The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs.  Over the long term we'll make a
special log for the last delayed ref updates as well.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason b9473439d3 Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying.  This may require a kmalloc and it
was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer.  Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason 44871b1b24 Btrfs: reduce stack usage in some crucial tree balancing functions
Many of the tree balancing functions follow the same pattern.

1) cow a block
2) do something to the result

This commit breaks them up into two functions so the variables and
code required for part two don't suck down stack during part one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason 56bec294de Btrfs: do extent allocation and reference count updates in the background
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem.  For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot.  btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations.  The tree is ordered by byte number of the extent and byte number
of the parent for the back reference.  The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction.  Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason 9fa8cfe706 Btrfs: don't preallocate metadata blocks during btrfs_search_slot
In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.

A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason b9447ef80b Btrfs: fix spinlock assertions on UP systems
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places.  But, it wasn't correct for UP compiled
kernels.

This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-09 11:45:38 -04:00
Chris Mason 4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Jeff Mahoney e00f730865 Btrfs: remove btrfs_init_path
btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:11:25 -05:00
Jeff Mahoney 7951f3cefb Btrfs: balance_level checks !child after access
The BUG_ON() is in the wrong spot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:06:15 -05:00