2007-06-12 21:07:21 +08:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public
|
|
|
|
* License v2 as published by the Free Software Foundation.
|
|
|
|
*
|
|
|
|
* This program is distributed in the hope that it will be useful,
|
|
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
* General Public License for more details.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public
|
|
|
|
* License along with this program; if not, write to the
|
|
|
|
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
|
|
|
* Boston, MA 021110-1307, USA.
|
|
|
|
*/
|
2007-07-11 22:00:37 +08:00
|
|
|
#include <linux/sched.h>
|
2007-12-22 05:27:24 +08:00
|
|
|
#include <linux/pagemap.h>
|
2008-04-29 03:29:52 +08:00
|
|
|
#include <linux/writeback.h>
|
2008-08-12 21:13:26 +08:00
|
|
|
#include <linux/blkdev.h>
|
2009-02-04 22:23:45 +08:00
|
|
|
#include <linux/sort.h>
|
2009-03-11 00:39:20 +08:00
|
|
|
#include <linux/rcupdate.h>
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
#include <linux/kthread.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2011-06-14 18:52:17 +08:00
|
|
|
#include <linux/ratelimit.h>
|
2008-11-20 23:22:27 +08:00
|
|
|
#include "compat.h"
|
2007-12-11 22:25:06 +08:00
|
|
|
#include "hash.h"
|
2007-02-26 23:40:21 +08:00
|
|
|
#include "ctree.h"
|
|
|
|
#include "disk-io.h"
|
|
|
|
#include "print-tree.h"
|
2007-03-17 04:20:31 +08:00
|
|
|
#include "transaction.h"
|
2008-03-25 03:01:56 +08:00
|
|
|
#include "volumes.h"
|
2008-06-26 04:01:30 +08:00
|
|
|
#include "locking.h"
|
2009-04-03 21:47:43 +08:00
|
|
|
#include "free-space-cache.h"
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
/* control flags for do_chunk_alloc's force field
|
|
|
|
* CHUNK_ALLOC_NO_FORCE means to only allocate a chunk
|
|
|
|
* if we really need one.
|
|
|
|
*
|
|
|
|
* CHUNK_ALLOC_FORCE means it must try to allocate one
|
|
|
|
*
|
|
|
|
* CHUNK_ALLOC_LIMITED means to only try and allocate one
|
|
|
|
* if we have very few chunks already allocated. This is
|
|
|
|
* used as part of the clustering code to help make sure
|
|
|
|
* we have a good pool of storage to cluster in, without
|
|
|
|
* filling the FS with empty chunks
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
CHUNK_ALLOC_NO_FORCE = 0,
|
|
|
|
CHUNK_ALLOC_FORCE = 1,
|
|
|
|
CHUNK_ALLOC_LIMITED = 2,
|
|
|
|
};
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/*
|
|
|
|
* Control how reservations are dealt with.
|
|
|
|
*
|
|
|
|
* RESERVE_FREE - freeing a reservation.
|
|
|
|
* RESERVE_ALLOC - allocating space and we need to update bytes_may_use for
|
|
|
|
* ENOSPC accounting
|
|
|
|
* RESERVE_ALLOC_NO_ACCOUNT - allocating space and we should not update
|
|
|
|
* bytes_may_use as the ENOSPC accounting is done elsewhere
|
|
|
|
*/
|
|
|
|
enum {
|
|
|
|
RESERVE_FREE = 0,
|
|
|
|
RESERVE_ALLOC = 1,
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT = 2,
|
|
|
|
};
|
|
|
|
|
2008-11-13 03:19:50 +08:00
|
|
|
static int update_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 bytenr, u64 num_bytes, int alloc);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner_objectid,
|
|
|
|
u64 owner_offset, int refs_to_drop,
|
|
|
|
struct btrfs_delayed_extent_op *extra_op);
|
|
|
|
static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_item *ei);
|
|
|
|
static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins, int ref_mod);
|
|
|
|
static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, struct btrfs_disk_key *key,
|
|
|
|
int level, struct btrfs_key *ins);
|
2009-02-21 00:00:09 +08:00
|
|
|
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *extent_root, u64 alloc_bytes,
|
|
|
|
u64 flags, int force);
|
2009-09-12 04:11:19 +08:00
|
|
|
static int find_next_key(struct btrfs_path *path, int level,
|
|
|
|
struct btrfs_key *key);
|
2009-09-12 04:12:44 +08:00
|
|
|
static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
|
|
|
|
int dump_block_groups);
|
2011-07-27 05:00:46 +08:00
|
|
|
static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes, int reserve);
|
2009-02-21 00:00:09 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
static noinline int
|
|
|
|
block_group_cache_done(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
smp_mb();
|
|
|
|
return cache->cached == BTRFS_CACHE_FINISHED;
|
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
static int block_group_bits(struct btrfs_block_group_cache *cache, u64 bits)
|
|
|
|
{
|
|
|
|
return (cache->flags & bits) == bits;
|
|
|
|
}
|
|
|
|
|
2011-04-20 21:52:26 +08:00
|
|
|
static void btrfs_get_block_group(struct btrfs_block_group_cache *cache)
|
2009-11-14 04:12:59 +08:00
|
|
|
{
|
|
|
|
atomic_inc(&cache->count);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
if (atomic_dec_and_test(&cache->count)) {
|
|
|
|
WARN_ON(cache->pinned > 0);
|
|
|
|
WARN_ON(cache->reserved > 0);
|
2011-03-29 13:46:06 +08:00
|
|
|
kfree(cache->free_space_ctl);
|
2009-11-14 04:12:59 +08:00
|
|
|
kfree(cache);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-11-14 04:12:59 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* this adds the block group to the fs_info rb tree for the block group
|
|
|
|
* cache
|
|
|
|
*/
|
2008-12-02 22:54:17 +08:00
|
|
|
static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *block_group)
|
|
|
|
{
|
|
|
|
struct rb_node **p;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
p = &info->block_group_cache_tree.rb_node;
|
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
|
|
|
cache = rb_entry(parent, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
if (block_group->key.objectid < cache->key.objectid) {
|
|
|
|
p = &(*p)->rb_left;
|
|
|
|
} else if (block_group->key.objectid > cache->key.objectid) {
|
|
|
|
p = &(*p)->rb_right;
|
|
|
|
} else {
|
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
rb_link_node(&block_group->cache_node, parent, p);
|
|
|
|
rb_insert_color(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree);
|
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This will return the block group at or after bytenr if contains is 0, else
|
|
|
|
* it will return the block group that contains the bytenr
|
|
|
|
*/
|
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
block_group_cache_tree_search(struct btrfs_fs_info *info, u64 bytenr,
|
|
|
|
int contains)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache, *ret = NULL;
|
|
|
|
struct rb_node *n;
|
|
|
|
u64 end, start;
|
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
n = info->block_group_cache_tree.rb_node;
|
|
|
|
|
|
|
|
while (n) {
|
|
|
|
cache = rb_entry(n, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
end = cache->key.objectid + cache->key.offset - 1;
|
|
|
|
start = cache->key.objectid;
|
|
|
|
|
|
|
|
if (bytenr < start) {
|
|
|
|
if (!contains && (!ret || start < ret->key.objectid))
|
|
|
|
ret = cache;
|
|
|
|
n = n->rb_left;
|
|
|
|
} else if (bytenr > start) {
|
|
|
|
if (contains && bytenr <= end) {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
n = n->rb_right;
|
|
|
|
} else {
|
|
|
|
ret = cache;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2008-12-12 05:30:39 +08:00
|
|
|
if (ret)
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(ret);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int add_excluded_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 num_bytes)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
u64 end = start + num_bytes - 1;
|
|
|
|
set_extent_bits(&root->fs_info->freed_extents[0],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
set_extent_bits(&root->fs_info->freed_extents[1],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static void free_excluded_extents(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
u64 start, end;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
start = cache->key.objectid;
|
|
|
|
end = start + cache->key.offset - 1;
|
|
|
|
|
|
|
|
clear_extent_bits(&root->fs_info->freed_extents[0],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
|
|
|
clear_extent_bits(&root->fs_info->freed_extents[1],
|
|
|
|
start, end, EXTENT_UPTODATE, GFP_NOFS);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int exclude_super_stripes(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
|
|
|
u64 *logical;
|
|
|
|
int stripe_len;
|
|
|
|
int i, nr, ret;
|
|
|
|
|
2009-11-26 17:31:11 +08:00
|
|
|
if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
|
|
|
|
stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
|
|
|
|
cache->bytes_super += stripe_len;
|
|
|
|
ret = add_excluded_extent(root, cache->key.objectid,
|
|
|
|
stripe_len);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
|
|
|
|
bytenr = btrfs_sb_offset(i);
|
|
|
|
ret = btrfs_rmap_block(&root->fs_info->mapping_tree,
|
|
|
|
cache->key.objectid, bytenr,
|
|
|
|
0, &logical, &nr, &stripe_len);
|
|
|
|
BUG_ON(ret);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
while (nr--) {
|
2009-09-12 04:11:20 +08:00
|
|
|
cache->bytes_super += stripe_len;
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = add_excluded_extent(root, logical[nr],
|
|
|
|
stripe_len);
|
|
|
|
BUG_ON(ret);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
kfree(logical);
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static struct btrfs_caching_control *
|
|
|
|
get_caching_control(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (cache->cached != BTRFS_CACHE_STARTED) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2010-09-17 04:17:03 +08:00
|
|
|
/* We're loading it the fast way, so we don't have a caching_ctl. */
|
|
|
|
if (!cache->caching_ctl) {
|
|
|
|
spin_unlock(&cache->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
ctl = cache->caching_ctl;
|
|
|
|
atomic_inc(&ctl->count);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
return ctl;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void put_caching_control(struct btrfs_caching_control *ctl)
|
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&ctl->count))
|
|
|
|
kfree(ctl);
|
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* this is only called by cache_block_group, since we could have freed extents
|
|
|
|
* we need to check the pinned_extents for any extents that can't be used yet
|
|
|
|
* since their free space will be released as soon as the transaction commits.
|
|
|
|
*/
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_fs_info *info, u64 start, u64 end)
|
|
|
|
{
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
u64 extent_start, extent_end, size, total_added = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (start < end) {
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = find_first_extent_bit(info->pinned_extents, start,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
&extent_start, &extent_end,
|
2009-09-12 04:11:19 +08:00
|
|
|
EXTENT_DIRTY | EXTENT_UPTODATE);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
2009-11-26 17:31:11 +08:00
|
|
|
if (extent_start <= start) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
start = extent_end + 1;
|
|
|
|
} else if (extent_start > start && extent_start < end) {
|
|
|
|
size = extent_start - start;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_added += size;
|
2008-11-21 01:16:16 +08:00
|
|
|
ret = btrfs_add_free_space(block_group, start,
|
|
|
|
size);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
start = extent_end + 1;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (start < end) {
|
|
|
|
size = end - start;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_added += size;
|
2008-11-21 01:16:16 +08:00
|
|
|
ret = btrfs_add_free_space(block_group, start, size);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return total_added;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
static noinline void caching_thread(struct btrfs_work *work)
|
2007-05-10 08:13:14 +08:00
|
|
|
{
|
2011-07-01 02:42:28 +08:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
struct btrfs_root *extent_root;
|
2007-05-10 08:13:14 +08:00
|
|
|
struct btrfs_path *path;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
u64 total_found = 0;
|
2009-09-12 04:11:19 +08:00
|
|
|
u64 last = 0;
|
|
|
|
u32 nritems;
|
|
|
|
int ret = 0;
|
2007-10-16 04:14:48 +08:00
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
caching_ctl = container_of(work, struct btrfs_caching_control, work);
|
|
|
|
block_group = caching_ctl->block_group;
|
|
|
|
fs_info = block_group->fs_info;
|
|
|
|
extent_root = fs_info->extent_root;
|
|
|
|
|
2007-05-10 08:13:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
2011-07-01 02:42:28 +08:00
|
|
|
goto out;
|
2007-09-15 04:15:28 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2008-06-26 04:01:30 +08:00
|
|
|
/*
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
* We don't want to deadlock with somebody trying to allocate a new
|
|
|
|
* extent for the extent root while also trying to search the extent
|
|
|
|
* root to add free space. So we skip locking and search the commit
|
|
|
|
* root, since its read-only
|
2008-06-26 04:01:30 +08:00
|
|
|
*/
|
|
|
|
path->skip_locking = 1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
path->search_commit_root = 1;
|
2011-05-13 22:32:11 +08:00
|
|
|
path->reada = 1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-12-12 23:03:26 +08:00
|
|
|
key.objectid = last;
|
2007-05-10 08:13:14 +08:00
|
|
|
key.offset = 0;
|
2009-09-12 04:11:19 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-08-01 02:57:55 +08:00
|
|
|
again:
|
2009-09-12 04:11:19 +08:00
|
|
|
mutex_lock(&caching_ctl->mutex);
|
2009-08-01 02:57:55 +08:00
|
|
|
/* need to make sure the commit_root doesn't disappear */
|
|
|
|
down_read(&fs_info->extent_commit_sem);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
2007-05-10 08:13:14 +08:00
|
|
|
if (ret < 0)
|
2008-09-24 01:14:11 +08:00
|
|
|
goto err;
|
2008-12-09 05:46:26 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2011-06-01 00:07:27 +08:00
|
|
|
if (btrfs_fs_closing(fs_info) > 1) {
|
2009-07-28 20:41:57 +08:00
|
|
|
last = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
break;
|
2009-07-28 20:41:57 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (path->slots[0] < nritems) {
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
} else {
|
|
|
|
ret = find_next_key(path, 0, &key);
|
|
|
|
if (ret)
|
2007-05-10 08:13:14 +08:00
|
|
|
break;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2011-05-12 05:30:53 +08:00
|
|
|
if (need_resched() ||
|
|
|
|
btrfs_next_leaf(extent_root, path)) {
|
|
|
|
caching_ctl->progress = last;
|
2011-05-28 19:00:39 +08:00
|
|
|
btrfs_release_path(path);
|
2011-05-12 05:30:53 +08:00
|
|
|
up_read(&fs_info->extent_commit_sem);
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
2009-09-12 04:11:19 +08:00
|
|
|
cond_resched();
|
2011-05-12 05:30:53 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
continue;
|
2009-09-12 04:11:19 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (key.objectid < block_group->key.objectid) {
|
|
|
|
path->slots[0]++;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
continue;
|
2007-05-10 08:13:14 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2007-05-10 08:13:14 +08:00
|
|
|
if (key.objectid >= block_group->key.objectid +
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
block_group->key.offset)
|
2007-05-10 08:13:14 +08:00
|
|
|
break;
|
2007-09-15 04:15:28 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_ITEM_KEY) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_found += add_new_free_space(block_group,
|
|
|
|
fs_info, last,
|
|
|
|
key.objectid);
|
2007-09-15 04:15:28 +08:00
|
|
|
last = key.objectid + key.offset;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (total_found > (1024 * 1024 * 2)) {
|
|
|
|
total_found = 0;
|
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2007-05-10 08:13:14 +08:00
|
|
|
path->slots[0]++;
|
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
ret = 0;
|
2007-05-10 08:13:14 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
total_found += add_new_free_space(block_group, fs_info, last,
|
|
|
|
block_group->key.objectid +
|
|
|
|
block_group->key.offset);
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl->progress = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
block_group->caching_ctl = NULL;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
block_group->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
spin_unlock(&block_group->lock);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2007-06-23 02:16:25 +08:00
|
|
|
err:
|
2007-05-10 08:13:14 +08:00
|
|
|
btrfs_free_path(path);
|
2009-07-30 21:40:40 +08:00
|
|
|
up_read(&fs_info->extent_commit_sem);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(extent_root, block_group);
|
|
|
|
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
2011-07-01 02:42:28 +08:00
|
|
|
out:
|
2009-09-12 04:11:19 +08:00
|
|
|
wake_up(&caching_ctl->wait);
|
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2010-08-26 04:54:15 +08:00
|
|
|
static int cache_block_group(struct btrfs_block_group_cache *cache,
|
|
|
|
struct btrfs_trans_handle *trans,
|
2010-12-08 22:15:11 +08:00
|
|
|
struct btrfs_root *root,
|
2010-08-26 04:54:15 +08:00
|
|
|
int load_cache_only)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
{
|
2011-11-15 02:52:14 +08:00
|
|
|
DEFINE_WAIT(wait);
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = cache->fs_info;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
|
|
|
|
BUG_ON(!caching_ctl);
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&caching_ctl->list);
|
|
|
|
mutex_init(&caching_ctl->mutex);
|
|
|
|
init_waitqueue_head(&caching_ctl->wait);
|
|
|
|
caching_ctl->block_group = cache;
|
|
|
|
caching_ctl->progress = cache->key.objectid;
|
|
|
|
atomic_set(&caching_ctl->count, 1);
|
|
|
|
caching_ctl->work.func = caching_thread;
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
/*
|
|
|
|
* This should be a rare occasion, but this could happen I think in the
|
|
|
|
* case where one thread starts to load the space cache info, and then
|
|
|
|
* some other thread starts a transaction commit which tries to do an
|
|
|
|
* allocation while the other thread is still loading the space cache
|
|
|
|
* info. The previous loop should have kept us from choosing this block
|
|
|
|
* group, but if we've moved to the state where we will wait on caching
|
|
|
|
* block groups we need to first check if we're doing a fast load here,
|
|
|
|
* so we can wait for it to finish, otherwise we could end up allocating
|
|
|
|
* from a block group who's cache gets evicted for one reason or
|
|
|
|
* another.
|
|
|
|
*/
|
|
|
|
while (cache->cached == BTRFS_CACHE_FAST) {
|
|
|
|
struct btrfs_caching_control *ctl;
|
|
|
|
|
|
|
|
ctl = cache->caching_ctl;
|
|
|
|
atomic_inc(&ctl->count);
|
|
|
|
prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
|
|
|
|
schedule();
|
|
|
|
|
|
|
|
finish_wait(&ctl->wait, &wait);
|
|
|
|
put_caching_control(ctl);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cache->cached != BTRFS_CACHE_NO) {
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
kfree(caching_ctl);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
2011-11-15 02:52:14 +08:00
|
|
|
}
|
|
|
|
WARN_ON(cache->caching_ctl);
|
|
|
|
cache->caching_ctl = caching_ctl;
|
|
|
|
cache->cached = BTRFS_CACHE_FAST;
|
|
|
|
spin_unlock(&cache->lock);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2010-08-26 04:54:15 +08:00
|
|
|
/*
|
|
|
|
* We can't do the read from on-disk cache during a commit since we need
|
2010-12-08 22:15:11 +08:00
|
|
|
* to have the normal tree locking. Also if we are currently trying to
|
|
|
|
* allocate blocks for the tree root we can't do the fast caching since
|
|
|
|
* we likely hold important locks.
|
2010-08-26 04:54:15 +08:00
|
|
|
*/
|
2011-03-24 18:24:28 +08:00
|
|
|
if (trans && (!trans->transaction->in_commit) &&
|
2011-10-04 02:07:49 +08:00
|
|
|
(root && root != root->fs_info->tree_root) &&
|
|
|
|
btrfs_test_opt(root, SPACE_CACHE)) {
|
2010-08-26 04:54:15 +08:00
|
|
|
ret = load_free_space_cache(fs_info, cache);
|
|
|
|
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (ret == 1) {
|
2011-11-15 02:52:14 +08:00
|
|
|
cache->caching_ctl = NULL;
|
2010-08-26 04:54:15 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
|
|
|
} else {
|
2011-11-15 02:52:14 +08:00
|
|
|
if (load_cache_only) {
|
|
|
|
cache->caching_ctl = NULL;
|
|
|
|
cache->cached = BTRFS_CACHE_NO;
|
|
|
|
} else {
|
|
|
|
cache->cached = BTRFS_CACHE_STARTED;
|
|
|
|
}
|
2010-08-26 04:54:15 +08:00
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
2011-11-15 02:52:14 +08:00
|
|
|
wake_up(&caching_ctl->wait);
|
2011-02-02 23:53:47 +08:00
|
|
|
if (ret == 1) {
|
2011-11-15 02:52:14 +08:00
|
|
|
put_caching_control(caching_ctl);
|
2011-02-02 23:53:47 +08:00
|
|
|
free_excluded_extents(fs_info->extent_root, cache);
|
2010-08-26 04:54:15 +08:00
|
|
|
return 0;
|
2011-02-02 23:53:47 +08:00
|
|
|
}
|
2011-11-15 02:52:14 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We are not going to do the fast caching, set cached to the
|
|
|
|
* appropriate value and wakeup any waiters.
|
|
|
|
*/
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (load_cache_only) {
|
|
|
|
cache->caching_ctl = NULL;
|
|
|
|
cache->cached = BTRFS_CACHE_NO;
|
|
|
|
} else {
|
|
|
|
cache->cached = BTRFS_CACHE_STARTED;
|
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
wake_up(&caching_ctl->wait);
|
2010-08-26 04:54:15 +08:00
|
|
|
}
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
if (load_cache_only) {
|
|
|
|
put_caching_control(caching_ctl);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&fs_info->extent_commit_sem);
|
2011-11-15 02:52:14 +08:00
|
|
|
atomic_inc(&caching_ctl->count);
|
2009-09-12 04:11:19 +08:00
|
|
|
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
|
|
|
|
up_write(&fs_info->extent_commit_sem);
|
|
|
|
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(cache);
|
2009-09-12 04:11:19 +08:00
|
|
|
|
2011-07-01 02:42:28 +08:00
|
|
|
btrfs_queue_worker(&fs_info->caching_workers, &caching_ctl->work);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2007-05-10 08:13:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
|
|
|
* return the block group that starts at or after bytenr
|
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr)
|
2008-05-25 02:04:53 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-05-25 02:04:53 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache = block_group_cache_tree_search(info, bytenr, 0);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return cache;
|
2008-05-25 02:04:53 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
/*
|
2009-05-15 01:52:22 +08:00
|
|
|
* return the block group that contains the given bytenr
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
struct btrfs_block_group_cache *btrfs_lookup_block_group(
|
|
|
|
struct btrfs_fs_info *info,
|
|
|
|
u64 bytenr)
|
2007-05-06 22:15:01 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache = block_group_cache_tree_search(info, bytenr, 1);
|
2007-10-16 04:15:19 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
|
|
|
|
u64 flags)
|
2008-03-25 03:01:59 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
2009-03-11 00:39:20 +08:00
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM |
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA;
|
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list) {
|
2010-09-17 04:19:09 +08:00
|
|
|
if (found->flags & flags) {
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_unlock();
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return found;
|
2009-03-11 00:39:20 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2009-03-11 00:39:20 +08:00
|
|
|
rcu_read_unlock();
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return NULL;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
/*
|
|
|
|
* after adding space to the filesystem, we need to clear the full flags
|
|
|
|
* on all the space infos.
|
|
|
|
*/
|
|
|
|
void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list)
|
|
|
|
found->full = 0;
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
static u64 div_factor(u64 num, int factor)
|
|
|
|
{
|
|
|
|
if (factor == 10)
|
|
|
|
return num;
|
|
|
|
num *= factor;
|
|
|
|
do_div(num, 10);
|
|
|
|
return num;
|
|
|
|
}
|
|
|
|
|
2010-10-27 01:37:56 +08:00
|
|
|
static u64 div_factor_fine(u64 num, int factor)
|
|
|
|
{
|
|
|
|
if (factor == 100)
|
|
|
|
return num;
|
|
|
|
num *= factor;
|
|
|
|
do_div(num, 100);
|
|
|
|
return num;
|
|
|
|
}
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 btrfs_find_block_group(struct btrfs_root *root,
|
|
|
|
u64 search_start, u64 search_hint, int owner)
|
2007-04-27 22:08:34 +08:00
|
|
|
{
|
2007-10-16 04:15:19 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-04-27 22:08:34 +08:00
|
|
|
u64 used;
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 last = max(search_hint, search_start);
|
|
|
|
u64 group_start = 0;
|
2007-05-01 03:25:45 +08:00
|
|
|
int full_search = 0;
|
2008-12-12 05:30:39 +08:00
|
|
|
int factor = 9;
|
2008-05-25 02:04:53 +08:00
|
|
|
int wrapped = 0;
|
2007-05-01 03:25:45 +08:00
|
|
|
again:
|
2008-09-26 22:05:48 +08:00
|
|
|
while (1) {
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (!cache)
|
|
|
|
break;
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock(&cache->lock);
|
2007-10-16 04:15:19 +08:00
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
used = btrfs_block_group_used(&cache->item);
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
if ((full_search || !cache->ro) &&
|
|
|
|
block_group_bits(cache, BTRFS_BLOCK_GROUP_METADATA)) {
|
2008-09-26 22:05:48 +08:00
|
|
|
if (used + cache->pinned + cache->reserved <
|
2008-12-12 05:30:39 +08:00
|
|
|
div_factor(cache->key.offset, factor)) {
|
|
|
|
group_start = cache->key.objectid;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2008-04-04 04:29:03 +08:00
|
|
|
goto found;
|
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2007-05-19 01:28:27 +08:00
|
|
|
cond_resched();
|
2007-04-27 22:08:34 +08:00
|
|
|
}
|
2008-05-25 02:04:53 +08:00
|
|
|
if (!wrapped) {
|
|
|
|
last = search_start;
|
|
|
|
wrapped = 1;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (!full_search && factor < 10) {
|
2007-05-06 22:15:01 +08:00
|
|
|
last = search_start;
|
2007-05-01 03:25:45 +08:00
|
|
|
full_search = 1;
|
2008-05-25 02:04:53 +08:00
|
|
|
factor = 10;
|
2007-05-01 03:25:45 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2007-05-06 22:15:01 +08:00
|
|
|
found:
|
2008-12-12 05:30:39 +08:00
|
|
|
return group_start;
|
2008-06-26 04:01:30 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-09-06 04:13:11 +08:00
|
|
|
/* simple helper to search for an existing extent at a given offset */
|
2008-09-24 01:14:14 +08:00
|
|
|
int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
|
2008-09-06 04:13:11 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_key key;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct btrfs_path *path;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2008-09-06 04:13:11 +08:00
|
|
|
key.objectid = start;
|
|
|
|
key.offset = len;
|
|
|
|
btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
|
|
|
|
ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
|
|
|
|
0, 0);
|
2008-09-24 01:14:14 +08:00
|
|
|
btrfs_free_path(path);
|
2007-12-11 22:25:06 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
/*
|
|
|
|
* helper function to lookup reference count and flags of extent.
|
|
|
|
*
|
|
|
|
* the head node for delayed ref is used to store the sum of all the
|
|
|
|
* reference count modifications queued up in the rbtree. the head
|
|
|
|
* node may also store the extent flags to set. This way you can check
|
|
|
|
* to see what the reference count and extent flags would be if all of
|
|
|
|
* the delayed refs are not processed.
|
|
|
|
*/
|
|
|
|
int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr,
|
|
|
|
u64 num_bytes, u64 *refs, u64 *flags)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u32 item_size;
|
|
|
|
u64 num_refs;
|
|
|
|
u64 extent_flags;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
if (!trans) {
|
|
|
|
path->skip_locking = 1;
|
|
|
|
path->search_commit_root = 1;
|
|
|
|
}
|
|
|
|
again:
|
|
|
|
ret = btrfs_search_slot(trans, root->fs_info->extent_root,
|
|
|
|
&key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
if (ret == 0) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
if (item_size >= sizeof(*ei)) {
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item);
|
|
|
|
num_refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
extent_flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
} else {
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
struct btrfs_extent_item_v0 *ei0;
|
|
|
|
BUG_ON(item_size != sizeof(*ei0));
|
|
|
|
ei0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item_v0);
|
|
|
|
num_refs = btrfs_extent_refs_v0(leaf, ei0);
|
|
|
|
/* FIXME: this isn't correct for data */
|
|
|
|
extent_flags = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
#else
|
|
|
|
BUG();
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
BUG_ON(num_refs == 0);
|
|
|
|
} else {
|
|
|
|
num_refs = 0;
|
|
|
|
extent_flags = 0;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!trans)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (head) {
|
|
|
|
if (!mutex_trylock(&head->mutex)) {
|
|
|
|
atomic_inc(&head->node.refs);
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-05-16 22:48:46 +08:00
|
|
|
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's released and try
|
|
|
|
* again
|
|
|
|
*/
|
2010-05-16 22:48:46 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
btrfs_put_delayed_ref(&head->node);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (head->extent_op && head->extent_op->update_flags)
|
|
|
|
extent_flags |= head->extent_op->flags_to_set;
|
|
|
|
else
|
|
|
|
BUG_ON(num_refs == 0);
|
|
|
|
|
|
|
|
num_refs += head->node.ref_mod;
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
out:
|
|
|
|
WARN_ON(num_refs == 0);
|
|
|
|
if (refs)
|
|
|
|
*refs = num_refs;
|
|
|
|
if (flags)
|
|
|
|
*flags = extent_flags;
|
|
|
|
out_free:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2007-12-12 01:42:00 +08:00
|
|
|
/*
|
|
|
|
* Back reference rules. Back refs have three main goals:
|
|
|
|
*
|
|
|
|
* 1) differentiate between all holders of references to an extent so that
|
|
|
|
* when a reference is dropped we can make sure it was a valid reference
|
|
|
|
* before freeing the extent.
|
|
|
|
*
|
|
|
|
* 2) Provide enough information to quickly find the holders of an extent
|
|
|
|
* if we notice a given block is corrupted or bad.
|
|
|
|
*
|
|
|
|
* 3) Make it easy to migrate blocks for FS shrinking or storage pool
|
|
|
|
* maintenance. This is actually the same as #2, but with a slightly
|
|
|
|
* different use case.
|
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* There are two kinds of back refs. The implicit back refs is optimized
|
|
|
|
* for pointers in non-shared tree blocks. For a given pointer in a block,
|
|
|
|
* back refs of this kind provide information about the block's owner tree
|
|
|
|
* and the pointer's key. These information allow us to find the block by
|
|
|
|
* b-tree searching. The full back refs is for pointers in tree blocks not
|
|
|
|
* referenced by their owner trees. The location of tree block is recorded
|
|
|
|
* in the back refs. Actually the full back refs is generic, and can be
|
|
|
|
* used in all cases the implicit back refs is used. The major shortcoming
|
|
|
|
* of the full back refs is its overhead. Every time a tree block gets
|
|
|
|
* COWed, we have to update back refs entry for all pointers in it.
|
|
|
|
*
|
|
|
|
* For a newly allocated tree block, we use implicit back refs for
|
|
|
|
* pointers in it. This means most tree related operations only involve
|
|
|
|
* implicit back refs. For a tree block created in old transaction, the
|
|
|
|
* only way to drop a reference to it is COW it. So we can detect the
|
|
|
|
* event that tree block loses its owner tree's reference and do the
|
|
|
|
* back refs conversion.
|
|
|
|
*
|
|
|
|
* When a tree block is COW'd through a tree, there are four cases:
|
|
|
|
*
|
|
|
|
* The reference count of the block is one and the tree is the block's
|
|
|
|
* owner tree. Nothing to do in this case.
|
|
|
|
*
|
|
|
|
* The reference count of the block is one and the tree is not the
|
|
|
|
* block's owner tree. In this case, full back refs is used for pointers
|
|
|
|
* in the block. Remove these full back refs, add implicit back refs for
|
|
|
|
* every pointers in the new block.
|
|
|
|
*
|
|
|
|
* The reference count of the block is greater than one and the tree is
|
|
|
|
* the block's owner tree. In this case, implicit back refs is used for
|
|
|
|
* pointers in the block. Add full back refs for every pointers in the
|
|
|
|
* block, increase lower level extents' reference counts. The original
|
|
|
|
* implicit back refs are entailed to the new block.
|
|
|
|
*
|
|
|
|
* The reference count of the block is greater than one and the tree is
|
|
|
|
* not the block's owner tree. Add implicit back refs for every pointer in
|
|
|
|
* the new block, increase lower level extents' reference count.
|
|
|
|
*
|
|
|
|
* Back Reference Key composing:
|
|
|
|
*
|
|
|
|
* The key objectid corresponds to the first byte in the extent,
|
|
|
|
* The key type is used to differentiate between types of back refs.
|
|
|
|
* There are different meanings of the key offset for different types
|
|
|
|
* of back refs.
|
|
|
|
*
|
2007-12-12 01:42:00 +08:00
|
|
|
* File extents can be referenced by:
|
|
|
|
*
|
|
|
|
* - multiple snapshots, subvolumes, or different generations in one subvol
|
2008-09-24 01:14:14 +08:00
|
|
|
* - different files inside a single subvolume
|
2007-12-12 01:42:00 +08:00
|
|
|
* - different offsets inside a file (bookend extents in file.c)
|
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The extent ref structure for the implicit back refs has fields for:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
|
|
|
* - Objectid of the subvolume root
|
|
|
|
* - objectid of the file holding the reference
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - original offset in the file
|
|
|
|
* - how many bookend extents
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The key offset for the implicit back refs is hash of the first
|
|
|
|
* three fields.
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The extent ref structure for the full back refs has field for:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - number of pointers in the tree leaf
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* The key offset for the implicit back refs is the first byte of
|
|
|
|
* the tree leaf
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When a file extent is allocated, The implicit back refs is used.
|
|
|
|
* the fields are filled in:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* (root_key.objectid, inode objectid, offset in file, 1)
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When a file extent is removed file truncation, we find the
|
|
|
|
* corresponding implicit back refs and check the following fields:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* (btrfs_header_owner(leaf), inode objectid, offset in file)
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* Btree extents can be referenced by:
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* - Different subvolumes
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* Both the implicit back refs and the full back refs for tree blocks
|
|
|
|
* only consist of key. The key offset for the implicit back refs is
|
|
|
|
* objectid of block's owner tree. The key offset for the full back refs
|
|
|
|
* is the first byte of parent block.
|
2007-12-12 01:42:00 +08:00
|
|
|
*
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
* When implicit back refs is used, information about the lowest key and
|
|
|
|
* level of the tree block are required. These information are stored in
|
|
|
|
* tree block info structure.
|
2007-12-12 01:42:00 +08:00
|
|
|
*/
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
static int convert_extent_item_v0(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 owner, u32 extra_size)
|
2007-12-11 22:25:06 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_item *item;
|
|
|
|
struct btrfs_extent_item_v0 *ei0;
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
struct extent_buffer *leaf;
|
2007-12-11 22:25:06 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
u32 new_size = sizeof(*item);
|
|
|
|
u64 refs;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
BUG_ON(btrfs_item_size_nr(leaf, path->slots[0]) != sizeof(*ei0));
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
ei0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item_v0);
|
|
|
|
refs = btrfs_extent_refs_v0(leaf, ei0);
|
|
|
|
|
|
|
|
if (owner == (u64)-1) {
|
|
|
|
while (1) {
|
|
|
|
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
BUG_ON(ret > 0);
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key,
|
|
|
|
path->slots[0]);
|
|
|
|
BUG_ON(key.objectid != found_key.objectid);
|
|
|
|
if (found_key.type != BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
path->slots[0]++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
owner = btrfs_ref_objectid_v0(leaf, ref0);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID)
|
|
|
|
new_size += sizeof(*bi);
|
|
|
|
|
|
|
|
new_size -= sizeof(*ei0);
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path,
|
|
|
|
new_size + extra_size, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
|
|
|
ret = btrfs_extend_item(trans, root, path, new_size);
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
btrfs_set_extent_refs(leaf, item, refs);
|
|
|
|
/* FIXME: get real generation */
|
|
|
|
btrfs_set_extent_generation(leaf, item, 0);
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
btrfs_set_extent_flags(leaf, item,
|
|
|
|
BTRFS_EXTENT_FLAG_TREE_BLOCK |
|
|
|
|
BTRFS_BLOCK_FLAG_FULL_BACKREF);
|
|
|
|
bi = (struct btrfs_tree_block_info *)(item + 1);
|
|
|
|
/* FIXME: get first key of the block */
|
|
|
|
memset_extent_buffer(leaf, 0, (unsigned long)bi, sizeof(*bi));
|
|
|
|
btrfs_set_tree_block_level(leaf, bi, (int)owner);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_flags(leaf, item, BTRFS_EXTENT_FLAG_DATA);
|
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
u32 high_crc = ~(u32)0;
|
|
|
|
u32 low_crc = ~(u32)0;
|
|
|
|
__le64 lenum;
|
|
|
|
|
|
|
|
lenum = cpu_to_le64(root_objectid);
|
2009-04-19 20:02:41 +08:00
|
|
|
high_crc = crc32c(high_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
lenum = cpu_to_le64(owner);
|
2009-04-19 20:02:41 +08:00
|
|
|
low_crc = crc32c(low_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
lenum = cpu_to_le64(offset);
|
2009-04-19 20:02:41 +08:00
|
|
|
low_crc = crc32c(low_crc, &lenum, sizeof(lenum));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
return ((u64)high_crc << 31) ^ (u64)low_crc;
|
|
|
|
}
|
|
|
|
|
|
|
|
static u64 hash_extent_data_ref_item(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_data_ref *ref)
|
|
|
|
{
|
|
|
|
return hash_extent_data_ref(btrfs_extent_data_ref_root(leaf, ref),
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref),
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int match_extent_data_ref(struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_data_ref *ref,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
if (btrfs_extent_data_ref_root(leaf, ref) != root_objectid ||
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref) != owner ||
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref) != offset)
|
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid,
|
|
|
|
u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 nritems;
|
2007-12-11 22:25:06 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int recow;
|
|
|
|
int err = -ENOENT;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
key.objectid = bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
key.offset = hash_extent_data_ref(root_objectid,
|
|
|
|
owner, offset);
|
|
|
|
}
|
|
|
|
again:
|
|
|
|
recow = 0;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto fail;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
key.type = BTRFS_EXTENT_REF_V0_KEY;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
goto fail;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
while (1) {
|
|
|
|
if (path->slots[0] >= nritems) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0)
|
|
|
|
err = ret;
|
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
recow = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (key.objectid != bytenr ||
|
|
|
|
key.type != BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
|
|
|
|
if (match_extent_data_ref(leaf, ref, root_objectid,
|
|
|
|
owner, offset)) {
|
|
|
|
if (recow) {
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
path->slots[0]++;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
fail:
|
|
|
|
return err;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int insert_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, int refs_to_add)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 size;
|
2008-09-24 01:14:14 +08:00
|
|
|
u32 num_refs;
|
|
|
|
int ret;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
|
|
|
key.objectid = bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
size = sizeof(struct btrfs_shared_data_ref);
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
key.offset = hash_extent_data_ref(root_objectid,
|
|
|
|
owner, offset);
|
|
|
|
size = sizeof(struct btrfs_extent_data_ref);
|
|
|
|
}
|
2007-12-11 22:25:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key, size);
|
|
|
|
if (ret && ret != -EEXIST)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
if (parent) {
|
|
|
|
struct btrfs_shared_data_ref *ref;
|
2008-09-24 01:14:14 +08:00
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
if (ret == 0) {
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, refs_to_add);
|
|
|
|
} else {
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref);
|
|
|
|
num_refs += refs_to_add;
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, num_refs);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
while (ret == -EEXIST) {
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
if (match_extent_data_ref(leaf, ref, root_objectid,
|
|
|
|
owner, offset))
|
|
|
|
break;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.offset++;
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
size);
|
|
|
|
if (ret && ret != -EEXIST)
|
|
|
|
goto fail;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
}
|
|
|
|
ref = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
if (ret == 0) {
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, ref,
|
|
|
|
root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, ref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, ref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, refs_to_add);
|
|
|
|
} else {
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref);
|
|
|
|
num_refs += refs_to_add;
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, num_refs);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
ret = 0;
|
|
|
|
fail:
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2007-12-11 22:25:06 +08:00
|
|
|
return ret;
|
2007-12-11 22:25:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
int refs_to_drop)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_extent_data_ref *ref1 = NULL;
|
|
|
|
struct btrfs_shared_data_ref *ref2 = NULL;
|
2008-09-24 01:14:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 num_refs = 0;
|
2008-09-24 01:14:14 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
|
|
|
|
if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else if (key.type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
ref2 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
} else if (key.type == BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
num_refs = btrfs_ref_count_v0(leaf, ref0);
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(num_refs < refs_to_drop);
|
|
|
|
num_refs -= refs_to_drop;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
if (num_refs == 0) {
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
} else {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref1, num_refs);
|
|
|
|
else if (key.type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref2, num_refs);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
else {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
btrfs_set_ref_count_v0(leaf, ref0, num_refs);
|
|
|
|
}
|
|
|
|
#endif
|
2008-09-24 01:14:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline u32 extent_data_ref_count(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref)
|
2008-11-20 10:17:22 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_data_ref *ref1;
|
|
|
|
struct btrfs_shared_data_ref *ref2;
|
|
|
|
u32 num_refs = 0;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
|
|
|
if (iref) {
|
|
|
|
if (btrfs_extent_inline_ref_type(leaf, iref) ==
|
|
|
|
BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else {
|
|
|
|
ref2 = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
}
|
|
|
|
} else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
ref1 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_data_ref);
|
|
|
|
num_refs = btrfs_extent_data_ref_count(leaf, ref1);
|
|
|
|
} else if (key.type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
ref2 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_shared_data_ref);
|
|
|
|
num_refs = btrfs_shared_data_ref_count(leaf, ref2);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
} else if (key.type == BTRFS_EXTENT_REF_V0_KEY) {
|
|
|
|
struct btrfs_extent_ref_v0 *ref0;
|
|
|
|
ref0 = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_ref_v0);
|
|
|
|
num_refs = btrfs_ref_count_v0(leaf, ref0);
|
2008-11-20 23:22:27 +08:00
|
|
|
#endif
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
WARN_ON(1);
|
|
|
|
}
|
|
|
|
return num_refs;
|
|
|
|
}
|
2008-11-20 10:17:22 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int lookup_tree_block_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid)
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
int ret;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.objectid = bytenr;
|
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
key.offset = root_objectid;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (ret == -ENOENT && parent) {
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.type = BTRFS_EXTENT_REF_V0_KEY;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -ENOENT;
|
|
|
|
}
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
#endif
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static noinline int insert_tree_block_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent,
|
|
|
|
u64 root_objectid)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_key key;
|
2008-09-24 01:14:14 +08:00
|
|
|
int ret;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.objectid = bytenr;
|
|
|
|
if (parent) {
|
|
|
|
key.type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
key.offset = parent;
|
|
|
|
} else {
|
|
|
|
key.type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
key.offset = root_objectid;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-09-24 01:14:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static inline int extent_ref_type(u64 parent, u64 owner)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int type;
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_BLOCK_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_TREE_BLOCK_REF_KEY;
|
|
|
|
} else {
|
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_EXTENT_DATA_REF_KEY;
|
|
|
|
}
|
|
|
|
return type;
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
static int find_next_key(struct btrfs_path *path, int level,
|
|
|
|
struct btrfs_key *key)
|
2009-03-13 22:10:06 +08:00
|
|
|
|
2007-03-03 05:08:05 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
for (; level < BTRFS_MAX_LEVEL; level++) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!path->nodes[level])
|
|
|
|
break;
|
|
|
|
if (path->slots[level] + 1 >=
|
|
|
|
btrfs_header_nritems(path->nodes[level]))
|
|
|
|
continue;
|
|
|
|
if (level == 0)
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[level], key,
|
|
|
|
path->slots[level] + 1);
|
|
|
|
else
|
|
|
|
btrfs_node_key_to_cpu(path->nodes[level], key,
|
|
|
|
path->slots[level] + 1);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
2007-03-08 00:50:24 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/*
|
|
|
|
* look for inline back ref. if back ref is found, *ref_ret is set
|
|
|
|
* to the address of inline back ref, and 0 is returned.
|
|
|
|
*
|
|
|
|
* if back ref isn't found, *ref_ret is set to the address where it
|
|
|
|
* should be inserted, and -ENOENT is returned.
|
|
|
|
*
|
|
|
|
* if insert is true and there are too many inline back refs, the path
|
|
|
|
* points to the extent item, and -EAGAIN is returned.
|
|
|
|
*
|
|
|
|
* NOTE: inline back refs are ordered in the same way that back ref
|
|
|
|
* items in the tree are ordered.
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
|
|
|
int lookup_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref **ref_ret,
|
|
|
|
u64 bytenr, u64 num_bytes,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int insert)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
u64 flags;
|
|
|
|
u64 item_size;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
int extra_size;
|
|
|
|
int type;
|
|
|
|
int want;
|
|
|
|
int ret;
|
|
|
|
int err = 0;
|
2007-08-09 08:17:12 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
key.objectid = bytenr;
|
2008-09-24 01:14:14 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-03-13 22:10:06 +08:00
|
|
|
key.offset = num_bytes;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
want = extent_ref_type(parent, owner);
|
|
|
|
if (insert) {
|
|
|
|
extra_size = btrfs_extent_inline_ref_size(want);
|
2009-06-11 20:51:10 +08:00
|
|
|
path->keep_locks = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else
|
|
|
|
extra_size = -1;
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, extra_size, 1);
|
2009-03-13 23:00:37 +08:00
|
|
|
if (ret < 0) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
if (!insert) {
|
|
|
|
err = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ret = convert_extent_item_v0(trans, root, path, owner,
|
|
|
|
extra_size);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
|
|
|
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
|
|
|
|
ptr = (unsigned long)(ei + 1);
|
|
|
|
end = (unsigned long)ei + item_size;
|
|
|
|
|
|
|
|
if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
|
|
|
|
ptr += sizeof(struct btrfs_tree_block_info);
|
|
|
|
BUG_ON(ptr > end);
|
|
|
|
} else {
|
|
|
|
BUG_ON(!(flags & BTRFS_EXTENT_FLAG_DATA));
|
|
|
|
}
|
|
|
|
|
|
|
|
err = -ENOENT;
|
|
|
|
while (1) {
|
|
|
|
if (ptr >= end) {
|
|
|
|
WARN_ON(ptr > end);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
type = btrfs_extent_inline_ref_type(leaf, iref);
|
|
|
|
if (want < type)
|
|
|
|
break;
|
|
|
|
if (want > type) {
|
|
|
|
ptr += btrfs_extent_inline_ref_size(type);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
struct btrfs_extent_data_ref *dref;
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
if (match_extent_data_ref(leaf, dref, root_objectid,
|
|
|
|
owner, offset)) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (hash_extent_data_ref_item(leaf, dref) <
|
|
|
|
hash_extent_data_ref(root_objectid, owner, offset))
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
u64 ref_offset;
|
|
|
|
ref_offset = btrfs_extent_inline_ref_offset(leaf, iref);
|
|
|
|
if (parent > 0) {
|
|
|
|
if (parent == ref_offset) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ref_offset < parent)
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
if (root_objectid == ref_offset) {
|
|
|
|
err = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (ref_offset < root_objectid)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ptr += btrfs_extent_inline_ref_size(type);
|
|
|
|
}
|
|
|
|
if (err == -ENOENT && insert) {
|
|
|
|
if (item_size + extra_size >=
|
|
|
|
BTRFS_MAX_EXTENT_ITEM_SIZE(root)) {
|
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* To add new inline back ref, we have to make sure
|
|
|
|
* there is no corresponding back ref item.
|
|
|
|
* For simplicity, we just do not add new inline back
|
|
|
|
* ref if there is any kind of item for this block
|
|
|
|
*/
|
2009-06-28 09:07:35 +08:00
|
|
|
if (find_next_key(path, 0, &key) == 0 &&
|
|
|
|
key.objectid == bytenr &&
|
2009-06-11 20:51:10 +08:00
|
|
|
key.type < BTRFS_BLOCK_GROUP_ITEM_KEY) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
*ref_ret = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
out:
|
2009-06-11 20:51:10 +08:00
|
|
|
if (insert) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->keep_locks = 0;
|
|
|
|
btrfs_unlock_up_safe(path, 1);
|
|
|
|
}
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper to add new inline back ref
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
|
|
|
int setup_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
unsigned long item_offset;
|
|
|
|
u64 refs;
|
|
|
|
int size;
|
|
|
|
int type;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
item_offset = (unsigned long)iref - (unsigned long)ei;
|
|
|
|
|
|
|
|
type = extent_ref_type(parent, owner);
|
|
|
|
size = btrfs_extent_inline_ref_size(type);
|
|
|
|
|
|
|
|
ret = btrfs_extend_item(trans, root, path, size);
|
|
|
|
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
refs += refs_to_add;
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
|
|
|
|
ptr = (unsigned long)ei + item_offset;
|
|
|
|
end = (unsigned long)ei + btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
if (ptr < end - size)
|
|
|
|
memmove_extent_buffer(leaf, ptr + size, ptr,
|
|
|
|
end - size - ptr);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)ptr;
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref, type);
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
struct btrfs_extent_data_ref *dref;
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, dref, root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, dref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, dref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, dref, refs_to_add);
|
|
|
|
} else if (type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
struct btrfs_shared_data_ref *sref;
|
|
|
|
sref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, sref, refs_to_add);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else if (type == BTRFS_SHARED_BLOCK_REF_KEY) {
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, root_objectid);
|
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int lookup_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref **ref_ret,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = lookup_inline_extent_backref(trans, root, path, ref_ret,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset, 0);
|
|
|
|
if (ret != -ENOENT)
|
2007-06-23 02:16:25 +08:00
|
|
|
return ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
*ref_ret = NULL;
|
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
ret = lookup_tree_block_ref(trans, root, path, bytenr, parent,
|
|
|
|
root_objectid);
|
|
|
|
} else {
|
|
|
|
ret = lookup_extent_data_ref(trans, root, path, bytenr, parent,
|
|
|
|
root_objectid, owner, offset);
|
2009-03-13 23:00:37 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/*
|
|
|
|
* helper to update/remove inline back ref
|
|
|
|
*/
|
|
|
|
static noinline_for_stack
|
|
|
|
int update_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
int refs_to_mod,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_data_ref *dref = NULL;
|
|
|
|
struct btrfs_shared_data_ref *sref = NULL;
|
|
|
|
unsigned long ptr;
|
|
|
|
unsigned long end;
|
|
|
|
u32 item_size;
|
|
|
|
int size;
|
|
|
|
int type;
|
|
|
|
int ret;
|
|
|
|
u64 refs;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
|
|
|
WARN_ON(refs_to_mod < 0 && refs + refs_to_mod <= 0);
|
|
|
|
refs += refs_to_mod;
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
|
|
|
|
type = btrfs_extent_inline_ref_type(leaf, iref);
|
|
|
|
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
|
|
|
|
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
refs = btrfs_extent_data_ref_count(leaf, dref);
|
|
|
|
} else if (type == BTRFS_SHARED_DATA_REF_KEY) {
|
|
|
|
sref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
refs = btrfs_shared_data_ref_count(leaf, sref);
|
|
|
|
} else {
|
|
|
|
refs = 1;
|
|
|
|
BUG_ON(refs_to_mod != -1);
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(refs_to_mod < 0 && refs < -refs_to_mod);
|
|
|
|
refs += refs_to_mod;
|
|
|
|
|
|
|
|
if (refs > 0) {
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, dref, refs);
|
|
|
|
else
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, sref, refs);
|
|
|
|
} else {
|
|
|
|
size = btrfs_extent_inline_ref_size(type);
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
ptr = (unsigned long)iref;
|
|
|
|
end = (unsigned long)ei + item_size;
|
|
|
|
if (ptr + size < end)
|
|
|
|
memmove_extent_buffer(leaf, ptr, ptr + size,
|
|
|
|
end - ptr - size);
|
|
|
|
item_size -= size;
|
|
|
|
ret = btrfs_truncate_item(trans, root, path, item_size, 1);
|
|
|
|
}
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline_for_stack
|
|
|
|
int insert_inline_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = lookup_inline_extent_backref(trans, root, path, &iref,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset, 1);
|
|
|
|
if (ret == 0) {
|
|
|
|
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
|
|
|
|
ret = update_inline_extent_backref(trans, root, path, iref,
|
|
|
|
refs_to_add, extent_op);
|
|
|
|
} else if (ret == -ENOENT) {
|
|
|
|
ret = setup_inline_extent_backref(trans, root, path, iref,
|
|
|
|
parent, root_objectid,
|
|
|
|
owner, offset, refs_to_add,
|
|
|
|
extent_op);
|
2008-11-07 11:02:51 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int insert_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 bytenr, u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
BUG_ON(refs_to_add != 1);
|
|
|
|
ret = insert_tree_block_ref(trans, root, path, bytenr,
|
|
|
|
parent, root_objectid);
|
|
|
|
} else {
|
|
|
|
ret = insert_extent_data_ref(trans, root, path, bytenr,
|
|
|
|
parent, root_objectid,
|
|
|
|
owner, offset, refs_to_add);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int remove_extent_backref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_extent_inline_ref *iref,
|
|
|
|
int refs_to_drop, int is_data)
|
|
|
|
{
|
|
|
|
int ret;
|
2009-03-13 23:00:37 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(!is_data && refs_to_drop != 1);
|
|
|
|
if (iref) {
|
|
|
|
ret = update_inline_extent_backref(trans, root, path, iref,
|
|
|
|
-refs_to_drop, NULL);
|
|
|
|
} else if (is_data) {
|
|
|
|
ret = remove_extent_data_ref(trans, root, path, refs_to_drop);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
static int btrfs_issue_discard(struct block_device *bdev,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
2011-03-24 18:24:27 +08:00
|
|
|
return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 num_bytes, u64 *actual_bytes)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 discarded_bytes = 0;
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio *bbio = NULL;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-10-14 21:24:59 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/* Tell the block device(s) that the sectors can be discarded */
|
2011-03-24 18:24:27 +08:00
|
|
|
ret = btrfs_map_block(&root->fs_info->mapping_tree, REQ_DISCARD,
|
2011-08-04 23:15:33 +08:00
|
|
|
bytenr, &num_bytes, &bbio, 0);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!ret) {
|
2011-08-04 23:15:33 +08:00
|
|
|
struct btrfs_bio_stripe *stripe = bbio->stripes;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
|
2011-08-04 23:15:33 +08:00
|
|
|
for (i = 0; i < bbio->num_stripes; i++, stripe++) {
|
2011-08-04 22:52:27 +08:00
|
|
|
if (!stripe->dev->can_discard)
|
|
|
|
continue;
|
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
ret = btrfs_issue_discard(stripe->dev->bdev,
|
|
|
|
stripe->physical,
|
|
|
|
stripe->length);
|
|
|
|
if (!ret)
|
|
|
|
discarded_bytes += stripe->length;
|
|
|
|
else if (ret != -EOPNOTSUPP)
|
|
|
|
break;
|
2011-08-04 22:52:27 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Just in case we get back EOPNOTSUPP for some reason,
|
|
|
|
* just ignore the return value so we don't screw up
|
|
|
|
* people calling discard_extent.
|
|
|
|
*/
|
|
|
|
ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
2011-08-04 23:15:33 +08:00
|
|
|
kfree(bbio);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
}
|
2011-03-24 18:24:27 +08:00
|
|
|
|
|
|
|
if (actual_bytes)
|
|
|
|
*actual_bytes = discarded_bytes;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
root_objectid == BTRFS_TREE_LOG_OBJECTID);
|
|
|
|
|
|
|
|
if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
ret = btrfs_add_delayed_tree_ref(trans, bytenr, num_bytes,
|
|
|
|
parent, root_objectid, (int)owner,
|
|
|
|
BTRFS_ADD_DELAYED_REF, NULL);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_add_delayed_data_ref(trans, bytenr, num_bytes,
|
|
|
|
parent, root_objectid, owner, offset,
|
|
|
|
BTRFS_ADD_DELAYED_REF, NULL);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 owner, u64 offset, int refs_to_add,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_extent_item *item;
|
|
|
|
u64 refs;
|
|
|
|
int ret;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
path->reada = 1;
|
|
|
|
path->leave_spinning = 1;
|
|
|
|
/* this will setup the path even if it fails to insert the back ref */
|
|
|
|
ret = insert_inline_extent_backref(trans, root->fs_info->extent_root,
|
|
|
|
path, bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner, offset,
|
|
|
|
refs_to_add, extent_op);
|
|
|
|
if (ret == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (ret != -EAGAIN) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
refs = btrfs_extent_refs(leaf, item);
|
|
|
|
btrfs_set_extent_refs(leaf, item, refs + refs_to_add);
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, item);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
|
|
|
path->reada = 1;
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/* now insert the actual backref */
|
|
|
|
ret = insert_extent_backref(trans, root->fs_info->extent_root,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path, bytenr, parent, root_objectid,
|
|
|
|
owner, offset, refs_to_add);
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(ret);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
out:
|
2009-03-13 22:10:06 +08:00
|
|
|
btrfs_free_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return err;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int ret = 0;
|
|
|
|
struct btrfs_delayed_data_ref *ref;
|
|
|
|
struct btrfs_key ins;
|
|
|
|
u64 parent = 0;
|
|
|
|
u64 ref_root = 0;
|
|
|
|
u64 flags = 0;
|
|
|
|
|
|
|
|
ins.objectid = node->bytenr;
|
|
|
|
ins.offset = node->num_bytes;
|
|
|
|
ins.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
|
|
|
|
ref = btrfs_delayed_node_to_data_ref(node);
|
|
|
|
if (node->type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
parent = ref->parent;
|
|
|
|
else
|
|
|
|
ref_root = ref->root;
|
|
|
|
|
|
|
|
if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
|
|
|
|
if (extent_op) {
|
|
|
|
BUG_ON(extent_op->update_key);
|
|
|
|
flags |= extent_op->flags_to_set;
|
|
|
|
}
|
|
|
|
ret = alloc_reserved_file_extent(trans, root,
|
|
|
|
parent, ref_root, flags,
|
|
|
|
ref->objectid, ref->offset,
|
|
|
|
&ins, node->ref_mod);
|
|
|
|
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
ret = __btrfs_inc_extent_ref(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent,
|
|
|
|
ref_root, ref->objectid,
|
|
|
|
ref->offset, node->ref_mod,
|
|
|
|
extent_op);
|
|
|
|
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
|
|
|
|
ret = __btrfs_free_extent(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent,
|
|
|
|
ref_root, ref->objectid,
|
|
|
|
ref->offset, node->ref_mod,
|
|
|
|
extent_op);
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_extent_item *ei)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_extent_flags(leaf, ei);
|
|
|
|
if (extent_op->update_flags) {
|
|
|
|
flags |= extent_op->flags_to_set;
|
|
|
|
btrfs_set_extent_flags(leaf, ei, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (extent_op->update_key) {
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
BUG_ON(!(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK));
|
|
|
|
bi = (struct btrfs_tree_block_info *)(ei + 1);
|
|
|
|
btrfs_set_tree_block_key(leaf, bi, &extent_op->key);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u32 item_size;
|
2009-03-13 22:10:06 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
key.objectid = node->bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = node->num_bytes;
|
|
|
|
|
|
|
|
path->reada = 1;
|
|
|
|
path->leave_spinning = 1;
|
|
|
|
ret = btrfs_search_slot(trans, root->fs_info->extent_root, &key,
|
|
|
|
path, 0, 1);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (ret > 0) {
|
|
|
|
err = -EIO;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
ret = convert_extent_item_v0(trans, root->fs_info->extent_root,
|
|
|
|
path, (u64)-1, 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return err;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_delayed_tree_ref *ref;
|
|
|
|
struct btrfs_key ins;
|
|
|
|
u64 parent = 0;
|
|
|
|
u64 ref_root = 0;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ins.objectid = node->bytenr;
|
|
|
|
ins.offset = node->num_bytes;
|
|
|
|
ins.type = BTRFS_EXTENT_ITEM_KEY;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ref = btrfs_delayed_node_to_tree_ref(node);
|
|
|
|
if (node->type == BTRFS_SHARED_BLOCK_REF_KEY)
|
|
|
|
parent = ref->parent;
|
|
|
|
else
|
|
|
|
ref_root = ref->root;
|
|
|
|
|
|
|
|
BUG_ON(node->ref_mod != 1);
|
|
|
|
if (node->action == BTRFS_ADD_DELAYED_REF && insert_reserved) {
|
|
|
|
BUG_ON(!extent_op || !extent_op->update_flags ||
|
|
|
|
!extent_op->update_key);
|
|
|
|
ret = alloc_reserved_tree_block(trans, root,
|
|
|
|
parent, ref_root,
|
|
|
|
extent_op->flags_to_set,
|
|
|
|
&extent_op->key,
|
|
|
|
ref->level, &ins);
|
|
|
|
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
ret = __btrfs_inc_extent_ref(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent, ref_root,
|
|
|
|
ref->level, 0, 1, extent_op);
|
|
|
|
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
|
|
|
|
ret = __btrfs_free_extent(trans, root, node->bytenr,
|
|
|
|
node->num_bytes, parent, ref_root,
|
|
|
|
ref->level, 0, 1, extent_op);
|
|
|
|
} else {
|
|
|
|
BUG();
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* helper function to actually process a single delayed ref entry */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_delayed_ref_node *node,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op,
|
|
|
|
int insert_reserved)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (btrfs_delayed_ref_is_head(node)) {
|
2009-03-13 22:10:06 +08:00
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
/*
|
|
|
|
* we've hit the end of the chain and we were supposed
|
|
|
|
* to insert this extent into the tree. But, it got
|
|
|
|
* deleted before we ever needed to insert it, so all
|
|
|
|
* we have to do is clean up the accounting
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(extent_op);
|
|
|
|
head = btrfs_delayed_node_to_head(node);
|
2009-03-13 22:10:06 +08:00
|
|
|
if (insert_reserved) {
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_pin_extent(root, node->bytenr,
|
|
|
|
node->num_bytes, 1);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (head->is_data) {
|
|
|
|
ret = btrfs_del_csums(trans, root,
|
|
|
|
node->bytenr,
|
|
|
|
node->num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
|
|
|
|
node->type == BTRFS_SHARED_BLOCK_REF_KEY)
|
|
|
|
ret = run_delayed_tree_ref(trans, root, node, extent_op,
|
|
|
|
insert_reserved);
|
|
|
|
else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
|
|
|
|
node->type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
ret = run_delayed_data_ref(trans, root, node, extent_op,
|
|
|
|
insert_reserved);
|
|
|
|
else
|
|
|
|
BUG();
|
|
|
|
return ret;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline struct btrfs_delayed_ref_node *
|
|
|
|
select_delayed_ref(struct btrfs_delayed_ref_head *head)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
int action = BTRFS_ADD_DELAYED_REF;
|
|
|
|
again:
|
|
|
|
/*
|
|
|
|
* select delayed ref of type BTRFS_ADD_DELAYED_REF first.
|
|
|
|
* this prevents ref count from going down to zero when
|
|
|
|
* there still are pending delayed ref.
|
|
|
|
*/
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
while (1) {
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node,
|
|
|
|
rb_node);
|
|
|
|
if (ref->bytenr != head->node.bytenr)
|
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (ref->action == action)
|
2009-03-13 22:10:06 +08:00
|
|
|
return ref;
|
|
|
|
node = rb_prev(node);
|
|
|
|
}
|
|
|
|
if (action == BTRFS_ADD_DELAYED_REF) {
|
|
|
|
action = BTRFS_DROP_DELAYED_REF;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct list_head *cluster)
|
2009-03-13 22:10:06 +08:00
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct btrfs_delayed_ref_head *locked_ref = NULL;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
2009-03-13 22:10:06 +08:00
|
|
|
int ret;
|
2009-03-13 22:17:05 +08:00
|
|
|
int count = 0;
|
2009-03-13 22:10:06 +08:00
|
|
|
int must_insert_reserved = 0;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
while (1) {
|
|
|
|
if (!locked_ref) {
|
2009-03-13 22:17:05 +08:00
|
|
|
/* pick a new head ref from the cluster list */
|
|
|
|
if (list_empty(cluster))
|
2009-03-13 22:10:06 +08:00
|
|
|
break;
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
locked_ref = list_entry(cluster->next,
|
|
|
|
struct btrfs_delayed_ref_head, cluster);
|
|
|
|
|
|
|
|
/* grab the lock that says we are going to process
|
|
|
|
* all the refs for this head */
|
|
|
|
ret = btrfs_delayed_ref_lock(trans, locked_ref);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we may have dropped the spin lock to get the head
|
|
|
|
* mutex lock, and that might have given someone else
|
|
|
|
* time to free the head. If that's true, it has been
|
|
|
|
* removed from our list and we can move on.
|
|
|
|
*/
|
|
|
|
if (ret == -EAGAIN) {
|
|
|
|
locked_ref = NULL;
|
|
|
|
count++;
|
|
|
|
continue;
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
|
|
|
}
|
2007-03-07 09:08:01 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
|
|
|
* record the must insert reserved flag before we
|
|
|
|
* drop the spin lock.
|
|
|
|
*/
|
|
|
|
must_insert_reserved = locked_ref->must_insert_reserved;
|
|
|
|
locked_ref->must_insert_reserved = 0;
|
2007-12-11 22:25:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_op = locked_ref->extent_op;
|
|
|
|
locked_ref->extent_op = NULL;
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
|
|
|
* locked_ref is the head node, so we have to go one
|
|
|
|
* node back for any delayed ref updates
|
|
|
|
*/
|
|
|
|
ref = select_delayed_ref(locked_ref);
|
|
|
|
if (!ref) {
|
|
|
|
/* All delayed refs have been processed, Go ahead
|
|
|
|
* and send the head node to run_one_delayed_ref,
|
|
|
|
* so that any accounting fixes can happen
|
|
|
|
*/
|
|
|
|
ref = &locked_ref->node;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (extent_op && must_insert_reserved) {
|
|
|
|
kfree(extent_op);
|
|
|
|
extent_op = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (extent_op) {
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
|
|
|
ret = run_delayed_extent_op(trans, root,
|
|
|
|
ref, extent_op);
|
|
|
|
BUG_ON(ret);
|
|
|
|
kfree(extent_op);
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
list_del_init(&locked_ref->cluster);
|
2009-03-13 22:10:06 +08:00
|
|
|
locked_ref = NULL;
|
|
|
|
}
|
2007-03-03 05:08:05 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
ref->in_tree = 0;
|
|
|
|
rb_erase(&ref->rb_node, &delayed_refs->root);
|
|
|
|
delayed_refs->num_entries--;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = run_one_delayed_ref(trans, root, ref, extent_op,
|
2009-03-13 22:10:06 +08:00
|
|
|
must_insert_reserved);
|
|
|
|
BUG_ON(ret);
|
2009-02-12 22:27:38 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_put_delayed_ref(ref);
|
|
|
|
kfree(extent_op);
|
2009-03-13 22:17:05 +08:00
|
|
|
count++;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
cond_resched();
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* this starts processing the delayed reference count updates and
|
|
|
|
* extent insertions we have queued up so far. count can be
|
|
|
|
* 0, which means to process everything in the tree at the start
|
|
|
|
* of the run (but not newly added entries), or it can be some target
|
|
|
|
* number you'd like to process.
|
|
|
|
*/
|
|
|
|
int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, unsigned long count)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct list_head cluster;
|
|
|
|
int ret;
|
|
|
|
int run_all = count == (unsigned long)-1;
|
|
|
|
int run_most = 0;
|
|
|
|
|
|
|
|
if (root == root->fs_info->extent_root)
|
|
|
|
root = root->fs_info->tree_root;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
INIT_LIST_HEAD(&cluster);
|
|
|
|
again:
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
if (count == 0) {
|
|
|
|
count = delayed_refs->num_entries * 2;
|
|
|
|
run_most = 1;
|
|
|
|
}
|
|
|
|
while (1) {
|
|
|
|
if (!(run_all || run_most) &&
|
|
|
|
delayed_refs->num_heads_ready < 64)
|
|
|
|
break;
|
2009-02-12 22:27:38 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
2009-03-13 22:17:05 +08:00
|
|
|
* go find something we can process in the rbtree. We start at
|
|
|
|
* the beginning of the tree, and then build a cluster
|
|
|
|
* of refs to process starting at the first one we are able to
|
|
|
|
* lock
|
2009-03-13 22:10:06 +08:00
|
|
|
*/
|
2009-03-13 22:17:05 +08:00
|
|
|
ret = btrfs_find_ref_cluster(trans, &cluster,
|
|
|
|
delayed_refs->run_delayed_start);
|
|
|
|
if (ret)
|
2009-03-13 22:10:06 +08:00
|
|
|
break;
|
|
|
|
|
2009-03-13 22:17:05 +08:00
|
|
|
ret = run_clustered_refs(trans, root, &cluster);
|
|
|
|
BUG_ON(ret < 0);
|
|
|
|
|
|
|
|
count -= min_t(unsigned long, ret, count);
|
|
|
|
|
|
|
|
if (count == 0)
|
|
|
|
break;
|
2009-02-12 22:27:38 +08:00
|
|
|
}
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
if (run_all) {
|
|
|
|
node = rb_first(&delayed_refs->root);
|
2009-03-13 22:17:05 +08:00
|
|
|
if (!node)
|
2009-03-13 22:10:06 +08:00
|
|
|
goto out;
|
2009-03-13 22:17:05 +08:00
|
|
|
count = (unsigned long)-1;
|
2007-08-11 02:06:19 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
while (node) {
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node,
|
|
|
|
rb_node);
|
|
|
|
if (btrfs_delayed_ref_is_head(ref)) {
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
2007-04-02 23:20:42 +08:00
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
head = btrfs_delayed_node_to_head(ref);
|
|
|
|
atomic_inc(&ref->refs);
|
|
|
|
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's
|
|
|
|
* released and try again
|
|
|
|
*/
|
2009-03-13 22:10:06 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
|
|
|
|
btrfs_put_delayed_ref(ref);
|
2009-03-13 22:11:24 +08:00
|
|
|
cond_resched();
|
2009-03-13 22:10:06 +08:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
node = rb_next(node);
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
schedule_timeout(1);
|
|
|
|
goto again;
|
2007-10-16 04:14:19 +08:00
|
|
|
}
|
2007-06-23 02:16:25 +08:00
|
|
|
out:
|
2009-03-13 22:17:05 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
2007-03-07 09:08:01 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 flags,
|
|
|
|
int is_data)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
|
|
|
|
if (!extent_op)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
extent_op->flags_to_set = flags;
|
|
|
|
extent_op->update_flags = 1;
|
|
|
|
extent_op->update_key = 0;
|
|
|
|
extent_op->is_data = is_data ? 1 : 0;
|
|
|
|
|
|
|
|
ret = btrfs_add_delayed_extent_op(trans, bytenr, num_bytes, extent_op);
|
|
|
|
if (ret)
|
|
|
|
kfree(extent_op);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int check_delayed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct btrfs_delayed_data_ref *data_ref;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
ret = -ENOENT;
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (!head)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (!mutex_trylock(&head->mutex)) {
|
|
|
|
atomic_inc(&head->node.refs);
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-05-02 21:29:25 +08:00
|
|
|
/*
|
|
|
|
* Mutex was contended, block until it's released and let
|
|
|
|
* caller try again
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
mutex_lock(&head->mutex);
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
btrfs_put_delayed_ref(&head->node);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
if (!node)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
|
|
|
|
if (ref->bytenr != bytenr)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ret = 1;
|
|
|
|
if (ref->type != BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
data_ref = btrfs_delayed_node_to_data_ref(ref);
|
|
|
|
|
|
|
|
node = rb_prev(node);
|
|
|
|
if (node) {
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
if (ref->bytenr == bytenr)
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (data_ref->root != root->root_key.objectid ||
|
|
|
|
data_ref->objectid != objectid || data_ref->offset != offset)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&head->mutex);
|
|
|
|
out:
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int check_committed_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
2007-12-18 09:14:01 +08:00
|
|
|
{
|
|
|
|
struct btrfs_root *extent_root = root->fs_info->extent_root;
|
2008-07-30 21:26:11 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
struct btrfs_extent_item *ei;
|
2008-07-30 21:26:11 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 item_size;
|
2007-12-18 09:14:01 +08:00
|
|
|
int ret;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2007-12-18 09:14:01 +08:00
|
|
|
key.objectid = bytenr;
|
2008-09-24 01:14:14 +08:00
|
|
|
key.offset = (u64)-1;
|
2008-07-30 21:26:11 +08:00
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
BUG_ON(ret == 0);
|
2008-10-31 02:20:02 +08:00
|
|
|
|
|
|
|
ret = -ENOENT;
|
|
|
|
if (path->slots[0] == 0)
|
2008-09-24 01:14:14 +08:00
|
|
|
goto out;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
path->slots[0]--;
|
2008-07-30 21:26:11 +08:00
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
|
2007-12-18 09:14:01 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.objectid != bytenr || key.type != BTRFS_EXTENT_ITEM_KEY)
|
2007-12-18 09:14:01 +08:00
|
|
|
goto out;
|
2008-07-30 21:26:11 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = 1;
|
|
|
|
item_size = btrfs_item_size_nr(leaf, path->slots[0]);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
WARN_ON(item_size != sizeof(struct btrfs_extent_item_v0));
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item);
|
2008-01-04 02:23:19 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (item_size != sizeof(*ei) +
|
|
|
|
btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
|
|
|
|
goto out;
|
2007-12-18 09:14:01 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (btrfs_extent_generation(leaf, ei) <=
|
|
|
|
btrfs_root_last_snapshot(&root->root_item))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(ei + 1);
|
|
|
|
if (btrfs_extent_inline_ref_type(leaf, iref) !=
|
|
|
|
BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
if (btrfs_extent_refs(leaf, ei) !=
|
|
|
|
btrfs_extent_data_ref_count(leaf, ref) ||
|
|
|
|
btrfs_extent_data_ref_root(leaf, ref) !=
|
|
|
|
root->root_key.objectid ||
|
|
|
|
btrfs_extent_data_ref_objectid(leaf, ref) != objectid ||
|
|
|
|
btrfs_extent_data_ref_offset(leaf, ref) != offset)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
|
|
|
int ret2;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
do {
|
|
|
|
ret = check_committed_ref(trans, root, path, objectid,
|
|
|
|
offset, bytenr);
|
|
|
|
if (ret && ret != -ENOENT)
|
2008-07-30 21:26:11 +08:00
|
|
|
goto out;
|
2008-10-31 02:20:02 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret2 = check_delayed_ref(trans, root, path, objectid,
|
|
|
|
offset, bytenr);
|
|
|
|
} while (ret2 == -EAGAIN);
|
|
|
|
|
|
|
|
if (ret2 && ret2 != -ENOENT) {
|
|
|
|
ret = ret2;
|
|
|
|
goto out;
|
2008-07-30 21:26:11 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
if (ret != -ENOENT || ret2 != -ENOENT)
|
|
|
|
ret = 0;
|
2007-12-18 09:14:01 +08:00
|
|
|
out:
|
2008-10-31 02:20:02 +08:00
|
|
|
btrfs_free_path(path);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
|
|
|
|
WARN_ON(ret > 0);
|
2008-07-30 21:26:11 +08:00
|
|
|
return ret;
|
2007-12-18 09:14:01 +08:00
|
|
|
}
|
2007-04-10 21:27:04 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
|
2009-02-04 22:23:45 +08:00
|
|
|
struct btrfs_root *root,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct extent_buffer *buf,
|
|
|
|
int full_backref, int inc)
|
2008-09-24 01:14:14 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64 num_bytes;
|
|
|
|
u64 parent;
|
2008-09-24 01:14:14 +08:00
|
|
|
u64 ref_root;
|
|
|
|
u32 nritems;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
int i;
|
|
|
|
int level;
|
|
|
|
int ret = 0;
|
|
|
|
int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64, u64, u64, u64, u64, u64);
|
2008-09-24 01:14:14 +08:00
|
|
|
|
|
|
|
ref_root = btrfs_header_owner(buf);
|
|
|
|
nritems = btrfs_header_nritems(buf);
|
|
|
|
level = btrfs_header_level(buf);
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (!root->ref_cows && level == 0)
|
|
|
|
return 0;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (inc)
|
|
|
|
process_func = btrfs_inc_extent_ref;
|
|
|
|
else
|
|
|
|
process_func = btrfs_free_extent;
|
2008-09-24 01:14:14 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (full_backref)
|
|
|
|
parent = buf->start;
|
|
|
|
else
|
|
|
|
parent = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < nritems; i++) {
|
2008-09-24 01:14:14 +08:00
|
|
|
if (level == 0) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_item_key_to_cpu(buf, &key, i);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
|
|
|
|
continue;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
fi = btrfs_item_ptr(buf, i,
|
2008-09-24 01:14:14 +08:00
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
if (btrfs_file_extent_type(buf, fi) ==
|
|
|
|
BTRFS_FILE_EXTENT_INLINE)
|
|
|
|
continue;
|
|
|
|
bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
|
|
|
|
if (bytenr == 0)
|
|
|
|
continue;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
|
|
|
|
key.offset -= btrfs_file_extent_offset(buf, fi);
|
|
|
|
ret = process_func(trans, root, bytenr, num_bytes,
|
|
|
|
parent, ref_root, key.objectid,
|
|
|
|
key.offset);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
} else {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
bytenr = btrfs_node_blockptr(buf, i);
|
|
|
|
num_bytes = btrfs_level_size(root, level - 1);
|
|
|
|
ret = process_func(trans, root, bytenr, num_bytes,
|
|
|
|
parent, ref_root, level - 1, 0);
|
2008-09-24 01:14:14 +08:00
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
fail:
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf, int full_backref)
|
|
|
|
{
|
|
|
|
return __btrfs_mod_ref(trans, root, buf, full_backref, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf, int full_backref)
|
|
|
|
{
|
|
|
|
return __btrfs_mod_ref(trans, root, buf, full_backref, 0);
|
2008-09-24 01:14:14 +08:00
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
static int write_one_cache_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *extent_root = root->fs_info->extent_root;
|
2007-10-16 04:14:19 +08:00
|
|
|
unsigned long bi;
|
|
|
|
struct extent_buffer *leaf;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
|
2007-06-23 02:16:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto fail;
|
2007-04-27 04:46:15 +08:00
|
|
|
BUG_ON(ret);
|
2007-10-16 04:14:19 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
|
|
|
|
write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2007-06-23 02:16:25 +08:00
|
|
|
fail:
|
2007-04-27 04:46:15 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
static struct btrfs_block_group_cache *
|
|
|
|
next_block_group(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
spin_lock(&root->fs_info->block_group_cache_lock);
|
|
|
|
node = rb_next(&cache->cache_node);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
if (node) {
|
|
|
|
cache = rb_entry(node, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(cache);
|
2009-07-22 22:07:05 +08:00
|
|
|
} else
|
|
|
|
cache = NULL;
|
|
|
|
spin_unlock(&root->fs_info->block_group_cache_lock);
|
|
|
|
return cache;
|
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
static int cache_save_setup(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = block_group->fs_info->tree_root;
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
u64 alloc_hint = 0;
|
2010-12-04 02:17:53 +08:00
|
|
|
int dcs = BTRFS_DC_ERROR;
|
2010-06-22 02:48:16 +08:00
|
|
|
int num_pages = 0;
|
|
|
|
int retries = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this block group is smaller than 100 megs don't bother caching the
|
|
|
|
* block group.
|
|
|
|
*/
|
|
|
|
if (block_group->key.offset < (100 * 1024 * 1024)) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->disk_cache_state = BTRFS_DC_WRITTEN;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
again:
|
|
|
|
inode = lookup_free_space_inode(root, block_group, path);
|
|
|
|
if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
|
|
|
|
ret = PTR_ERR(inode);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
BUG_ON(retries);
|
|
|
|
retries++;
|
|
|
|
|
|
|
|
if (block_group->ro)
|
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
ret = create_free_space_inode(root, trans, block_group, path);
|
|
|
|
if (ret)
|
|
|
|
goto out_free;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2011-10-06 20:58:24 +08:00
|
|
|
/* We've already setup this transaction, go ahead and exit */
|
|
|
|
if (block_group->cache_generation == trans->transid &&
|
|
|
|
i_size_read(inode)) {
|
|
|
|
dcs = BTRFS_DC_SETUP;
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
/*
|
|
|
|
* We want to set the generation to 0, that way if anything goes wrong
|
|
|
|
* from here on out we know not to trust this cache when we load up next
|
|
|
|
* time.
|
|
|
|
*/
|
|
|
|
BTRFS_I(inode)->generation = 0;
|
|
|
|
ret = btrfs_update_inode(trans, root, inode);
|
|
|
|
WARN_ON(ret);
|
|
|
|
|
|
|
|
if (i_size_read(inode) > 0) {
|
|
|
|
ret = btrfs_truncate_free_space_cache(root, trans, path,
|
|
|
|
inode);
|
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->cached != BTRFS_CACHE_FINISHED) {
|
2010-12-04 02:17:53 +08:00
|
|
|
/* We're not cached, don't bother trying to write stuff out */
|
|
|
|
dcs = BTRFS_DC_WRITTEN;
|
2010-06-22 02:48:16 +08:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
goto out_put;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
num_pages = (int)div64_u64(block_group->key.offset, 1024 * 1024 * 1024);
|
|
|
|
if (!num_pages)
|
|
|
|
num_pages = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Just to make absolutely sure we have enough space, we're going to
|
|
|
|
* preallocate 12 pages worth of space for each block group. In
|
|
|
|
* practice we ought to use at most 8, but we need extra space so we can
|
|
|
|
* add our header and have a terminator between the extents and the
|
|
|
|
* bitmaps.
|
|
|
|
*/
|
|
|
|
num_pages *= 16;
|
|
|
|
num_pages *= PAGE_CACHE_SIZE;
|
|
|
|
|
2011-10-06 20:58:24 +08:00
|
|
|
ret = btrfs_check_data_free_space(inode, num_pages);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (ret)
|
|
|
|
goto out_put;
|
|
|
|
|
|
|
|
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
|
|
|
|
num_pages, num_pages,
|
|
|
|
&alloc_hint);
|
2011-10-06 20:58:24 +08:00
|
|
|
if (!ret)
|
2010-12-04 02:17:53 +08:00
|
|
|
dcs = BTRFS_DC_SETUP;
|
2011-10-06 20:58:24 +08:00
|
|
|
btrfs_free_reserved_data_space(inode, num_pages);
|
2011-08-30 22:19:10 +08:00
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
out_put:
|
|
|
|
iput(inode);
|
|
|
|
out_free:
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
out:
|
|
|
|
spin_lock(&block_group->lock);
|
2011-12-14 05:04:54 +08:00
|
|
|
if (!ret && dcs == BTRFS_DC_SETUP)
|
2011-10-06 20:58:24 +08:00
|
|
|
block_group->cache_generation = trans->transid;
|
2010-12-04 02:17:53 +08:00
|
|
|
block_group->disk_cache_state = dcs;
|
2010-06-22 02:48:16 +08:00
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2007-10-16 04:15:19 +08:00
|
|
|
int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
2007-04-27 04:46:15 +08:00
|
|
|
{
|
2009-07-22 22:07:05 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-04-27 04:46:15 +08:00
|
|
|
int err = 0;
|
|
|
|
struct btrfs_path *path;
|
2007-10-16 04:15:19 +08:00
|
|
|
u64 last = 0;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
again:
|
|
|
|
while (1) {
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_CLEAR)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
err = cache_save_setup(cache, trans, path);
|
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2009-07-22 22:07:05 +08:00
|
|
|
if (last == 0) {
|
|
|
|
err = btrfs_run_delayed_refs(trans, root,
|
|
|
|
(unsigned long)-1);
|
|
|
|
BUG_ON(err);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2007-06-23 02:16:25 +08:00
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
2010-06-22 02:48:16 +08:00
|
|
|
if (cache->disk_cache_state == BTRFS_DC_CLEAR) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
if (cache->dirty)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2010-07-03 00:14:14 +08:00
|
|
|
if (cache->disk_cache_state == BTRFS_DC_SETUP)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_NEED_WRITE;
|
2008-09-26 22:05:48 +08:00
|
|
|
cache->dirty = 0;
|
2009-07-22 22:07:05 +08:00
|
|
|
last = cache->key.objectid + cache->key.offset;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2009-07-22 22:07:05 +08:00
|
|
|
err = write_one_cache_group(trans, root, path, cache);
|
|
|
|
BUG_ON(err);
|
|
|
|
btrfs_put_block_group(cache);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2009-07-22 22:07:05 +08:00
|
|
|
|
2010-07-03 00:14:14 +08:00
|
|
|
while (1) {
|
|
|
|
/*
|
|
|
|
* I don't think this is needed since we're just marking our
|
|
|
|
* preallocated extent as written, but just in case it can't
|
|
|
|
* hurt.
|
|
|
|
*/
|
|
|
|
if (last == 0) {
|
|
|
|
err = btrfs_run_delayed_refs(trans, root,
|
|
|
|
(unsigned long)-1);
|
|
|
|
BUG_ON(err);
|
|
|
|
}
|
|
|
|
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, last);
|
|
|
|
while (cache) {
|
|
|
|
/*
|
|
|
|
* Really this shouldn't happen, but it could if we
|
|
|
|
* couldn't write the entire preallocated extent and
|
|
|
|
* splitting the extent resulted in a new block.
|
|
|
|
*/
|
|
|
|
if (cache->dirty) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_NEED_WRITE)
|
|
|
|
break;
|
|
|
|
cache = next_block_group(root, cache);
|
|
|
|
}
|
|
|
|
if (!cache) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_write_out_cache(root, trans, cache, path);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we didn't have an error then the cache state is still
|
|
|
|
* NEED_WRITE, so we can set it to WRITTEN.
|
|
|
|
*/
|
|
|
|
if (cache->disk_cache_state == BTRFS_DC_NEED_WRITE)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_WRITTEN;
|
|
|
|
last = cache->key.objectid + cache->key.offset;
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_free_path(path);
|
2009-07-22 22:07:05 +08:00
|
|
|
return 0;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
int readonly = 0;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
|
|
|
|
if (!block_group || block_group->ro)
|
|
|
|
readonly = 1;
|
|
|
|
if (block_group)
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2008-12-12 05:30:39 +08:00
|
|
|
return readonly;
|
|
|
|
}
|
|
|
|
|
2008-03-26 04:50:33 +08:00
|
|
|
static int update_space_info(struct btrfs_fs_info *info, u64 flags,
|
|
|
|
u64 total_bytes, u64 bytes_used,
|
|
|
|
struct btrfs_space_info **space_info)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *found;
|
2010-05-16 22:46:24 +08:00
|
|
|
int i;
|
|
|
|
int factor;
|
|
|
|
|
|
|
|
if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
2008-03-26 04:50:33 +08:00
|
|
|
|
|
|
|
found = __find_space_info(info, flags);
|
|
|
|
if (found) {
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&found->lock);
|
2008-03-26 04:50:33 +08:00
|
|
|
found->total_bytes += total_bytes;
|
2010-10-15 02:52:27 +08:00
|
|
|
found->disk_total += total_bytes * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_used += bytes_used;
|
2010-05-16 22:46:24 +08:00
|
|
|
found->disk_used += bytes_used * factor;
|
2008-04-26 04:53:30 +08:00
|
|
|
found->full = 0;
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&found->lock);
|
2008-03-26 04:50:33 +08:00
|
|
|
*space_info = found;
|
|
|
|
return 0;
|
|
|
|
}
|
2008-11-13 03:34:12 +08:00
|
|
|
found = kzalloc(sizeof(*found), GFP_NOFS);
|
2008-03-26 04:50:33 +08:00
|
|
|
if (!found)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
|
|
|
|
INIT_LIST_HEAD(&found->block_groups[i]);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
init_rwsem(&found->groups_sem);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_lock_init(&found->lock);
|
2010-05-16 22:46:24 +08:00
|
|
|
found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
|
|
|
|
BTRFS_BLOCK_GROUP_SYSTEM |
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA);
|
2008-03-26 04:50:33 +08:00
|
|
|
found->total_bytes = total_bytes;
|
2010-10-15 02:52:27 +08:00
|
|
|
found->disk_total = total_bytes * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_used = bytes_used;
|
2010-05-16 22:46:24 +08:00
|
|
|
found->disk_used = bytes_used * factor;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->bytes_pinned = 0;
|
2008-09-26 22:05:48 +08:00
|
|
|
found->bytes_reserved = 0;
|
2008-11-13 03:34:12 +08:00
|
|
|
found->bytes_readonly = 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
found->bytes_may_use = 0;
|
2008-03-26 04:50:33 +08:00
|
|
|
found->full = 0;
|
2011-04-16 04:05:44 +08:00
|
|
|
found->force_alloc = CHUNK_ALLOC_NO_FORCE;
|
2011-04-12 08:20:11 +08:00
|
|
|
found->chunk_alloc = 0;
|
2011-06-08 04:07:44 +08:00
|
|
|
found->flush = 0;
|
|
|
|
init_waitqueue_head(&found->wait);
|
2008-03-26 04:50:33 +08:00
|
|
|
*space_info = found;
|
2009-03-11 00:39:20 +08:00
|
|
|
list_add_rcu(&found->list, &info->space_info);
|
2008-03-26 04:50:33 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-04-04 04:29:03 +08:00
|
|
|
static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
|
|
|
|
{
|
|
|
|
u64 extra_flags = flags & (BTRFS_BLOCK_GROUP_RAID0 |
|
2008-04-04 04:29:03 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
2008-04-16 22:49:51 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID10 |
|
2008-04-04 04:29:03 +08:00
|
|
|
BTRFS_BLOCK_GROUP_DUP);
|
2008-04-04 04:29:03 +08:00
|
|
|
if (extra_flags) {
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
fs_info->avail_data_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
fs_info->avail_metadata_alloc_bits |= extra_flags;
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
fs_info->avail_system_alloc_bits |= extra_flags;
|
|
|
|
}
|
|
|
|
}
|
2008-03-26 04:50:33 +08:00
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags)
|
2008-04-29 03:29:52 +08:00
|
|
|
{
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* we add in the count of missing devices because we want
|
|
|
|
* to make sure that any RAID levels on a degraded FS
|
|
|
|
* continue to be honored.
|
|
|
|
*/
|
|
|
|
u64 num_devices = root->fs_info->fs_devices->rw_devices +
|
|
|
|
root->fs_info->fs_devices->missing_devices;
|
2008-05-07 23:43:44 +08:00
|
|
|
|
|
|
|
if (num_devices == 1)
|
|
|
|
flags &= ~(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID0);
|
|
|
|
if (num_devices < 4)
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
if ((flags & BTRFS_BLOCK_GROUP_DUP) &&
|
|
|
|
(flags & (BTRFS_BLOCK_GROUP_RAID1 |
|
2008-05-07 23:43:44 +08:00
|
|
|
BTRFS_BLOCK_GROUP_RAID10))) {
|
2008-04-29 03:29:52 +08:00
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_DUP;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
if ((flags & BTRFS_BLOCK_GROUP_RAID1) &&
|
2008-05-07 23:43:44 +08:00
|
|
|
(flags & BTRFS_BLOCK_GROUP_RAID10)) {
|
2008-04-29 03:29:52 +08:00
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_RAID1;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
|
|
|
if ((flags & BTRFS_BLOCK_GROUP_RAID0) &&
|
|
|
|
((flags & BTRFS_BLOCK_GROUP_RAID1) |
|
|
|
|
(flags & BTRFS_BLOCK_GROUP_RAID10) |
|
|
|
|
(flags & BTRFS_BLOCK_GROUP_DUP)))
|
|
|
|
flags &= ~BTRFS_BLOCK_GROUP_RAID0;
|
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:46:24 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
flags |= root->fs_info->avail_data_alloc_bits &
|
|
|
|
root->fs_info->data_alloc_profile;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
|
|
|
|
flags |= root->fs_info->avail_system_alloc_bits &
|
|
|
|
root->fs_info->system_alloc_profile;
|
|
|
|
else if (flags & BTRFS_BLOCK_GROUP_METADATA)
|
|
|
|
flags |= root->fs_info->avail_metadata_alloc_bits &
|
|
|
|
root->fs_info->metadata_alloc_profile;
|
|
|
|
return btrfs_reduce_alloc_profile(root, flags);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 18:07:31 +08:00
|
|
|
u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:24 +08:00
|
|
|
u64 flags;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
if (data)
|
|
|
|
flags = BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
else if (root == root->fs_info->chunk_root)
|
|
|
|
flags = BTRFS_BLOCK_GROUP_SYSTEM;
|
2009-09-12 04:12:44 +08:00
|
|
|
else
|
2010-05-16 22:46:24 +08:00
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
return get_alloc_profile(root, flags);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *inode)
|
|
|
|
{
|
|
|
|
BTRFS_I(inode)->space_info = __find_space_info(root->fs_info,
|
2010-05-16 22:46:25 +08:00
|
|
|
BTRFS_BLOCK_GROUP_DATA);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/*
|
|
|
|
* This will check the space that the inode allocates from to make sure we have
|
|
|
|
* enough space for bytes.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *data_sinfo;
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2010-03-19 22:38:13 +08:00
|
|
|
u64 used;
|
2010-06-22 02:48:16 +08:00
|
|
|
int ret = 0, committed = 0, alloc_chunk = 1;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
|
|
|
/* make sure bytes are sectorsize aligned */
|
|
|
|
bytes = (bytes + root->sectorsize - 1) & ~((u64)root->sectorsize - 1);
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
if (root == root->fs_info->tree_root ||
|
|
|
|
BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID) {
|
2010-06-22 02:48:16 +08:00
|
|
|
alloc_chunk = 0;
|
|
|
|
committed = 1;
|
|
|
|
}
|
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
data_sinfo = BTRFS_I(inode)->space_info;
|
2009-09-23 02:45:50 +08:00
|
|
|
if (!data_sinfo)
|
|
|
|
goto alloc;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
again:
|
|
|
|
/* make sure we have enough space to handle the data first */
|
|
|
|
spin_lock(&data_sinfo->lock);
|
2010-05-16 22:49:58 +08:00
|
|
|
used = data_sinfo->bytes_used + data_sinfo->bytes_reserved +
|
|
|
|
data_sinfo->bytes_pinned + data_sinfo->bytes_readonly +
|
|
|
|
data_sinfo->bytes_may_use;
|
2010-03-19 22:38:13 +08:00
|
|
|
|
|
|
|
if (used + bytes > data_sinfo->total_bytes) {
|
2009-02-20 23:59:53 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/*
|
|
|
|
* if we don't have enough free bytes in this space then we need
|
|
|
|
* to alloc a new chunk.
|
|
|
|
*/
|
2010-06-22 02:48:16 +08:00
|
|
|
if (!data_sinfo->full && alloc_chunk) {
|
2009-02-21 00:00:09 +08:00
|
|
|
u64 alloc_target;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
data_sinfo->force_alloc = CHUNK_ALLOC_FORCE;
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
2009-09-23 02:45:50 +08:00
|
|
|
alloc:
|
2009-02-21 00:00:09 +08:00
|
|
|
alloc_target = btrfs_get_alloc_profile(root, 1);
|
2011-04-14 00:54:33 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2010-05-16 22:48:46 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
ret = do_chunk_alloc(trans, root->fs_info->extent_root,
|
|
|
|
bytes + 2 * 1024 * 1024,
|
2011-04-16 04:05:44 +08:00
|
|
|
alloc_target,
|
|
|
|
CHUNK_ALLOC_NO_FORCE);
|
2009-02-21 00:00:09 +08:00
|
|
|
btrfs_end_transaction(trans, root);
|
2011-01-05 18:07:18 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
if (ret != -ENOSPC)
|
|
|
|
return ret;
|
|
|
|
else
|
|
|
|
goto commit_trans;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-09-23 02:45:50 +08:00
|
|
|
if (!data_sinfo) {
|
|
|
|
btrfs_set_inode_space_info(root, inode);
|
|
|
|
data_sinfo = BTRFS_I(inode)->space_info;
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2011-05-26 01:10:16 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have less pinned bytes than we want to allocate then
|
|
|
|
* don't bother committing the transaction, it won't help us.
|
|
|
|
*/
|
|
|
|
if (data_sinfo->bytes_pinned < bytes)
|
|
|
|
committed = 1;
|
2009-02-21 00:00:09 +08:00
|
|
|
spin_unlock(&data_sinfo->lock);
|
|
|
|
|
2009-02-20 23:59:53 +08:00
|
|
|
/* commit the current transaction and try again */
|
2011-01-05 18:07:18 +08:00
|
|
|
commit_trans:
|
2011-04-12 05:25:13 +08:00
|
|
|
if (!committed &&
|
|
|
|
!atomic_read(&root->fs_info->open_ioctl_trans)) {
|
2009-02-20 23:59:53 +08:00
|
|
|
committed = 1;
|
2011-04-14 00:54:33 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2010-05-16 22:48:46 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2009-02-20 23:59:53 +08:00
|
|
|
ret = btrfs_commit_transaction(trans, root);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
goto again;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
data_sinfo->bytes_may_use += bytes;
|
|
|
|
spin_unlock(&data_sinfo->lock);
|
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
|
|
|
/*
|
2011-07-27 05:00:46 +08:00
|
|
|
* Called if we need to clear a data reservation for this inode.
|
2009-02-21 00:00:09 +08:00
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
|
2009-10-08 08:44:34 +08:00
|
|
|
{
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2009-02-21 00:00:09 +08:00
|
|
|
struct btrfs_space_info *data_sinfo;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
/* make sure bytes are sectorsize aligned */
|
|
|
|
bytes = (bytes + root->sectorsize - 1) & ~((u64)root->sectorsize - 1);
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
data_sinfo = BTRFS_I(inode)->space_info;
|
|
|
|
spin_lock(&data_sinfo->lock);
|
|
|
|
data_sinfo->bytes_may_use -= bytes;
|
|
|
|
spin_unlock(&data_sinfo->lock);
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
static void force_metadata_allocation(struct btrfs_fs_info *info)
|
2009-10-08 08:44:34 +08:00
|
|
|
{
|
2009-04-22 05:40:57 +08:00
|
|
|
struct list_head *head = &info->space_info;
|
|
|
|
struct btrfs_space_info *found;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(found, head, list) {
|
|
|
|
if (found->flags & BTRFS_BLOCK_GROUP_METADATA)
|
2011-04-16 04:05:44 +08:00
|
|
|
found->force_alloc = CHUNK_ALLOC_FORCE;
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
2009-04-22 05:40:57 +08:00
|
|
|
rcu_read_unlock();
|
2009-10-08 08:44:34 +08:00
|
|
|
}
|
|
|
|
|
2010-10-27 01:37:56 +08:00
|
|
|
static int should_alloc_chunk(struct btrfs_root *root,
|
2011-04-16 04:05:44 +08:00
|
|
|
struct btrfs_space_info *sinfo, u64 alloc_bytes,
|
|
|
|
int force)
|
2009-10-09 01:34:05 +08:00
|
|
|
{
|
2011-07-27 05:00:46 +08:00
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
|
2011-04-16 04:05:44 +08:00
|
|
|
u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
|
2010-10-27 01:37:56 +08:00
|
|
|
u64 thresh;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
if (force == CHUNK_ALLOC_FORCE)
|
|
|
|
return 1;
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/*
|
|
|
|
* We need to take into account the global rsv because for all intents
|
|
|
|
* and purposes it's used space. Don't worry about locking the
|
|
|
|
* global_rsv, it doesn't change except when the transaction commits.
|
|
|
|
*/
|
|
|
|
num_allocated += global_rsv->size;
|
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
/*
|
|
|
|
* in limited mode, we want to have some free space up to
|
|
|
|
* about 1% of the FS size.
|
|
|
|
*/
|
|
|
|
if (force == CHUNK_ALLOC_LIMITED) {
|
2011-04-13 21:41:04 +08:00
|
|
|
thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
|
2011-04-16 04:05:44 +08:00
|
|
|
thresh = max_t(u64, 64 * 1024 * 1024,
|
|
|
|
div_factor_fine(thresh, 1));
|
|
|
|
|
|
|
|
if (num_bytes - num_allocated < thresh)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we have two similar checks here, one based on percentage
|
|
|
|
* and once based on a hard number of 256MB. The idea
|
|
|
|
* is that if we have a good amount of free
|
|
|
|
* room, don't allocate a chunk. A good mount is
|
|
|
|
* less than 80% utilized of the chunks we have allocated,
|
|
|
|
* or more than 256MB free
|
|
|
|
*/
|
|
|
|
if (num_allocated + alloc_bytes + 256 * 1024 * 1024 < num_bytes)
|
2010-05-16 22:46:25 +08:00
|
|
|
return 0;
|
2009-10-08 08:44:34 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
if (num_allocated + alloc_bytes < div_factor(num_bytes, 8))
|
2010-05-16 22:46:25 +08:00
|
|
|
return 0;
|
2009-10-09 01:34:05 +08:00
|
|
|
|
2011-04-13 21:41:04 +08:00
|
|
|
thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
|
2011-04-16 04:05:44 +08:00
|
|
|
|
|
|
|
/* 256MB or 5% of the FS */
|
2010-10-27 01:37:56 +08:00
|
|
|
thresh = max_t(u64, 256 * 1024 * 1024, div_factor_fine(thresh, 5));
|
|
|
|
|
|
|
|
if (num_bytes > thresh && sinfo->bytes_used < div_factor(num_bytes, 3))
|
2010-10-16 03:23:48 +08:00
|
|
|
return 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
return 1;
|
2009-10-09 01:34:05 +08:00
|
|
|
}
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
static int do_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *extent_root, u64 alloc_bytes,
|
2008-05-25 02:04:53 +08:00
|
|
|
u64 flags, int force)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2008-03-25 03:01:59 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-04-22 05:40:57 +08:00
|
|
|
struct btrfs_fs_info *fs_info = extent_root->fs_info;
|
2011-04-12 08:20:11 +08:00
|
|
|
int wait_for_alloc = 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
int ret = 0;
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
flags = btrfs_reduce_alloc_profile(extent_root, flags);
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
space_info = __find_space_info(extent_root->fs_info, flags);
|
2008-03-26 04:50:33 +08:00
|
|
|
if (!space_info) {
|
|
|
|
ret = update_space_info(extent_root->fs_info, flags,
|
|
|
|
0, 0, &space_info);
|
|
|
|
BUG_ON(ret);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
BUG_ON(!space_info);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-12 08:20:11 +08:00
|
|
|
again:
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
if (space_info->force_alloc)
|
2011-04-16 04:05:44 +08:00
|
|
|
force = space_info->force_alloc;
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
if (space_info->full) {
|
|
|
|
spin_unlock(&space_info->lock);
|
2011-04-12 08:20:11 +08:00
|
|
|
return 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
if (!should_alloc_chunk(extent_root, space_info, alloc_bytes, force)) {
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2011-04-12 08:20:11 +08:00
|
|
|
return 0;
|
|
|
|
} else if (space_info->chunk_alloc) {
|
|
|
|
wait_for_alloc = 1;
|
|
|
|
} else {
|
|
|
|
space_info->chunk_alloc = 1;
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2011-04-16 04:05:44 +08:00
|
|
|
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-04-12 08:20:11 +08:00
|
|
|
mutex_lock(&fs_info->chunk_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The chunk_mutex is held throughout the entirety of a chunk
|
|
|
|
* allocation, so once we've acquired the chunk_mutex we know that the
|
|
|
|
* other guy is done and we need to recheck and see if we should
|
|
|
|
* allocate.
|
|
|
|
*/
|
|
|
|
if (wait_for_alloc) {
|
|
|
|
mutex_unlock(&fs_info->chunk_mutex);
|
|
|
|
wait_for_alloc = 0;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
/*
|
|
|
|
* If we have mixed data/metadata chunks we want to make sure we keep
|
|
|
|
* allocating mixed chunks instead of individual chunks.
|
|
|
|
*/
|
|
|
|
if (btrfs_mixed_space_info(space_info))
|
|
|
|
flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
|
2009-04-22 05:40:57 +08:00
|
|
|
/*
|
|
|
|
* if we're doing a data chunk, go ahead and make sure that
|
|
|
|
* we keep a reasonable number of metadata chunks allocated in the
|
|
|
|
* FS as well.
|
|
|
|
*/
|
2009-09-12 04:12:44 +08:00
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DATA && fs_info->metadata_ratio) {
|
2009-04-22 05:40:57 +08:00
|
|
|
fs_info->data_chunk_allocations++;
|
|
|
|
if (!(fs_info->data_chunk_allocations %
|
|
|
|
fs_info->metadata_ratio))
|
|
|
|
force_metadata_allocation(fs_info);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2008-11-18 10:11:30 +08:00
|
|
|
ret = btrfs_alloc_chunk(trans, extent_root, flags);
|
2011-07-13 01:57:59 +08:00
|
|
|
if (ret < 0 && ret != -ENOSPC)
|
|
|
|
goto out;
|
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ret)
|
2008-03-25 03:01:59 +08:00
|
|
|
space_info->full = 1;
|
2010-05-16 22:46:25 +08:00
|
|
|
else
|
|
|
|
ret = 1;
|
2011-04-12 08:20:11 +08:00
|
|
|
|
2011-04-16 04:05:44 +08:00
|
|
|
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
|
2011-04-12 08:20:11 +08:00
|
|
|
space_info->chunk_alloc = 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2011-07-13 01:57:59 +08:00
|
|
|
out:
|
2008-11-13 03:34:12 +08:00
|
|
|
mutex_unlock(&extent_root->fs_info->chunk_mutex);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* shrink metadata reservation for delalloc
|
2009-09-12 04:12:44 +08:00
|
|
|
*/
|
2011-11-04 10:54:25 +08:00
|
|
|
static int shrink_delalloc(struct btrfs_root *root, u64 to_reclaim,
|
2011-10-15 01:56:58 +08:00
|
|
|
bool wait_ordered)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
2010-05-16 22:48:47 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2010-10-16 03:18:40 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2011-11-04 10:54:25 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 reserved;
|
|
|
|
u64 max_reclaim;
|
|
|
|
u64 reclaimed = 0;
|
2011-01-22 05:10:01 +08:00
|
|
|
long time_left;
|
2011-10-15 02:02:10 +08:00
|
|
|
unsigned long nr_pages = (2 * 1024 * 1024) >> PAGE_CACHE_SHIFT;
|
2011-01-22 05:10:01 +08:00
|
|
|
int loops = 0;
|
2011-03-12 20:08:42 +08:00
|
|
|
unsigned long progress;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2011-11-04 10:54:25 +08:00
|
|
|
trans = (struct btrfs_trans_handle *)current->journal_info;
|
2010-05-16 22:48:47 +08:00
|
|
|
block_rsv = &root->fs_info->delalloc_block_rsv;
|
2010-10-16 03:18:40 +08:00
|
|
|
space_info = block_rsv->space_info;
|
2010-10-27 01:40:45 +08:00
|
|
|
|
|
|
|
smp_mb();
|
2011-07-27 05:00:46 +08:00
|
|
|
reserved = space_info->bytes_may_use;
|
2011-03-12 20:08:42 +08:00
|
|
|
progress = space_info->reservation_progress;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
if (reserved == 0)
|
|
|
|
return 0;
|
2011-05-21 04:20:30 +08:00
|
|
|
|
2011-06-08 04:07:44 +08:00
|
|
|
smp_mb();
|
|
|
|
if (root->fs_info->delalloc_bytes == 0) {
|
|
|
|
if (trans)
|
|
|
|
return 0;
|
|
|
|
btrfs_wait_ordered_extents(root, 0, 0);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
max_reclaim = min(reserved, to_reclaim);
|
2011-10-15 02:02:10 +08:00
|
|
|
nr_pages = max_t(unsigned long, nr_pages,
|
|
|
|
max_reclaim >> PAGE_CACHE_SHIFT);
|
2011-01-22 05:10:01 +08:00
|
|
|
while (loops < 1024) {
|
2010-10-27 01:40:45 +08:00
|
|
|
/* have the flusher threads jump in and do some IO */
|
|
|
|
smp_mb();
|
|
|
|
nr_pages = min_t(unsigned long, nr_pages,
|
|
|
|
root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
|
|
|
|
writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2010-10-16 03:18:40 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-07-27 05:00:46 +08:00
|
|
|
if (reserved > space_info->bytes_may_use)
|
|
|
|
reclaimed += reserved - space_info->bytes_may_use;
|
|
|
|
reserved = space_info->bytes_may_use;
|
2010-10-16 03:18:40 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2011-03-12 20:08:42 +08:00
|
|
|
loops++;
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (reserved == 0 || reclaimed >= max_reclaim)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (trans && trans->transaction->blocked)
|
|
|
|
return -EAGAIN;
|
2010-10-27 01:40:45 +08:00
|
|
|
|
2011-10-15 01:56:58 +08:00
|
|
|
if (wait_ordered && !trans) {
|
|
|
|
btrfs_wait_ordered_extents(root, 0, 0);
|
|
|
|
} else {
|
|
|
|
time_left = schedule_timeout_interruptible(1);
|
2011-01-22 05:10:01 +08:00
|
|
|
|
2011-10-15 01:56:58 +08:00
|
|
|
/* We were interrupted, exit */
|
|
|
|
if (time_left)
|
|
|
|
break;
|
|
|
|
}
|
2011-01-22 05:10:01 +08:00
|
|
|
|
2011-03-12 20:08:42 +08:00
|
|
|
/* we've kicked the IO a few times, if anything has been freed,
|
|
|
|
* exit. There is no sense in looping here for a long time
|
|
|
|
* when we really need to commit the transaction, or there are
|
|
|
|
* just too many writers without enough free space
|
|
|
|
*/
|
|
|
|
|
|
|
|
if (loops > 3) {
|
|
|
|
smp_mb();
|
|
|
|
if (progress != space_info->reservation_progress)
|
|
|
|
break;
|
|
|
|
}
|
2010-10-27 01:40:45 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2011-10-15 01:56:58 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
return reclaimed >= to_reclaim;
|
|
|
|
}
|
|
|
|
|
2011-11-04 10:54:25 +08:00
|
|
|
/**
|
|
|
|
* maybe_commit_transaction - possibly commit the transaction if its ok to
|
|
|
|
* @root - the root we're allocating for
|
|
|
|
* @bytes - the number of bytes we want to reserve
|
|
|
|
* @force - force the commit
|
|
|
|
*
|
|
|
|
* This will check to make sure that committing the transaction will actually
|
|
|
|
* get us somewhere and then commit the transaction if it does. Otherwise it
|
|
|
|
* will return -ENOSPC.
|
|
|
|
*/
|
|
|
|
static int may_commit_transaction(struct btrfs_root *root,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
u64 bytes, int force)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *delayed_rsv = &root->fs_info->delayed_block_rsv;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
|
|
|
|
trans = (struct btrfs_trans_handle *)current->journal_info;
|
|
|
|
if (trans)
|
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
if (force)
|
|
|
|
goto commit;
|
|
|
|
|
|
|
|
/* See if there is enough pinned space to make this reservation */
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (space_info->bytes_pinned >= bytes) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
goto commit;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See if there is some space in the delayed insertion reservation for
|
|
|
|
* this reservation.
|
|
|
|
*/
|
|
|
|
if (space_info != delayed_rsv->space_info)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
spin_lock(&delayed_rsv->lock);
|
|
|
|
if (delayed_rsv->size < bytes) {
|
|
|
|
spin_unlock(&delayed_rsv->lock);
|
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
spin_unlock(&delayed_rsv->lock);
|
|
|
|
|
|
|
|
commit:
|
|
|
|
trans = btrfs_join_transaction(root);
|
|
|
|
if (IS_ERR(trans))
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
return btrfs_commit_transaction(trans, root);
|
|
|
|
}
|
|
|
|
|
2011-08-31 00:34:28 +08:00
|
|
|
/**
|
|
|
|
* reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
|
|
|
|
* @root - the root we're allocating for
|
|
|
|
* @block_rsv - the block_rsv we're allocating for
|
|
|
|
* @orig_bytes - the number of bytes we want
|
|
|
|
* @flush - wether or not we can flush to make our reservation
|
2010-10-16 04:52:49 +08:00
|
|
|
*
|
2011-08-31 00:34:28 +08:00
|
|
|
* This will reserve orgi_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
2010-10-16 04:52:49 +08:00
|
|
|
*/
|
2011-08-31 00:34:28 +08:00
|
|
|
static int reserve_metadata_bytes(struct btrfs_root *root,
|
2010-10-16 04:52:49 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv,
|
2011-10-19 00:15:48 +08:00
|
|
|
u64 orig_bytes, int flush)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *space_info = block_rsv->space_info;
|
2011-09-27 05:12:22 +08:00
|
|
|
u64 used;
|
2010-10-16 04:52:49 +08:00
|
|
|
u64 num_bytes = orig_bytes;
|
|
|
|
int retries = 0;
|
|
|
|
int ret = 0;
|
2010-10-27 00:52:53 +08:00
|
|
|
bool committed = false;
|
2011-06-08 04:07:44 +08:00
|
|
|
bool flushing = false;
|
2011-10-15 01:56:58 +08:00
|
|
|
bool wait_ordered = false;
|
2011-08-31 00:34:28 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
again:
|
2011-06-08 04:07:44 +08:00
|
|
|
ret = 0;
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-06-08 04:07:44 +08:00
|
|
|
/*
|
|
|
|
* We only want to wait if somebody other than us is flushing and we are
|
|
|
|
* actually alloed to flush.
|
|
|
|
*/
|
|
|
|
while (flush && !flushing && space_info->flush) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
/*
|
|
|
|
* If we have a trans handle we can't wait because the flusher
|
|
|
|
* may have to commit the transaction, which would mean we would
|
|
|
|
* deadlock since we are waiting for the flusher to finish, but
|
|
|
|
* hold the current transaction open.
|
|
|
|
*/
|
2011-11-04 10:54:25 +08:00
|
|
|
if (current->journal_info)
|
2011-06-08 04:07:44 +08:00
|
|
|
return -EAGAIN;
|
|
|
|
ret = wait_event_interruptible(space_info->wait,
|
|
|
|
!space_info->flush);
|
|
|
|
/* Must have been interrupted, return */
|
|
|
|
if (ret)
|
|
|
|
return -EINTR;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = -ENOSPC;
|
2011-09-27 05:12:22 +08:00
|
|
|
used = space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
space_info->bytes_pinned + space_info->bytes_readonly +
|
|
|
|
space_info->bytes_may_use;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* The idea here is that we've not already over-reserved the block group
|
|
|
|
* then we can go ahead and save our reservation first and then start
|
|
|
|
* flushing if we need to. Otherwise if we've already overcommitted
|
|
|
|
* lets start flushing stuff first and then come back and try to make
|
|
|
|
* our reservation.
|
|
|
|
*/
|
2011-09-27 05:12:22 +08:00
|
|
|
if (used <= space_info->total_bytes) {
|
|
|
|
if (used + orig_bytes <= space_info->total_bytes) {
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_may_use += orig_bytes;
|
2010-10-16 04:52:49 +08:00
|
|
|
ret = 0;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Ok set num_bytes to orig_bytes since we aren't
|
|
|
|
* overocmmitted, this way we only try and reclaim what
|
|
|
|
* we need.
|
|
|
|
*/
|
|
|
|
num_bytes = orig_bytes;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Ok we're over committed, set num_bytes to the overcommitted
|
|
|
|
* amount plus the amount of bytes that we need for this
|
|
|
|
* reservation.
|
|
|
|
*/
|
2011-10-15 01:56:58 +08:00
|
|
|
wait_ordered = true;
|
2011-09-27 05:12:22 +08:00
|
|
|
num_bytes = used - space_info->total_bytes +
|
2010-10-16 04:52:49 +08:00
|
|
|
(orig_bytes * (retries + 1));
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-10-19 00:15:48 +08:00
|
|
|
if (ret) {
|
2011-09-27 05:12:22 +08:00
|
|
|
u64 profile = btrfs_get_alloc_profile(root, 0);
|
|
|
|
u64 avail;
|
|
|
|
|
2011-10-19 01:07:31 +08:00
|
|
|
/*
|
|
|
|
* If we have a lot of space that's pinned, don't bother doing
|
|
|
|
* the overcommit dance yet and just commit the transaction.
|
|
|
|
*/
|
|
|
|
avail = (space_info->total_bytes - space_info->bytes_used) * 8;
|
|
|
|
do_div(avail, 10);
|
2011-11-04 10:54:25 +08:00
|
|
|
if (space_info->bytes_pinned >= avail && flush && !committed) {
|
2011-10-19 01:07:31 +08:00
|
|
|
space_info->flush = 1;
|
|
|
|
flushing = true;
|
|
|
|
spin_unlock(&space_info->lock);
|
2011-11-04 10:54:25 +08:00
|
|
|
ret = may_commit_transaction(root, space_info,
|
|
|
|
orig_bytes, 1);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
committed = true;
|
|
|
|
goto again;
|
2011-10-19 01:07:31 +08:00
|
|
|
}
|
|
|
|
|
2011-09-27 05:12:22 +08:00
|
|
|
spin_lock(&root->fs_info->free_chunk_lock);
|
|
|
|
avail = root->fs_info->free_chunk_space;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have dup, raid1 or raid10 then only half of the free
|
|
|
|
* space is actually useable.
|
|
|
|
*/
|
|
|
|
if (profile & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
avail >>= 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we aren't flushing don't let us overcommit too much, say
|
|
|
|
* 1/8th of the space. If we can flush, let it overcommit up to
|
|
|
|
* 1/2 of the space.
|
|
|
|
*/
|
|
|
|
if (flush)
|
|
|
|
avail >>= 3;
|
|
|
|
else
|
|
|
|
avail >>= 1;
|
|
|
|
spin_unlock(&root->fs_info->free_chunk_lock);
|
|
|
|
|
2011-10-06 04:35:28 +08:00
|
|
|
if (used + num_bytes < space_info->total_bytes + avail) {
|
2011-09-27 05:12:22 +08:00
|
|
|
space_info->bytes_may_use += orig_bytes;
|
|
|
|
ret = 0;
|
2011-10-15 01:56:58 +08:00
|
|
|
} else {
|
|
|
|
wait_ordered = true;
|
2011-09-27 05:12:22 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* Couldn't make our reservation, save our place so while we're trying
|
|
|
|
* to reclaim space we can actually use it instead of somebody else
|
|
|
|
* stealing it from us.
|
|
|
|
*/
|
2011-06-08 04:07:44 +08:00
|
|
|
if (ret && flush) {
|
|
|
|
flushing = true;
|
|
|
|
space_info->flush = 1;
|
2010-10-16 04:52:49 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-06-08 04:07:44 +08:00
|
|
|
if (!ret || !flush)
|
2010-10-16 04:52:49 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* We do synchronous shrinking since we don't actually unreserve
|
|
|
|
* metadata until after the IO is completed.
|
|
|
|
*/
|
2011-11-04 10:54:25 +08:00
|
|
|
ret = shrink_delalloc(root, num_bytes, wait_ordered);
|
2011-06-08 04:07:44 +08:00
|
|
|
if (ret < 0)
|
2010-10-16 04:52:49 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2011-07-28 03:57:44 +08:00
|
|
|
ret = 0;
|
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
/*
|
|
|
|
* So if we were overcommitted it's possible that somebody else flushed
|
|
|
|
* out enough space and we simply didn't have enough space to reclaim,
|
|
|
|
* so go back around and try again.
|
|
|
|
*/
|
|
|
|
if (retries < 2) {
|
2011-10-15 01:56:58 +08:00
|
|
|
wait_ordered = true;
|
2010-10-16 04:52:49 +08:00
|
|
|
retries++;
|
|
|
|
goto again;
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2010-10-16 04:52:49 +08:00
|
|
|
ret = -ENOSPC;
|
2011-07-28 03:57:44 +08:00
|
|
|
if (committed)
|
|
|
|
goto out;
|
|
|
|
|
2011-11-04 10:54:25 +08:00
|
|
|
ret = may_commit_transaction(root, space_info, orig_bytes, 0);
|
2010-10-27 00:52:53 +08:00
|
|
|
if (!ret) {
|
|
|
|
committed = true;
|
2010-10-16 04:52:49 +08:00
|
|
|
goto again;
|
2010-10-27 00:52:53 +08:00
|
|
|
}
|
2010-10-16 04:52:49 +08:00
|
|
|
|
|
|
|
out:
|
2011-06-08 04:07:44 +08:00
|
|
|
if (flushing) {
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-06-08 04:07:44 +08:00
|
|
|
space_info->flush = 0;
|
|
|
|
wake_up_all(&space_info->wait);
|
2010-10-16 04:52:49 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct btrfs_block_rsv *get_block_rsv(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
|
|
|
{
|
2011-08-30 23:31:29 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv = NULL;
|
|
|
|
|
|
|
|
if (root->ref_cows || root == root->fs_info->csum_root)
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = trans->block_rsv;
|
2011-08-30 23:31:29 +08:00
|
|
|
|
|
|
|
if (!block_rsv)
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = root->block_rsv;
|
|
|
|
|
|
|
|
if (!block_rsv)
|
|
|
|
block_rsv = &root->fs_info->empty_block_rsv;
|
|
|
|
|
|
|
|
return block_rsv;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret = -ENOSPC;
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
if (block_rsv->reserved >= num_bytes) {
|
|
|
|
block_rsv->reserved -= num_bytes;
|
|
|
|
if (block_rsv->reserved < block_rsv->size)
|
|
|
|
block_rsv->full = 0;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void block_rsv_add_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes, int update_size)
|
|
|
|
{
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
block_rsv->reserved += num_bytes;
|
|
|
|
if (update_size)
|
|
|
|
block_rsv->size += num_bytes;
|
|
|
|
else if (block_rsv->reserved >= block_rsv->size)
|
|
|
|
block_rsv->full = 1;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
}
|
|
|
|
|
2011-04-20 21:52:26 +08:00
|
|
|
static void block_rsv_release_bytes(struct btrfs_block_rsv *block_rsv,
|
|
|
|
struct btrfs_block_rsv *dest, u64 num_bytes)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info = block_rsv->space_info;
|
|
|
|
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
if (num_bytes == (u64)-1)
|
|
|
|
num_bytes = block_rsv->size;
|
|
|
|
block_rsv->size -= num_bytes;
|
|
|
|
if (block_rsv->reserved >= block_rsv->size) {
|
|
|
|
num_bytes = block_rsv->reserved - block_rsv->size;
|
|
|
|
block_rsv->reserved = block_rsv->size;
|
|
|
|
block_rsv->full = 1;
|
|
|
|
} else {
|
|
|
|
num_bytes = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
|
|
|
|
if (num_bytes > 0) {
|
|
|
|
if (dest) {
|
2011-01-25 05:43:19 +08:00
|
|
|
spin_lock(&dest->lock);
|
|
|
|
if (!dest->full) {
|
|
|
|
u64 bytes_to_add;
|
|
|
|
|
|
|
|
bytes_to_add = dest->size - dest->reserved;
|
|
|
|
bytes_to_add = min(num_bytes, bytes_to_add);
|
|
|
|
dest->reserved += bytes_to_add;
|
|
|
|
if (dest->reserved >= dest->size)
|
|
|
|
dest->full = 1;
|
|
|
|
num_bytes -= bytes_to_add;
|
|
|
|
}
|
|
|
|
spin_unlock(&dest->lock);
|
|
|
|
}
|
|
|
|
if (num_bytes) {
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&space_info->lock);
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_may_use -= num_bytes;
|
2011-03-12 20:08:42 +08:00
|
|
|
space_info->reservation_progress++;
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2009-02-20 23:59:53 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static int block_rsv_migrate_bytes(struct btrfs_block_rsv *src,
|
|
|
|
struct btrfs_block_rsv *dst, u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = block_rsv_use_bytes(src, num_bytes);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv_add_bytes(dst, num_bytes, 1);
|
2009-09-12 04:12:44 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv)
|
2009-09-12 04:12:44 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
memset(rsv, 0, sizeof(*rsv));
|
|
|
|
spin_lock_init(&rsv->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = kmalloc(sizeof(*block_rsv), GFP_NOFS);
|
|
|
|
if (!block_rsv)
|
|
|
|
return NULL;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_init_block_rsv(block_rsv);
|
|
|
|
block_rsv->space_info = __find_space_info(fs_info,
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
return block_rsv;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_free_block_rsv(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv)
|
|
|
|
{
|
2011-08-09 00:50:18 +08:00
|
|
|
btrfs_block_rsv_release(root, rsv, (u64)-1);
|
|
|
|
kfree(rsv);
|
2009-09-12 04:12:44 +08:00
|
|
|
}
|
|
|
|
|
2011-11-11 09:45:05 +08:00
|
|
|
static inline int __block_rsv_add(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes, int flush)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (num_bytes == 0)
|
|
|
|
return 0;
|
2010-10-16 04:52:49 +08:00
|
|
|
|
2011-11-11 09:45:05 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret) {
|
|
|
|
block_rsv_add_bytes(block_rsv, num_bytes, 1);
|
|
|
|
return 0;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-11-11 09:45:05 +08:00
|
|
|
int btrfs_block_rsv_add(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
return __block_rsv_add(root, block_rsv, num_bytes, 1);
|
|
|
|
}
|
|
|
|
|
Btrfs: fix delayed insertion reservation
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-05 07:56:02 +08:00
|
|
|
int btrfs_block_rsv_add_noflush(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
2011-11-11 09:45:05 +08:00
|
|
|
return __block_rsv_add(root, block_rsv, num_bytes, 0);
|
Btrfs: fix delayed insertion reservation
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-05 07:56:02 +08:00
|
|
|
}
|
|
|
|
|
2011-08-31 00:34:28 +08:00
|
|
|
int btrfs_block_rsv_check(struct btrfs_root *root,
|
2011-10-19 00:15:48 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv, int min_factor)
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
u64 num_bytes = 0;
|
|
|
|
int ret = -ENOSPC;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!block_rsv)
|
|
|
|
return 0;
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&block_rsv->lock);
|
2011-10-19 00:15:48 +08:00
|
|
|
num_bytes = div_factor(block_rsv->size, min_factor);
|
|
|
|
if (block_rsv->reserved >= num_bytes)
|
|
|
|
ret = 0;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-10-19 00:15:48 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-11-18 17:43:00 +08:00
|
|
|
static inline int __btrfs_block_rsv_refill(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 min_reserved, int flush)
|
2011-10-19 00:15:48 +08:00
|
|
|
{
|
|
|
|
u64 num_bytes = 0;
|
|
|
|
int ret = -ENOSPC;
|
|
|
|
|
|
|
|
if (!block_rsv)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
num_bytes = min_reserved;
|
2011-08-09 01:33:21 +08:00
|
|
|
if (block_rsv->reserved >= num_bytes)
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = 0;
|
2011-08-09 01:33:21 +08:00
|
|
|
else
|
2010-05-16 22:46:25 +08:00
|
|
|
num_bytes -= block_rsv->reserved;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
2011-08-09 01:33:21 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
|
2011-11-18 17:43:00 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
|
2011-08-09 00:50:18 +08:00
|
|
|
if (!ret) {
|
|
|
|
block_rsv_add_bytes(block_rsv, num_bytes, 0);
|
|
|
|
return 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-09-12 04:12:44 +08:00
|
|
|
|
2011-08-09 01:33:21 +08:00
|
|
|
return ret;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
2011-11-18 17:43:00 +08:00
|
|
|
int btrfs_block_rsv_refill(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 min_reserved)
|
|
|
|
{
|
|
|
|
return __btrfs_block_rsv_refill(root, block_rsv, min_reserved, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_block_rsv_refill_noflush(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 min_reserved)
|
|
|
|
{
|
|
|
|
return __btrfs_block_rsv_refill(root, block_rsv, min_reserved, 0);
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
|
|
|
|
struct btrfs_block_rsv *dst_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_block_rsv_release(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
|
|
|
if (global_rsv->full || global_rsv == block_rsv ||
|
|
|
|
block_rsv->space_info != global_rsv->space_info)
|
|
|
|
global_rsv = NULL;
|
|
|
|
block_rsv_release_bytes(block_rsv, global_rsv, num_bytes);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-16 22:49:58 +08:00
|
|
|
* helper to calculate size of global block reservation.
|
|
|
|
* the desired value is sum of space used by extent tree,
|
|
|
|
* checksum tree and root tree
|
2009-02-21 00:00:09 +08:00
|
|
|
*/
|
2010-05-16 22:49:58 +08:00
|
|
|
static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:49:58 +08:00
|
|
|
struct btrfs_space_info *sinfo;
|
|
|
|
u64 num_bytes;
|
|
|
|
u64 meta_used;
|
|
|
|
u64 data_used;
|
2011-04-13 21:41:04 +08:00
|
|
|
int csum_size = btrfs_super_csum_size(fs_info->super_copy);
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
data_used = sinfo->bytes_used;
|
|
|
|
spin_unlock(&sinfo->lock);
|
2009-09-23 02:45:50 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
|
|
|
spin_lock(&sinfo->lock);
|
2010-10-16 03:13:32 +08:00
|
|
|
if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA)
|
|
|
|
data_used = 0;
|
2010-05-16 22:49:58 +08:00
|
|
|
meta_used = sinfo->bytes_used;
|
|
|
|
spin_unlock(&sinfo->lock);
|
2010-03-19 22:38:13 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
|
|
|
|
csum_size * 2;
|
|
|
|
num_bytes += div64_u64(data_used + meta_used, 50);
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
if (num_bytes * 3 > meta_used)
|
|
|
|
num_bytes = div64_u64(meta_used, 3);
|
2010-03-19 22:38:13 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
|
|
|
|
struct btrfs_space_info *sinfo = block_rsv->space_info;
|
|
|
|
u64 num_bytes;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = calc_global_metadata_size(fs_info);
|
2009-09-23 02:45:50 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
spin_lock(&sinfo->lock);
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
block_rsv->size = num_bytes;
|
2009-02-20 23:59:53 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
num_bytes = sinfo->bytes_used + sinfo->bytes_pinned +
|
2010-10-16 03:13:32 +08:00
|
|
|
sinfo->bytes_reserved + sinfo->bytes_readonly +
|
|
|
|
sinfo->bytes_may_use;
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
if (sinfo->total_bytes > num_bytes) {
|
|
|
|
num_bytes = sinfo->total_bytes - num_bytes;
|
|
|
|
block_rsv->reserved += num_bytes;
|
2011-07-27 05:00:46 +08:00
|
|
|
sinfo->bytes_may_use += num_bytes;
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
if (block_rsv->reserved >= block_rsv->size) {
|
|
|
|
num_bytes = block_rsv->reserved - block_rsv->size;
|
2011-07-27 05:00:46 +08:00
|
|
|
sinfo->bytes_may_use -= num_bytes;
|
2011-03-12 20:08:42 +08:00
|
|
|
sinfo->reservation_progress++;
|
2010-05-16 22:49:58 +08:00
|
|
|
block_rsv->reserved = block_rsv->size;
|
|
|
|
block_rsv->full = 1;
|
|
|
|
}
|
2011-05-05 19:13:16 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
spin_unlock(&block_rsv->lock);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
|
|
|
|
fs_info->chunk_block_rsv.space_info = space_info;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
|
2010-05-16 22:49:58 +08:00
|
|
|
fs_info->global_block_rsv.space_info = space_info;
|
|
|
|
fs_info->delalloc_block_rsv.space_info = space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
fs_info->trans_block_rsv.space_info = space_info;
|
|
|
|
fs_info->empty_block_rsv.space_info = space_info;
|
2011-11-04 10:54:25 +08:00
|
|
|
fs_info->delayed_block_rsv.space_info = space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
|
|
|
|
fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
update_global_block_rsv(fs_info);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:49:58 +08:00
|
|
|
block_rsv_release_bytes(&fs_info->global_block_rsv, NULL, (u64)-1);
|
|
|
|
WARN_ON(fs_info->delalloc_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
|
|
|
|
WARN_ON(fs_info->trans_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->trans_block_rsv.reserved > 0);
|
|
|
|
WARN_ON(fs_info->chunk_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->chunk_block_rsv.reserved > 0);
|
2011-11-04 10:54:25 +08:00
|
|
|
WARN_ON(fs_info->delayed_block_rsv.size > 0);
|
|
|
|
WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
2009-02-21 00:00:09 +08:00
|
|
|
{
|
2010-05-16 22:48:46 +08:00
|
|
|
if (!trans->bytes_reserved)
|
|
|
|
return;
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2011-10-15 02:40:17 +08:00
|
|
|
btrfs_block_rsv_release(root, trans->block_rsv, trans->bytes_reserved);
|
2010-05-16 22:48:46 +08:00
|
|
|
trans->bytes_reserved = 0;
|
|
|
|
}
|
2009-02-21 00:00:09 +08:00
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_block_rsv *src_rsv = get_block_rsv(trans, root);
|
|
|
|
struct btrfs_block_rsv *dst_rsv = root->orphan_block_rsv;
|
|
|
|
|
|
|
|
/*
|
2011-05-03 22:40:22 +08:00
|
|
|
* We need to hold space in order to delete our orphan item once we've
|
|
|
|
* added it, so this takes the reservation so we can release it later
|
|
|
|
* when we are truly done with the orphan item.
|
2010-05-16 22:49:58 +08:00
|
|
|
*/
|
2011-05-28 19:00:39 +08:00
|
|
|
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
|
2010-05-16 22:49:58 +08:00
|
|
|
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
|
2009-02-21 00:00:09 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
void btrfs_orphan_release_metadata(struct inode *inode)
|
2009-04-22 05:40:57 +08:00
|
|
|
{
|
2010-05-16 22:49:58 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2011-05-28 19:00:39 +08:00
|
|
|
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
|
2010-05-16 22:49:58 +08:00
|
|
|
btrfs_block_rsv_release(root, root->orphan_block_rsv, num_bytes);
|
|
|
|
}
|
2009-04-22 05:40:57 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_pending_snapshot *pending)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = pending->root;
|
|
|
|
struct btrfs_block_rsv *src_rsv = get_block_rsv(trans, root);
|
|
|
|
struct btrfs_block_rsv *dst_rsv = &pending->block_rsv;
|
|
|
|
/*
|
|
|
|
* two for root back/forward refs, two for directory entries
|
|
|
|
* and one for root of the snapshot.
|
|
|
|
*/
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
u64 num_bytes = btrfs_calc_trans_metadata_size(root, 5);
|
2010-05-16 22:48:46 +08:00
|
|
|
dst_rsv->space_info = src_rsv->space_info;
|
|
|
|
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
|
2009-04-22 05:40:57 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* drop_outstanding_extent - drop an outstanding extent
|
|
|
|
* @inode: the inode we're dropping the extent for
|
|
|
|
*
|
|
|
|
* This is called when we are freeing up an outstanding extent, either called
|
|
|
|
* after an error or after an extent is written. This will return the number of
|
|
|
|
* reserved extents that need to be freed. This must be called with
|
|
|
|
* BTRFS_I(inode)->lock held.
|
|
|
|
*/
|
2011-07-15 23:16:44 +08:00
|
|
|
static unsigned drop_outstanding_extent(struct inode *inode)
|
|
|
|
{
|
2011-11-09 04:47:34 +08:00
|
|
|
unsigned drop_inode_space = 0;
|
2011-07-15 23:16:44 +08:00
|
|
|
unsigned dropped_extents = 0;
|
|
|
|
|
|
|
|
BUG_ON(!BTRFS_I(inode)->outstanding_extents);
|
|
|
|
BTRFS_I(inode)->outstanding_extents--;
|
|
|
|
|
2011-11-09 04:47:34 +08:00
|
|
|
if (BTRFS_I(inode)->outstanding_extents == 0 &&
|
|
|
|
BTRFS_I(inode)->delalloc_meta_reserved) {
|
|
|
|
drop_inode_space = 1;
|
|
|
|
BTRFS_I(inode)->delalloc_meta_reserved = 0;
|
|
|
|
}
|
|
|
|
|
2011-07-15 23:16:44 +08:00
|
|
|
/*
|
|
|
|
* If we have more or the same amount of outsanding extents than we have
|
|
|
|
* reserved then we need to leave the reserved extents count alone.
|
|
|
|
*/
|
|
|
|
if (BTRFS_I(inode)->outstanding_extents >=
|
|
|
|
BTRFS_I(inode)->reserved_extents)
|
2011-11-09 04:47:34 +08:00
|
|
|
return drop_inode_space;
|
2011-07-15 23:16:44 +08:00
|
|
|
|
|
|
|
dropped_extents = BTRFS_I(inode)->reserved_extents -
|
|
|
|
BTRFS_I(inode)->outstanding_extents;
|
|
|
|
BTRFS_I(inode)->reserved_extents -= dropped_extents;
|
2011-11-09 04:47:34 +08:00
|
|
|
return dropped_extents + drop_inode_space;
|
2011-07-15 23:16:44 +08:00
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* calc_csum_metadata_size - return the amount of metada space that must be
|
|
|
|
* reserved/free'd for the given bytes.
|
|
|
|
* @inode: the inode we're manipulating
|
|
|
|
* @num_bytes: the number of bytes in question
|
|
|
|
* @reserve: 1 if we are reserving space, 0 if we are freeing space
|
|
|
|
*
|
|
|
|
* This adjusts the number of csum_bytes in the inode and then returns the
|
|
|
|
* correct amount of metadata that must either be reserved or freed. We
|
|
|
|
* calculate how many checksums we can fit into one leaf and then divide the
|
|
|
|
* number of bytes that will need to be checksumed by this value to figure out
|
|
|
|
* how many checksums will be required. If we are adding bytes then the number
|
|
|
|
* may go up and we will return the number of additional bytes that must be
|
|
|
|
* reserved. If it is going down we will return the number of bytes that must
|
|
|
|
* be freed.
|
|
|
|
*
|
|
|
|
* This must be called with BTRFS_I(inode)->lock held.
|
|
|
|
*/
|
|
|
|
static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes,
|
|
|
|
int reserve)
|
2008-03-25 03:01:59 +08:00
|
|
|
{
|
2011-08-04 22:25:02 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
u64 csum_size;
|
|
|
|
int num_csums_per_leaf;
|
|
|
|
int num_csums;
|
|
|
|
int old_csums;
|
|
|
|
|
|
|
|
if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM &&
|
|
|
|
BTRFS_I(inode)->csum_bytes == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
old_csums = (int)div64_u64(BTRFS_I(inode)->csum_bytes, root->sectorsize);
|
|
|
|
if (reserve)
|
|
|
|
BTRFS_I(inode)->csum_bytes += num_bytes;
|
|
|
|
else
|
|
|
|
BTRFS_I(inode)->csum_bytes -= num_bytes;
|
|
|
|
csum_size = BTRFS_LEAF_DATA_SIZE(root) - sizeof(struct btrfs_item);
|
|
|
|
num_csums_per_leaf = (int)div64_u64(csum_size,
|
|
|
|
sizeof(struct btrfs_csum_item) +
|
|
|
|
sizeof(struct btrfs_disk_key));
|
|
|
|
num_csums = (int)div64_u64(BTRFS_I(inode)->csum_bytes, root->sectorsize);
|
|
|
|
num_csums = num_csums + num_csums_per_leaf - 1;
|
|
|
|
num_csums = num_csums / num_csums_per_leaf;
|
|
|
|
|
|
|
|
old_csums = old_csums + num_csums_per_leaf - 1;
|
|
|
|
old_csums = old_csums / num_csums_per_leaf;
|
|
|
|
|
|
|
|
/* No change, no need to reserve more */
|
|
|
|
if (old_csums == num_csums)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (reserve)
|
|
|
|
return btrfs_calc_trans_metadata_size(root,
|
|
|
|
num_csums - old_csums);
|
|
|
|
|
|
|
|
return btrfs_calc_trans_metadata_size(root, old_csums - num_csums);
|
2010-05-16 22:48:47 +08:00
|
|
|
}
|
2008-11-13 03:34:12 +08:00
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
|
2011-07-15 23:16:44 +08:00
|
|
|
u64 to_reserve = 0;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
u64 csum_bytes;
|
2011-07-15 23:16:44 +08:00
|
|
|
unsigned nr_extents = 0;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
int extra_reserve = 0;
|
2011-08-30 22:19:10 +08:00
|
|
|
int flush = 1;
|
2010-05-16 22:48:47 +08:00
|
|
|
int ret;
|
2008-03-25 03:01:59 +08:00
|
|
|
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
/* Need to be holding the i_mutex here if we aren't free space cache */
|
2011-08-30 22:19:10 +08:00
|
|
|
if (btrfs_is_free_space_inode(root, inode))
|
|
|
|
flush = 0;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
else
|
|
|
|
WARN_ON(!mutex_is_locked(&inode->i_mutex));
|
2011-08-30 22:19:10 +08:00
|
|
|
|
|
|
|
if (flush && btrfs_transaction_in_commit(root->fs_info))
|
2010-05-16 22:48:47 +08:00
|
|
|
schedule_timeout(1);
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
num_bytes = ALIGN(num_bytes, root->sectorsize);
|
2010-10-16 04:52:49 +08:00
|
|
|
|
2011-07-15 23:16:44 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
|
|
|
BTRFS_I(inode)->outstanding_extents++;
|
|
|
|
|
|
|
|
if (BTRFS_I(inode)->outstanding_extents >
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
BTRFS_I(inode)->reserved_extents)
|
2011-07-15 23:16:44 +08:00
|
|
|
nr_extents = BTRFS_I(inode)->outstanding_extents -
|
|
|
|
BTRFS_I(inode)->reserved_extents;
|
2011-01-26 05:30:38 +08:00
|
|
|
|
2011-11-09 04:47:34 +08:00
|
|
|
/*
|
|
|
|
* Add an item to reserve for updating the inode when we complete the
|
|
|
|
* delalloc io.
|
|
|
|
*/
|
|
|
|
if (!BTRFS_I(inode)->delalloc_meta_reserved) {
|
|
|
|
nr_extents++;
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
extra_reserve = 1;
|
2008-03-26 04:50:33 +08:00
|
|
|
}
|
2011-11-09 04:47:34 +08:00
|
|
|
|
|
|
|
to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
|
2011-08-04 22:25:02 +08:00
|
|
|
to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
csum_bytes = BTRFS_I(inode)->csum_bytes;
|
2011-07-15 23:16:44 +08:00
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2011-01-26 05:30:38 +08:00
|
|
|
|
2011-10-19 00:15:48 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
|
2011-07-15 23:16:44 +08:00
|
|
|
if (ret) {
|
2011-08-20 03:45:52 +08:00
|
|
|
u64 to_free = 0;
|
2011-07-15 23:16:44 +08:00
|
|
|
unsigned dropped;
|
2011-08-20 03:45:52 +08:00
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
2011-07-15 23:16:44 +08:00
|
|
|
dropped = drop_outstanding_extent(inode);
|
2011-08-20 03:45:52 +08:00
|
|
|
/*
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
* If the inodes csum_bytes is the same as the original
|
|
|
|
* csum_bytes then we know we haven't raced with any free()ers
|
|
|
|
* so we can just reduce our inodes csum bytes and carry on.
|
|
|
|
* Otherwise we have to do the normal free thing to account for
|
|
|
|
* the case that the free side didn't free up its reserve
|
|
|
|
* because of this outstanding reservation.
|
2011-08-20 03:45:52 +08:00
|
|
|
*/
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
if (BTRFS_I(inode)->csum_bytes == csum_bytes)
|
|
|
|
calc_csum_metadata_size(inode, num_bytes, 0);
|
|
|
|
else
|
|
|
|
to_free = calc_csum_metadata_size(inode, num_bytes, 0);
|
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
|
|
|
if (dropped)
|
|
|
|
to_free += btrfs_calc_trans_metadata_size(root, dropped);
|
|
|
|
|
2011-08-20 03:45:52 +08:00
|
|
|
if (to_free)
|
|
|
|
btrfs_block_rsv_release(root, block_rsv, to_free);
|
2010-05-16 22:48:47 +08:00
|
|
|
return ret;
|
2011-07-15 23:16:44 +08:00
|
|
|
}
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-10 00:18:51 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
|
|
|
if (extra_reserve) {
|
|
|
|
BTRFS_I(inode)->delalloc_meta_reserved = 1;
|
|
|
|
nr_extents--;
|
|
|
|
}
|
|
|
|
BTRFS_I(inode)->reserved_extents += nr_extents;
|
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
|
|
|
|
2010-05-16 22:48:47 +08:00
|
|
|
block_rsv_add_bytes(block_rsv, to_reserve, 1);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_release_metadata - release a metadata reservation for an inode
|
|
|
|
* @inode: the inode to release the reservation for
|
|
|
|
* @num_bytes: the number of bytes we're releasing
|
|
|
|
*
|
|
|
|
* This will release the metadata reservation for an inode. This can be called
|
|
|
|
* once we complete IO for a given set of bytes to release their metadata
|
|
|
|
* reservations.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2011-07-15 23:16:44 +08:00
|
|
|
u64 to_free = 0;
|
|
|
|
unsigned dropped;
|
2010-05-16 22:48:47 +08:00
|
|
|
|
|
|
|
num_bytes = ALIGN(num_bytes, root->sectorsize);
|
2011-08-04 22:25:02 +08:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
2011-07-15 23:16:44 +08:00
|
|
|
dropped = drop_outstanding_extent(inode);
|
2009-04-22 05:40:57 +08:00
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
to_free = calc_csum_metadata_size(inode, num_bytes, 0);
|
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2011-07-15 23:16:44 +08:00
|
|
|
if (dropped > 0)
|
|
|
|
to_free += btrfs_calc_trans_metadata_size(root, dropped);
|
2010-05-16 22:48:47 +08:00
|
|
|
|
|
|
|
btrfs_block_rsv_release(root, &root->fs_info->delalloc_block_rsv,
|
|
|
|
to_free);
|
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc
|
|
|
|
* @inode: inode we're writing to
|
|
|
|
* @num_bytes: the number of bytes we want to allocate
|
|
|
|
*
|
|
|
|
* This will do the following things
|
|
|
|
*
|
|
|
|
* o reserve space in the data space info for num_bytes
|
|
|
|
* o reserve space in the metadata space info based on number of outstanding
|
|
|
|
* extents and how much csums will be needed
|
|
|
|
* o add to the inodes ->delalloc_bytes
|
|
|
|
* o add it to the fs_info's delalloc inodes list.
|
|
|
|
*
|
|
|
|
* This will return 0 for success and -ENOSPC if there is no space left.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = btrfs_check_data_free_space(inode, num_bytes);
|
2009-01-06 10:25:51 +08:00
|
|
|
if (ret)
|
2010-05-16 22:48:47 +08:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = btrfs_delalloc_reserve_metadata(inode, num_bytes);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_free_reserved_data_space(inode, num_bytes);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-08-04 22:25:02 +08:00
|
|
|
/**
|
|
|
|
* btrfs_delalloc_release_space - release data and metadata space for delalloc
|
|
|
|
* @inode: inode we're releasing space for
|
|
|
|
* @num_bytes: the number of bytes we want to free up
|
|
|
|
*
|
|
|
|
* This must be matched with a call to btrfs_delalloc_reserve_space. This is
|
|
|
|
* called in the case that we don't need the metadata AND data reservations
|
|
|
|
* anymore. So if there is an error or we insert an inline extent.
|
|
|
|
*
|
|
|
|
* This function will release the metadata space that was not used and will
|
|
|
|
* decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
|
|
|
|
* list if there are no delalloc bytes left.
|
|
|
|
*/
|
2010-05-16 22:48:47 +08:00
|
|
|
void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes)
|
|
|
|
{
|
|
|
|
btrfs_delalloc_release_metadata(inode, num_bytes);
|
|
|
|
btrfs_free_reserved_data_space(inode, num_bytes);
|
2008-03-25 03:01:59 +08:00
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
static int update_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 bytenr, u64 num_bytes, int alloc)
|
2007-04-27 04:46:15 +08:00
|
|
|
{
|
2010-06-22 02:48:16 +08:00
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
2007-04-27 04:46:15 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
2007-10-16 04:15:53 +08:00
|
|
|
u64 total = num_bytes;
|
2007-04-27 04:46:15 +08:00
|
|
|
u64 old_val;
|
2007-10-16 04:15:53 +08:00
|
|
|
u64 byte_in_group;
|
2010-06-22 02:48:16 +08:00
|
|
|
int factor;
|
2007-05-08 08:03:49 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
/* block accounting for super block */
|
|
|
|
spin_lock(&info->delalloc_lock);
|
2011-04-13 21:41:04 +08:00
|
|
|
old_val = btrfs_super_bytes_used(info->super_copy);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (alloc)
|
|
|
|
old_val += num_bytes;
|
|
|
|
else
|
|
|
|
old_val -= num_bytes;
|
2011-04-13 21:41:04 +08:00
|
|
|
btrfs_set_super_bytes_used(info->super_copy, old_val);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
spin_unlock(&info->delalloc_lock);
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (total) {
|
2007-10-16 04:15:53 +08:00
|
|
|
cache = btrfs_lookup_block_group(info, bytenr);
|
2008-11-13 03:19:50 +08:00
|
|
|
if (!cache)
|
2007-04-27 04:46:15 +08:00
|
|
|
return -1;
|
2010-05-16 22:46:24 +08:00
|
|
|
if (cache->flags & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
2010-08-26 04:54:15 +08:00
|
|
|
/*
|
|
|
|
* If this block group has free space cache written out, we
|
|
|
|
* need to make sure to load it if we are removing space. This
|
|
|
|
* is because we need the unpinning stage to actually add the
|
|
|
|
* space back to the block group, otherwise we will leak space.
|
|
|
|
*/
|
|
|
|
if (!alloc && cache->cached == BTRFS_CACHE_NO)
|
2010-12-08 22:15:11 +08:00
|
|
|
cache_block_group(cache, trans, NULL, 1);
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
byte_in_group = bytenr - cache->key.objectid;
|
|
|
|
WARN_ON(byte_in_group > cache->key.offset);
|
2007-04-27 04:46:15 +08:00
|
|
|
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock(&cache->lock);
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2011-10-04 02:07:49 +08:00
|
|
|
if (btrfs_test_opt(root, SPACE_CACHE) &&
|
2010-06-22 02:48:16 +08:00
|
|
|
cache->disk_cache_state < BTRFS_DC_CLEAR)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
cache->dirty = 1;
|
2007-04-27 04:46:15 +08:00
|
|
|
old_val = btrfs_block_group_used(&cache->item);
|
2007-10-16 04:15:53 +08:00
|
|
|
num_bytes = min(total, cache->key.offset - byte_in_group);
|
2007-04-27 22:08:34 +08:00
|
|
|
if (alloc) {
|
2007-10-16 04:15:53 +08:00
|
|
|
old_val += num_bytes;
|
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, old_val);
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
cache->space_info->bytes_reserved -= num_bytes;
|
2010-05-16 22:46:24 +08:00
|
|
|
cache->space_info->bytes_used += num_bytes;
|
|
|
|
cache->space_info->disk_used += num_bytes * factor;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
2007-04-27 22:08:34 +08:00
|
|
|
} else {
|
2007-10-16 04:15:53 +08:00
|
|
|
old_val -= num_bytes;
|
2008-07-23 11:06:41 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, old_val);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->pinned += num_bytes;
|
|
|
|
cache->space_info->bytes_pinned += num_bytes;
|
2008-03-25 03:01:59 +08:00
|
|
|
cache->space_info->bytes_used -= num_bytes;
|
2010-05-16 22:46:24 +08:00
|
|
|
cache->space_info->disk_used -= num_bytes * factor;
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_unlock(&cache->lock);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
set_extent_dirty(info->pinned_extents,
|
|
|
|
bytenr, bytenr + num_bytes - 1,
|
|
|
|
GFP_NOFS | __GFP_NOFAIL);
|
2007-04-27 22:08:34 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2007-10-16 04:15:53 +08:00
|
|
|
total -= num_bytes;
|
|
|
|
bytenr += num_bytes;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
static u64 first_logical_byte(struct btrfs_root *root, u64 search_start)
|
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-12-12 05:30:39 +08:00
|
|
|
u64 bytenr;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
|
|
|
cache = btrfs_lookup_first_block_group(root->fs_info, search_start);
|
|
|
|
if (!cache)
|
2008-05-07 23:43:44 +08:00
|
|
|
return 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
bytenr = cache->key.objectid;
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
2008-12-12 05:30:39 +08:00
|
|
|
|
|
|
|
return bytenr;
|
2008-05-07 23:43:44 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static int pin_down_extent(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache,
|
|
|
|
u64 bytenr, u64 num_bytes, int reserved)
|
2007-11-17 03:57:08 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
cache->pinned += num_bytes;
|
|
|
|
cache->space_info->bytes_pinned += num_bytes;
|
|
|
|
if (reserved) {
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
cache->space_info->bytes_reserved -= num_bytes;
|
|
|
|
}
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&cache->space_info->lock);
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
set_extent_dirty(root->fs_info->pinned_extents, bytenr,
|
|
|
|
bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
|
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
/*
|
|
|
|
* this function must be called within transaction
|
|
|
|
*/
|
|
|
|
int btrfs_pin_extent(struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, int reserved)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 01:57:01 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, bytenr);
|
|
|
|
BUG_ON(!cache);
|
|
|
|
|
|
|
|
pin_down_extent(root, cache, bytenr, num_bytes, reserved);
|
|
|
|
|
|
|
|
btrfs_put_block_group(cache);
|
2009-09-12 04:11:19 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
/*
|
|
|
|
* this function must be called within transaction
|
|
|
|
*/
|
|
|
|
int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, bytenr);
|
|
|
|
BUG_ON(!cache);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pull in the free space cache (if any) so that our pin
|
|
|
|
* removes the free space from the cache. We have load_only set
|
|
|
|
* to one because the slow code to read in the free extents does check
|
|
|
|
* the pinned extents.
|
|
|
|
*/
|
|
|
|
cache_block_group(cache, trans, root, 1);
|
|
|
|
|
|
|
|
pin_down_extent(root, cache, bytenr, num_bytes, 0);
|
|
|
|
|
|
|
|
/* remove us from the free space cache (if we're there at all) */
|
|
|
|
btrfs_remove_free_space(cache, bytenr, num_bytes);
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
/**
|
|
|
|
* btrfs_update_reserved_bytes - update the block_group and space info counters
|
|
|
|
* @cache: The cache we are manipulating
|
|
|
|
* @num_bytes: The number of bytes in question
|
|
|
|
* @reserve: One of the reservation enums
|
|
|
|
*
|
|
|
|
* This is called by the allocator when it reserves space, or by somebody who is
|
|
|
|
* freeing space that was never actually used on disk. For example if you
|
|
|
|
* reserve some space for a new leaf in transaction A and before transaction A
|
|
|
|
* commits you free that leaf, you call this with reserve set to 0 in order to
|
|
|
|
* clear the reservation.
|
|
|
|
*
|
|
|
|
* Metadata reservations should be called with RESERVE_ALLOC so we do the proper
|
|
|
|
* ENOSPC accounting. For data we handle the reservation through clearing the
|
|
|
|
* delalloc bits in the io_tree. We have to do this since we could end up
|
|
|
|
* allocating less disk space for the amount of data we have reserved in the
|
|
|
|
* case of compression.
|
|
|
|
*
|
|
|
|
* If this is a reservation and the block group has become read only we cannot
|
|
|
|
* make the reservation and return -EAGAIN, otherwise this function always
|
|
|
|
* succeeds.
|
2010-05-16 22:46:25 +08:00
|
|
|
*/
|
2011-07-27 05:00:46 +08:00
|
|
|
static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes, int reserve)
|
2009-09-12 04:11:19 +08:00
|
|
|
{
|
2011-07-27 05:00:46 +08:00
|
|
|
struct btrfs_space_info *space_info = cache->space_info;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = 0;
|
2011-07-27 05:00:46 +08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
if (reserve != RESERVE_FREE) {
|
2010-05-16 22:46:25 +08:00
|
|
|
if (cache->ro) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
} else {
|
2011-07-27 05:00:46 +08:00
|
|
|
cache->reserved += num_bytes;
|
|
|
|
space_info->bytes_reserved += num_bytes;
|
|
|
|
if (reserve == RESERVE_ALLOC) {
|
|
|
|
BUG_ON(space_info->bytes_may_use < num_bytes);
|
|
|
|
space_info->bytes_may_use -= num_bytes;
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
2011-07-27 05:00:46 +08:00
|
|
|
} else {
|
|
|
|
if (cache->ro)
|
|
|
|
space_info->bytes_readonly += num_bytes;
|
|
|
|
cache->reserved -= num_bytes;
|
|
|
|
space_info->bytes_reserved -= num_bytes;
|
|
|
|
space_info->reservation_progress++;
|
2007-11-17 03:57:08 +08:00
|
|
|
}
|
2011-07-27 05:00:46 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
2007-11-17 03:57:08 +08:00
|
|
|
}
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root)
|
2008-09-26 22:05:48 +08:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *next;
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2008-09-26 22:05:48 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&fs_info->extent_commit_sem);
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
list_for_each_entry_safe(caching_ctl, next,
|
|
|
|
&fs_info->caching_block_groups, list) {
|
|
|
|
cache = caching_ctl->block_group;
|
|
|
|
if (block_group_cache_done(cache)) {
|
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
|
|
|
list_del_init(&caching_ctl->list);
|
|
|
|
put_caching_control(caching_ctl);
|
2008-09-26 22:05:48 +08:00
|
|
|
} else {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = caching_ctl->progress;
|
2008-09-26 22:05:48 +08:00
|
|
|
}
|
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (fs_info->pinned_extents == &fs_info->freed_extents[0])
|
|
|
|
fs_info->pinned_extents = &fs_info->freed_extents[1];
|
|
|
|
else
|
|
|
|
fs_info->pinned_extents = &fs_info->freed_extents[0];
|
|
|
|
|
|
|
|
up_write(&fs_info->extent_commit_sem);
|
2010-05-16 22:49:58 +08:00
|
|
|
|
|
|
|
update_global_block_rsv(fs_info);
|
2008-09-26 22:05:48 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
|
2007-06-29 03:57:36 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
|
|
|
u64 len;
|
2007-06-29 03:57:36 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
while (start <= end) {
|
|
|
|
if (!cache ||
|
|
|
|
start >= cache->key.objectid + cache->key.offset) {
|
|
|
|
if (cache)
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, start);
|
|
|
|
BUG_ON(!cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
len = cache->key.objectid + cache->key.offset - start;
|
|
|
|
len = min(len, end + 1 - start);
|
|
|
|
|
|
|
|
if (start < cache->last_byte_to_unpin) {
|
|
|
|
len = min(len, cache->last_byte_to_unpin - start);
|
|
|
|
btrfs_add_free_space(cache, start, len);
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
start += len;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
cache->pinned -= len;
|
|
|
|
cache->space_info->bytes_pinned -= len;
|
2011-08-05 22:25:38 +08:00
|
|
|
if (cache->ro)
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->space_info->bytes_readonly += len;
|
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&cache->space_info->lock);
|
2007-06-29 03:57:36 +08:00
|
|
|
}
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
if (cache)
|
|
|
|
btrfs_put_block_group(cache);
|
2007-06-29 03:57:36 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_root *root)
|
2007-03-07 09:08:01 +08:00
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct extent_io_tree *unpin;
|
2007-10-16 04:15:26 +08:00
|
|
|
u64 start;
|
|
|
|
u64 end;
|
2007-03-07 09:08:01 +08:00
|
|
|
int ret;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (fs_info->pinned_extents == &fs_info->freed_extents[0])
|
|
|
|
unpin = &fs_info->freed_extents[1];
|
|
|
|
else
|
|
|
|
unpin = &fs_info->freed_extents[0];
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2007-10-16 04:15:26 +08:00
|
|
|
ret = find_first_extent_bit(unpin, 0, &start, &end,
|
|
|
|
EXTENT_DIRTY);
|
|
|
|
if (ret)
|
2007-03-07 09:08:01 +08:00
|
|
|
break;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
if (btrfs_test_opt(root, DISCARD))
|
|
|
|
ret = btrfs_discard_extent(root, start,
|
|
|
|
end + 1 - start, NULL);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2007-10-16 04:15:26 +08:00
|
|
|
clear_extent_dirty(unpin, start, end, GFP_NOFS);
|
2009-09-12 04:11:19 +08:00
|
|
|
unpin_extent_range(root, start, end);
|
2009-03-13 23:00:37 +08:00
|
|
|
cond_resched();
|
2007-03-07 09:08:01 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2007-03-23 00:13:20 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
|
|
|
u64 root_objectid, u64 owner_objectid,
|
|
|
|
u64 owner_offset, int refs_to_drop,
|
|
|
|
struct btrfs_delayed_extent_op *extent_op)
|
2007-03-07 09:08:01 +08:00
|
|
|
{
|
2007-03-13 04:22:34 +08:00
|
|
|
struct btrfs_key key;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_path *path;
|
2007-03-21 08:35:03 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
|
|
|
struct btrfs_root *extent_root = info->extent_root;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_item *ei;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
2007-03-07 09:08:01 +08:00
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int is_data;
|
2008-02-19 05:33:44 +08:00
|
|
|
int extent_slot = 0;
|
|
|
|
int found_extent = 0;
|
|
|
|
int num_to_del = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 item_size;
|
|
|
|
u64 refs;
|
2007-03-08 00:50:24 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
2007-06-23 02:16:25 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2007-04-05 22:38:44 +08:00
|
|
|
|
2008-04-22 00:01:38 +08:00
|
|
|
path->reada = 1;
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
|
|
|
|
BUG_ON(!is_data && refs_to_drop != 1);
|
|
|
|
|
|
|
|
ret = lookup_extent_backref(trans, extent_root, path, &iref,
|
|
|
|
bytenr, num_bytes, parent,
|
|
|
|
root_objectid, owner_objectid,
|
|
|
|
owner_offset);
|
2007-12-11 22:25:06 +08:00
|
|
|
if (ret == 0) {
|
2008-02-19 05:33:44 +08:00
|
|
|
extent_slot = path->slots[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
while (extent_slot >= 0) {
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &key,
|
2008-02-19 05:33:44 +08:00
|
|
|
extent_slot);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.objectid != bytenr)
|
2008-02-19 05:33:44 +08:00
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (key.type == BTRFS_EXTENT_ITEM_KEY &&
|
|
|
|
key.offset == num_bytes) {
|
2008-02-19 05:33:44 +08:00
|
|
|
found_extent = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (path->slots[0] - extent_slot > 5)
|
|
|
|
break;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
extent_slot--;
|
2008-02-19 05:33:44 +08:00
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
item_size = btrfs_item_size_nr(path->nodes[0], extent_slot);
|
|
|
|
if (found_extent && item_size < sizeof(*ei))
|
|
|
|
found_extent = 0;
|
|
|
|
#endif
|
2008-09-24 01:14:14 +08:00
|
|
|
if (!found_extent) {
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
BUG_ON(iref);
|
2009-03-13 22:10:06 +08:00
|
|
|
ret = remove_extent_backref(trans, extent_root, path,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
NULL, refs_to_drop,
|
|
|
|
is_data);
|
2008-09-24 01:14:14 +08:00
|
|
|
BUG_ON(ret);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
|
2008-09-24 01:14:14 +08:00
|
|
|
ret = btrfs_search_slot(trans, extent_root,
|
|
|
|
&key, path, -1, 1);
|
2008-11-13 03:19:50 +08:00
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "umm, got %d back from search"
|
2009-01-06 10:25:51 +08:00
|
|
|
", was looking for %llu\n", ret,
|
|
|
|
(unsigned long long)bytenr);
|
2011-07-13 23:03:50 +08:00
|
|
|
if (ret > 0)
|
|
|
|
btrfs_print_leaf(extent_root,
|
|
|
|
path->nodes[0]);
|
2008-11-13 03:19:50 +08:00
|
|
|
}
|
2008-09-24 01:14:14 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
extent_slot = path->slots[0];
|
|
|
|
}
|
2007-12-11 22:25:06 +08:00
|
|
|
} else {
|
|
|
|
btrfs_print_leaf(extent_root, path->nodes[0]);
|
|
|
|
WARN_ON(1);
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "btrfs unable to find ref byte nr %llu "
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
"parent %llu root %llu owner %llu offset %llu\n",
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)bytenr,
|
2009-03-13 22:10:06 +08:00
|
|
|
(unsigned long long)parent,
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)root_objectid,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
(unsigned long long)owner_objectid,
|
|
|
|
(unsigned long long)owner_offset);
|
2007-12-11 22:25:06 +08:00
|
|
|
}
|
2007-10-16 04:14:19 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
item_size = btrfs_item_size_nr(leaf, extent_slot);
|
|
|
|
#ifdef BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
if (item_size < sizeof(*ei)) {
|
|
|
|
BUG_ON(found_extent || extent_slot != path->slots[0]);
|
|
|
|
ret = convert_extent_item_v0(trans, extent_root, path,
|
|
|
|
owner_objectid, 0);
|
|
|
|
BUG_ON(ret < 0);
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
|
|
|
|
key.objectid = bytenr;
|
|
|
|
key.type = BTRFS_EXTENT_ITEM_KEY;
|
|
|
|
key.offset = num_bytes;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, extent_root, &key, path,
|
|
|
|
-1, 1);
|
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "umm, got %d back from search"
|
|
|
|
", was looking for %llu\n", ret,
|
|
|
|
(unsigned long long)bytenr);
|
|
|
|
btrfs_print_leaf(extent_root, path->nodes[0]);
|
|
|
|
}
|
|
|
|
BUG_ON(ret);
|
|
|
|
extent_slot = path->slots[0];
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
item_size = btrfs_item_size_nr(leaf, extent_slot);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
BUG_ON(item_size < sizeof(*ei));
|
2008-02-19 05:33:44 +08:00
|
|
|
ei = btrfs_item_ptr(leaf, extent_slot,
|
2007-03-15 02:14:43 +08:00
|
|
|
struct btrfs_extent_item);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (owner_objectid < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
struct btrfs_tree_block_info *bi;
|
|
|
|
BUG_ON(item_size < sizeof(*ei) + sizeof(*bi));
|
|
|
|
bi = (struct btrfs_tree_block_info *)(ei + 1);
|
|
|
|
WARN_ON(owner_objectid != btrfs_tree_block_level(leaf, bi));
|
|
|
|
}
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
refs = btrfs_extent_refs(leaf, ei);
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(refs < refs_to_drop);
|
|
|
|
refs -= refs_to_drop;
|
2007-10-16 04:14:19 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (refs > 0) {
|
|
|
|
if (extent_op)
|
|
|
|
__run_delayed_extent_op(extent_op, leaf, ei);
|
|
|
|
/*
|
|
|
|
* In the case of inline back ref, reference count will
|
|
|
|
* be updated by remove_extent_backref
|
2008-02-19 05:33:44 +08:00
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (iref) {
|
|
|
|
BUG_ON(!found_extent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_refs(leaf, ei, refs);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
}
|
|
|
|
if (found_extent) {
|
|
|
|
ret = remove_extent_backref(trans, extent_root, path,
|
|
|
|
iref, refs_to_drop,
|
|
|
|
is_data);
|
2008-02-19 05:33:44 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
if (found_extent) {
|
|
|
|
BUG_ON(is_data && refs_to_drop !=
|
|
|
|
extent_data_ref_count(root, path, iref));
|
|
|
|
if (iref) {
|
|
|
|
BUG_ON(path->slots[0] != extent_slot);
|
|
|
|
} else {
|
|
|
|
BUG_ON(path->slots[0] != extent_slot + 1);
|
|
|
|
path->slots[0] = extent_slot;
|
|
|
|
num_to_del = 2;
|
|
|
|
}
|
2007-03-25 23:35:08 +08:00
|
|
|
}
|
2009-03-13 23:00:37 +08:00
|
|
|
|
2008-02-19 05:33:44 +08:00
|
|
|
ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
|
|
|
|
num_to_del);
|
2008-09-24 01:14:14 +08:00
|
|
|
BUG_ON(ret);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-08-12 21:13:26 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (is_data) {
|
2008-12-10 22:10:46 +08:00
|
|
|
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
|
|
|
|
BUG_ON(ret);
|
2009-04-01 01:47:50 +08:00
|
|
|
} else {
|
|
|
|
invalidate_mapping_pages(info->btree_inode->i_mapping,
|
|
|
|
bytenr >> PAGE_CACHE_SHIFT,
|
|
|
|
(bytenr + num_bytes - 1) >> PAGE_CACHE_SHIFT);
|
2008-12-10 22:10:46 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = update_block_group(trans, root, bytenr, num_bytes, 0);
|
2008-12-17 02:51:01 +08:00
|
|
|
BUG_ON(ret);
|
2007-03-07 09:08:01 +08:00
|
|
|
}
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2007-03-07 09:08:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* when we free an block, it is possible (and likely) that we free the last
|
2009-03-13 22:11:24 +08:00
|
|
|
* delayed ref for that extent as well. This searches the delayed ref tree for
|
|
|
|
* a given extent, and if there are no other delayed refs to be processed, it
|
|
|
|
* removes it from the tree.
|
|
|
|
*/
|
|
|
|
static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_head *head;
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
struct btrfs_delayed_ref_node *ref;
|
|
|
|
struct rb_node *node;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = 0;
|
2009-03-13 22:11:24 +08:00
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
spin_lock(&delayed_refs->lock);
|
|
|
|
head = btrfs_find_delayed_ref_head(trans, bytenr);
|
|
|
|
if (!head)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
node = rb_prev(&head->node.rb_node);
|
|
|
|
if (!node)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
|
|
|
|
|
|
|
|
/* there are still entries for this ref, we can't drop it */
|
|
|
|
if (ref->bytenr == bytenr)
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (head->extent_op) {
|
|
|
|
if (!head->must_insert_reserved)
|
|
|
|
goto out;
|
|
|
|
kfree(head->extent_op);
|
|
|
|
head->extent_op = NULL;
|
|
|
|
}
|
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
/*
|
|
|
|
* waiting for the lock here would deadlock. If someone else has it
|
|
|
|
* locked they are already in the process of dropping it anyway
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&head->mutex))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* at this point we have a head with no other entries. Go
|
|
|
|
* ahead and process it.
|
|
|
|
*/
|
|
|
|
head->node.in_tree = 0;
|
|
|
|
rb_erase(&head->node.rb_node, &delayed_refs->root);
|
2009-03-13 22:17:05 +08:00
|
|
|
|
2009-03-13 22:11:24 +08:00
|
|
|
delayed_refs->num_entries--;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we don't take a ref on the node because we're removing it from the
|
|
|
|
* tree, so we just steal the ref the tree was holding.
|
|
|
|
*/
|
2009-03-13 22:17:05 +08:00
|
|
|
delayed_refs->num_heads--;
|
|
|
|
if (list_empty(&head->cluster))
|
|
|
|
delayed_refs->num_heads_ready--;
|
|
|
|
|
|
|
|
list_del_init(&head->cluster);
|
2009-03-13 22:11:24 +08:00
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(head->extent_op);
|
|
|
|
if (head->must_insert_reserved)
|
|
|
|
ret = 1;
|
|
|
|
|
|
|
|
mutex_unlock(&head->mutex);
|
2009-03-13 22:11:24 +08:00
|
|
|
btrfs_put_delayed_ref(&head->node);
|
2010-05-16 22:46:25 +08:00
|
|
|
return ret;
|
2009-03-13 22:11:24 +08:00
|
|
|
out:
|
|
|
|
spin_unlock(&delayed_refs->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
|
|
|
u64 parent, int last_ref)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
ret = btrfs_add_delayed_tree_ref(trans, buf->start, buf->len,
|
|
|
|
parent, root->root_key.objectid,
|
|
|
|
btrfs_header_level(buf),
|
|
|
|
BTRFS_DROP_DELAYED_REF, NULL);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!last_ref)
|
|
|
|
return;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, buf->start);
|
|
|
|
|
|
|
|
if (btrfs_header_generation(buf) == trans->transid) {
|
|
|
|
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
ret = check_ref_cleanup(trans, root, buf->start);
|
|
|
|
if (!ret)
|
2011-08-05 22:25:38 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
|
|
|
|
pin_down_extent(root, cache, buf->start, buf->len, 1);
|
2011-08-05 22:25:38 +08:00
|
|
|
goto out;
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
|
|
|
|
|
|
|
|
btrfs_add_free_space(cache, buf->start, buf->len);
|
2011-07-27 05:00:46 +08:00
|
|
|
btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE);
|
2010-05-16 22:46:25 +08:00
|
|
|
}
|
|
|
|
out:
|
2011-03-17 01:42:43 +08:00
|
|
|
/*
|
|
|
|
* Deleting the buffer, clear the corrupt flag since it doesn't matter
|
|
|
|
* anymore.
|
|
|
|
*/
|
|
|
|
clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
}
|
|
|
|
|
2008-06-26 04:01:30 +08:00
|
|
|
int btrfs_free_extent(struct btrfs_trans_handle *trans,
|
2008-09-24 01:14:14 +08:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u64 root_objectid, u64 owner, u64 offset)
|
2008-06-26 04:01:30 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2009-03-13 22:10:06 +08:00
|
|
|
/*
|
|
|
|
* tree log blocks never actually go into the extent allocation
|
|
|
|
* tree, just update pinning info and exit early.
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (root_objectid == BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID);
|
2009-03-13 23:00:37 +08:00
|
|
|
/* unlocks the pinned mutex */
|
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_pin_extent(root, bytenr, num_bytes, 1);
|
2009-03-13 22:10:06 +08:00
|
|
|
ret = 0;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
ret = btrfs_add_delayed_tree_ref(trans, bytenr, num_bytes,
|
|
|
|
parent, root_objectid, (int)owner,
|
|
|
|
BTRFS_DROP_DELAYED_REF, NULL);
|
2009-03-13 22:11:24 +08:00
|
|
|
BUG_ON(ret);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
} else {
|
|
|
|
ret = btrfs_add_delayed_data_ref(trans, bytenr, num_bytes,
|
|
|
|
parent, root_objectid, owner,
|
|
|
|
offset, BTRFS_DROP_DELAYED_REF, NULL);
|
|
|
|
BUG_ON(ret);
|
2009-03-13 22:10:06 +08:00
|
|
|
}
|
2008-06-26 04:01:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2007-12-01 00:30:34 +08:00
|
|
|
static u64 stripe_align(struct btrfs_root *root, u64 val)
|
|
|
|
{
|
|
|
|
u64 mask = ((u64)root->stripesize - 1);
|
|
|
|
u64 ret = (val + mask) & ~mask;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* when we wait for progress in the block group caching, its because
|
|
|
|
* our allocation attempt failed at least once. So, we must sleep
|
|
|
|
* and let some progress happen before we try again.
|
|
|
|
*
|
|
|
|
* This function will sleep at least once waiting for new free space to
|
|
|
|
* show up, and then it will check the block group free space numbers
|
|
|
|
* for our min num_bytes. Another option is to have it go ahead
|
|
|
|
* and look in the rbtree for a free extent of a given size, but this
|
|
|
|
* is a good start.
|
|
|
|
*/
|
|
|
|
static noinline int
|
|
|
|
wait_block_group_cache_progress(struct btrfs_block_group_cache *cache,
|
|
|
|
u64 num_bytes)
|
|
|
|
{
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl = get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return 0;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
|
2011-03-29 13:46:06 +08:00
|
|
|
(cache->free_space_ctl->free_space >= num_bytes));
|
2009-09-12 04:11:19 +08:00
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int
|
|
|
|
wait_block_group_cache_done(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
|
|
|
|
caching_ctl = get_caching_control(cache);
|
|
|
|
if (!caching_ctl)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
wait_event(caching_ctl->wait, block_group_cache_done(cache));
|
|
|
|
|
|
|
|
put_caching_control(caching_ctl);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
static int get_block_group_index(struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
int index;
|
|
|
|
if (cache->flags & BTRFS_BLOCK_GROUP_RAID10)
|
|
|
|
index = 0;
|
|
|
|
else if (cache->flags & BTRFS_BLOCK_GROUP_RAID1)
|
|
|
|
index = 1;
|
|
|
|
else if (cache->flags & BTRFS_BLOCK_GROUP_DUP)
|
|
|
|
index = 2;
|
|
|
|
else if (cache->flags & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
index = 3;
|
|
|
|
else
|
|
|
|
index = 4;
|
|
|
|
return index;
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
enum btrfs_loop_type {
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
LOOP_FIND_IDEAL = 0,
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
LOOP_CACHING_NOWAIT = 1,
|
|
|
|
LOOP_CACHING_WAIT = 2,
|
|
|
|
LOOP_ALLOC_CHUNK = 3,
|
|
|
|
LOOP_NO_EMPTY_SIZE = 4,
|
|
|
|
};
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
/*
|
|
|
|
* walks the btree of allocated extents and find a hole of a given size.
|
|
|
|
* The key ins is changed to record the hole:
|
|
|
|
* ins->objectid == block start
|
2007-03-16 00:56:47 +08:00
|
|
|
* ins->flags = BTRFS_EXTENT_ITEM_KEY
|
2007-02-26 23:40:21 +08:00
|
|
|
* ins->offset == number of blocks
|
|
|
|
* Any available blocks before search_start are skipped.
|
|
|
|
*/
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int find_free_extent(struct btrfs_trans_handle *trans,
|
2008-01-03 23:01:48 +08:00
|
|
|
struct btrfs_root *orig_root,
|
|
|
|
u64 num_bytes, u64 empty_size,
|
|
|
|
u64 search_start, u64 search_end,
|
|
|
|
u64 hint_byte, struct btrfs_key *ins,
|
2011-06-19 04:26:38 +08:00
|
|
|
u64 data)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
int ret = 0;
|
2009-01-06 10:25:51 +08:00
|
|
|
struct btrfs_root *root = orig_root->fs_info->extent_root;
|
2009-04-03 21:47:43 +08:00
|
|
|
struct btrfs_free_cluster *last_ptr = NULL;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_block_group_cache *block_group = NULL;
|
2011-12-08 09:08:40 +08:00
|
|
|
struct btrfs_block_group_cache *used_block_group;
|
2008-03-25 03:02:07 +08:00
|
|
|
int empty_cluster = 2 * 1024 * 1024;
|
2008-05-25 02:04:53 +08:00
|
|
|
int allowed_chunk_alloc = 0;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
int done_chunk_alloc = 0;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-04-03 21:47:43 +08:00
|
|
|
int loop = 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
int index = 0;
|
2011-07-27 05:00:46 +08:00
|
|
|
int alloc_type = (data & BTRFS_BLOCK_GROUP_DATA) ?
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
bool found_uncached_bg = false;
|
2009-09-12 04:11:20 +08:00
|
|
|
bool failed_cluster_refill = false;
|
2009-10-06 22:04:28 +08:00
|
|
|
bool failed_alloc = false;
|
2010-09-17 04:19:09 +08:00
|
|
|
bool use_cluster = true;
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
bool have_caching_bg = false;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
u64 ideal_cache_percent = 0;
|
|
|
|
u64 ideal_cache_offset = 0;
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
WARN_ON(num_bytes < root->sectorsize);
|
2007-04-05 03:27:52 +08:00
|
|
|
btrfs_set_key_type(ins, BTRFS_EXTENT_ITEM_KEY);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
ins->objectid = 0;
|
|
|
|
ins->offset = 0;
|
2007-04-05 03:27:52 +08:00
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
space_info = __find_space_info(root->fs_info, data);
|
2010-03-20 04:49:55 +08:00
|
|
|
if (!space_info) {
|
2011-06-19 04:26:38 +08:00
|
|
|
printk(KERN_ERR "No space info for %llu\n", data);
|
2010-03-20 04:49:55 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
/*
|
|
|
|
* If the space info is for both data and metadata it means we have a
|
|
|
|
* small filesystem and we can't use the clustering stuff.
|
|
|
|
*/
|
|
|
|
if (btrfs_mixed_space_info(space_info))
|
|
|
|
use_cluster = false;
|
|
|
|
|
2008-05-25 02:04:53 +08:00
|
|
|
if (orig_root->ref_cows || empty_size)
|
|
|
|
allowed_chunk_alloc = 1;
|
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
|
2009-04-03 21:47:43 +08:00
|
|
|
last_ptr = &root->fs_info->meta_alloc_cluster;
|
2009-02-12 22:41:38 +08:00
|
|
|
if (!btrfs_test_opt(root, SSD))
|
|
|
|
empty_cluster = 64 * 1024;
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
|
|
|
|
2010-09-17 04:19:09 +08:00
|
|
|
if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
|
|
|
|
btrfs_test_opt(root, SSD)) {
|
2009-04-03 21:47:43 +08:00
|
|
|
last_ptr = &root->fs_info->data_alloc_cluster;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-03-25 03:02:07 +08:00
|
|
|
if (last_ptr) {
|
2009-04-03 21:47:43 +08:00
|
|
|
spin_lock(&last_ptr->lock);
|
|
|
|
if (last_ptr->block_group)
|
|
|
|
hint_byte = last_ptr->window_start;
|
|
|
|
spin_unlock(&last_ptr->lock);
|
2008-03-25 03:02:07 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
|
2008-05-07 23:43:44 +08:00
|
|
|
search_start = max(search_start, first_logical_byte(root, 0));
|
2008-03-25 03:02:07 +08:00
|
|
|
search_start = max(search_start, hint_byte);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (!last_ptr)
|
2009-04-03 21:47:43 +08:00
|
|
|
empty_cluster = 0;
|
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
if (search_start == hint_byte) {
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
ideal_cache:
|
2009-04-03 22:14:19 +08:00
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info,
|
|
|
|
search_start);
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = block_group;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* we don't want to use the block group if it doesn't match our
|
|
|
|
* allocation bits, or if its not cached.
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
*
|
|
|
|
* However if we are re-searching with an ideal block group
|
|
|
|
* picked out then we don't care that the block group is cached.
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
*/
|
|
|
|
if (block_group && block_group_bits(block_group, data) &&
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
(block_group->cached != BTRFS_CACHE_NO ||
|
|
|
|
search_start == ideal_cache_offset)) {
|
2009-04-03 22:14:19 +08:00
|
|
|
down_read(&space_info->groups_sem);
|
2009-06-05 03:34:51 +08:00
|
|
|
if (list_empty(&block_group->list) ||
|
|
|
|
block_group->ro) {
|
|
|
|
/*
|
|
|
|
* someone is removing this block group,
|
|
|
|
* we can't jump into the have_block_group
|
|
|
|
* target because our list pointers are not
|
|
|
|
* valid
|
|
|
|
*/
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
up_read(&space_info->groups_sem);
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
} else {
|
2010-05-16 22:46:24 +08:00
|
|
|
index = get_block_group_index(block_group);
|
2009-06-05 03:34:51 +08:00
|
|
|
goto have_block_group;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
} else if (block_group) {
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
2008-11-08 07:17:11 +08:00
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
search:
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
have_caching_bg = false;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_read(&space_info->groups_sem);
|
2010-05-16 22:46:24 +08:00
|
|
|
list_for_each_entry(block_group, &space_info->block_groups[index],
|
|
|
|
list) {
|
2009-04-03 22:14:18 +08:00
|
|
|
u64 offset;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
int cached;
|
2008-11-11 05:13:54 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = block_group;
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_get_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
search_start = block_group->key.objectid;
|
2008-11-08 07:17:11 +08:00
|
|
|
|
2010-12-14 04:06:46 +08:00
|
|
|
/*
|
|
|
|
* this can happen if we end up cycling through all the
|
|
|
|
* raid types, but we want to make sure we only allocate
|
|
|
|
* for the proper type.
|
|
|
|
*/
|
|
|
|
if (!block_group_bits(block_group, data)) {
|
|
|
|
u64 extra = BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if they asked for extra copies and this block group
|
|
|
|
* doesn't provide them, bail. This does allow us to
|
|
|
|
* fill raid0 from raid1.
|
|
|
|
*/
|
|
|
|
if ((data & extra) && !(block_group->flags & extra))
|
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
have_block_group:
|
2011-11-15 02:52:14 +08:00
|
|
|
cached = block_group_cache_done(block_group);
|
|
|
|
if (unlikely(!cached)) {
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
u64 free_percent;
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
found_uncached_bg = true;
|
2010-12-08 22:15:11 +08:00
|
|
|
ret = cache_block_group(block_group, trans,
|
|
|
|
orig_root, 1);
|
2010-08-26 04:54:15 +08:00
|
|
|
if (block_group->cached == BTRFS_CACHE_FINISHED)
|
2011-11-15 02:52:14 +08:00
|
|
|
goto alloc;
|
2010-08-26 04:54:15 +08:00
|
|
|
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
free_percent = btrfs_block_group_used(&block_group->item);
|
|
|
|
free_percent *= 100;
|
|
|
|
free_percent = div64_u64(free_percent,
|
|
|
|
block_group->key.offset);
|
|
|
|
free_percent = 100 - free_percent;
|
|
|
|
if (free_percent > ideal_cache_percent &&
|
|
|
|
likely(!block_group->ro)) {
|
|
|
|
ideal_cache_offset = block_group->key.objectid;
|
|
|
|
ideal_cache_percent = free_percent;
|
|
|
|
}
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
2011-07-01 02:42:28 +08:00
|
|
|
* The caching workers are limited to 2 threads, so we
|
|
|
|
* can queue as much work as we care to.
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
*/
|
2011-07-01 02:42:28 +08:00
|
|
|
if (loop > LOOP_FIND_IDEAL) {
|
2010-12-08 22:15:11 +08:00
|
|
|
ret = cache_block_group(block_group, trans,
|
|
|
|
orig_root, 0);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
BUG_ON(ret);
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
/*
|
|
|
|
* If loop is set for cached only, try the next block
|
|
|
|
* group.
|
|
|
|
*/
|
|
|
|
if (loop == LOOP_FIND_IDEAL)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
|
2011-11-15 02:52:14 +08:00
|
|
|
alloc:
|
2008-11-21 01:16:16 +08:00
|
|
|
if (unlikely(block_group->ro))
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2011-05-28 19:00:39 +08:00
|
|
|
spin_lock(&block_group->free_space_ctl->tree_lock);
|
2011-05-13 23:07:12 +08:00
|
|
|
if (cached &&
|
2011-05-28 19:00:39 +08:00
|
|
|
block_group->free_space_ctl->free_space <
|
2011-12-01 02:43:00 +08:00
|
|
|
num_bytes + empty_cluster + empty_size) {
|
2011-05-28 19:00:39 +08:00
|
|
|
spin_unlock(&block_group->free_space_ctl->tree_lock);
|
2011-05-13 23:07:12 +08:00
|
|
|
goto loop;
|
|
|
|
}
|
2011-05-28 19:00:39 +08:00
|
|
|
spin_unlock(&block_group->free_space_ctl->tree_lock);
|
2011-05-13 23:07:12 +08:00
|
|
|
|
2009-09-12 04:11:20 +08:00
|
|
|
/*
|
2011-12-08 08:50:42 +08:00
|
|
|
* Ok we want to try and use the cluster allocator, so
|
|
|
|
* lets look there
|
2009-09-12 04:11:20 +08:00
|
|
|
*/
|
2011-12-08 08:50:42 +08:00
|
|
|
if (last_ptr) {
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* the refill lock keeps out other
|
|
|
|
* people trying to start a new cluster
|
|
|
|
*/
|
|
|
|
spin_lock(&last_ptr->refill_lock);
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group = last_ptr->block_group;
|
|
|
|
if (used_block_group != block_group &&
|
|
|
|
(!used_block_group ||
|
|
|
|
used_block_group->ro ||
|
|
|
|
!block_group_bits(used_block_group, data))) {
|
|
|
|
used_block_group = block_group;
|
2009-06-05 03:34:51 +08:00
|
|
|
goto refill_cluster;
|
2011-12-08 09:08:40 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_get_block_group(used_block_group);
|
2009-06-05 03:34:51 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
offset = btrfs_alloc_from_cluster(used_block_group,
|
|
|
|
last_ptr, num_bytes, used_block_group->key.objectid);
|
2009-04-03 21:47:43 +08:00
|
|
|
if (offset) {
|
|
|
|
/* we have a block, we're done */
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
goto checks;
|
|
|
|
}
|
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
WARN_ON(last_ptr->block_group != used_block_group);
|
|
|
|
if (used_block_group != block_group) {
|
|
|
|
btrfs_put_block_group(used_block_group);
|
|
|
|
used_block_group = block_group;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
2009-06-05 03:34:51 +08:00
|
|
|
refill_cluster:
|
2011-12-08 09:08:40 +08:00
|
|
|
BUG_ON(used_block_group != block_group);
|
2011-12-08 08:50:42 +08:00
|
|
|
/* If we are on LOOP_NO_EMPTY_SIZE, we can't
|
|
|
|
* set up a new clusters, so lets just skip it
|
|
|
|
* and let the allocator find whatever block
|
|
|
|
* it can find. If we reach this point, we
|
|
|
|
* will have tried the cluster allocator
|
|
|
|
* plenty of times and not have found
|
|
|
|
* anything, so we are likely way too
|
|
|
|
* fragmented for the clustering stuff to find
|
|
|
|
* anything. */
|
|
|
|
if (loop >= LOOP_NO_EMPTY_SIZE) {
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
goto unclustered_alloc;
|
|
|
|
}
|
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* this cluster didn't work out, free it and
|
|
|
|
* start over
|
|
|
|
*/
|
|
|
|
btrfs_return_cluster_to_free_space(NULL, last_ptr);
|
|
|
|
|
|
|
|
/* allocate a cluster in this block group */
|
2009-06-10 08:28:34 +08:00
|
|
|
ret = btrfs_find_space_cluster(trans, root,
|
2009-04-03 21:47:43 +08:00
|
|
|
block_group, last_ptr,
|
2011-12-01 02:43:00 +08:00
|
|
|
search_start, num_bytes,
|
2009-04-03 21:47:43 +08:00
|
|
|
empty_cluster + empty_size);
|
|
|
|
if (ret == 0) {
|
|
|
|
/*
|
|
|
|
* now pull our allocation out of this
|
|
|
|
* cluster
|
|
|
|
*/
|
|
|
|
offset = btrfs_alloc_from_cluster(block_group,
|
|
|
|
last_ptr, num_bytes,
|
|
|
|
search_start);
|
|
|
|
if (offset) {
|
|
|
|
/* we found one, proceed */
|
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
goto checks;
|
|
|
|
}
|
2009-09-12 04:11:20 +08:00
|
|
|
} else if (!cached && loop > LOOP_CACHING_NOWAIT
|
|
|
|
&& !failed_cluster_refill) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
|
|
|
|
2009-09-12 04:11:20 +08:00
|
|
|
failed_cluster_refill = true;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
wait_block_group_cache_progress(block_group,
|
|
|
|
num_bytes + empty_cluster + empty_size);
|
|
|
|
goto have_block_group;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
/*
|
|
|
|
* at this point we either didn't find a cluster
|
|
|
|
* or we weren't able to allocate a block from our
|
|
|
|
* cluster. Free the cluster we've been trying
|
|
|
|
* to use, and go to the next block group
|
|
|
|
*/
|
2009-09-12 04:11:20 +08:00
|
|
|
btrfs_return_cluster_to_free_space(NULL, last_ptr);
|
2009-04-03 21:47:43 +08:00
|
|
|
spin_unlock(&last_ptr->refill_lock);
|
2009-09-12 04:11:20 +08:00
|
|
|
goto loop;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
|
|
|
|
2011-12-08 08:50:42 +08:00
|
|
|
unclustered_alloc:
|
2009-04-03 22:14:18 +08:00
|
|
|
offset = btrfs_find_space_for_alloc(block_group, search_start,
|
|
|
|
num_bytes, empty_size);
|
2009-10-06 22:04:28 +08:00
|
|
|
/*
|
|
|
|
* If we didn't find a chunk, and we haven't failed on this
|
|
|
|
* block group before, and this block group is in the middle of
|
|
|
|
* caching and we are ok with waiting, then go ahead and wait
|
|
|
|
* for progress to be made, and set failed_alloc to true.
|
|
|
|
*
|
|
|
|
* If failed_alloc is true then we've already waited on this
|
|
|
|
* block group once and should move on to the next block group.
|
|
|
|
*/
|
|
|
|
if (!offset && !failed_alloc && !cached &&
|
|
|
|
loop > LOOP_CACHING_NOWAIT) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
wait_block_group_cache_progress(block_group,
|
2009-10-06 22:04:28 +08:00
|
|
|
num_bytes + empty_size);
|
|
|
|
failed_alloc = true;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
goto have_block_group;
|
2009-10-06 22:04:28 +08:00
|
|
|
} else if (!offset) {
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
if (!cached)
|
|
|
|
have_caching_bg = true;
|
2009-10-06 22:04:28 +08:00
|
|
|
goto loop;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
checks:
|
2009-04-03 22:14:18 +08:00
|
|
|
search_start = stripe_align(root, offset);
|
2009-04-03 22:14:19 +08:00
|
|
|
/* move on to the next group */
|
2009-04-03 22:14:18 +08:00
|
|
|
if (search_start + num_bytes >= search_end) {
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset, num_bytes);
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
2009-04-03 22:14:18 +08:00
|
|
|
}
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-04-03 22:14:19 +08:00
|
|
|
/* move on to the next group */
|
|
|
|
if (search_start + num_bytes >
|
2011-12-08 09:08:40 +08:00
|
|
|
used_block_group->key.objectid + used_block_group->key.offset) {
|
|
|
|
btrfs_add_free_space(used_block_group, offset, num_bytes);
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
2009-04-03 22:14:18 +08:00
|
|
|
}
|
2008-11-11 00:47:09 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ins->objectid = search_start;
|
|
|
|
ins->offset = num_bytes;
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
if (offset < search_start)
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset,
|
2010-05-16 22:46:25 +08:00
|
|
|
search_start - offset);
|
|
|
|
BUG_ON(offset > search_start);
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2011-12-08 09:08:40 +08:00
|
|
|
ret = btrfs_update_reserved_bytes(used_block_group, num_bytes,
|
2011-07-27 05:00:46 +08:00
|
|
|
alloc_type);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (ret == -EAGAIN) {
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset, num_bytes);
|
2009-04-03 22:14:19 +08:00
|
|
|
goto loop;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
/* we are all good, lets return */
|
2009-04-03 22:14:19 +08:00
|
|
|
ins->objectid = search_start;
|
|
|
|
ins->offset = num_bytes;
|
2008-12-12 05:30:39 +08:00
|
|
|
|
2009-04-03 22:14:18 +08:00
|
|
|
if (offset < search_start)
|
2011-12-08 09:08:40 +08:00
|
|
|
btrfs_add_free_space(used_block_group, offset,
|
2009-04-03 22:14:18 +08:00
|
|
|
search_start - offset);
|
|
|
|
BUG_ON(offset > search_start);
|
2011-12-08 09:08:40 +08:00
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_put_block_group(used_block_group);
|
2011-05-12 03:26:06 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
break;
|
|
|
|
loop:
|
2009-09-12 04:11:20 +08:00
|
|
|
failed_cluster_refill = false;
|
2009-10-06 22:04:28 +08:00
|
|
|
failed_alloc = false;
|
2010-05-16 22:46:24 +08:00
|
|
|
BUG_ON(index != get_block_group_index(block_group));
|
2011-12-08 09:08:40 +08:00
|
|
|
if (used_block_group != block_group)
|
|
|
|
btrfs_put_block_group(used_block_group);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
|
|
|
up_read(&space_info->groups_sem);
|
|
|
|
|
Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-09-09 17:34:35 +08:00
|
|
|
if (!ins->objectid && loop >= LOOP_CACHING_WAIT && have_caching_bg)
|
|
|
|
goto search;
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
if (!ins->objectid && ++index < BTRFS_NR_RAID_TYPES)
|
|
|
|
goto search;
|
|
|
|
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
/* LOOP_FIND_IDEAL, only search caching/cached bg's, and don't wait for
|
|
|
|
* for them to make caching progress. Also
|
|
|
|
* determine the best possible bg to cache
|
|
|
|
* LOOP_CACHING_NOWAIT, search partially cached block groups, kicking
|
|
|
|
* caching kthreads as we move along
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
* LOOP_CACHING_WAIT, search everything, and wait if our bg is caching
|
|
|
|
* LOOP_ALLOC_CHUNK, force a chunk allocation and try again
|
|
|
|
* LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try
|
|
|
|
* again
|
2009-04-03 21:47:43 +08:00
|
|
|
*/
|
2011-05-28 04:11:38 +08:00
|
|
|
if (!ins->objectid && loop < LOOP_NO_EMPTY_SIZE) {
|
2010-05-16 22:46:24 +08:00
|
|
|
index = 0;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
if (loop == LOOP_FIND_IDEAL && found_uncached_bg) {
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
found_uncached_bg = false;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
loop++;
|
2011-07-01 02:42:28 +08:00
|
|
|
if (!ideal_cache_percent)
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
goto search;
|
Btrfs: find ideal block group for caching
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-11-11 10:23:48 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* 1 of the following 2 things have happened so far
|
|
|
|
*
|
|
|
|
* 1) We found an ideal block group for caching that
|
|
|
|
* is mostly full and will cache quickly, so we might
|
|
|
|
* as well wait for it.
|
|
|
|
*
|
|
|
|
* 2) We searched for cached only and we didn't find
|
|
|
|
* anything, and we didn't start any caching kthreads
|
|
|
|
* either, so chances are we will loop through and
|
|
|
|
* start a couple caching kthreads, and then come back
|
|
|
|
* around and just wait for them. This will be slower
|
|
|
|
* because we will have 2 caching kthreads reading at
|
|
|
|
* the same time when we could have just started one
|
|
|
|
* and waited for it to get far enough to give us an
|
|
|
|
* allocation, so go ahead and go to the wait caching
|
|
|
|
* loop.
|
|
|
|
*/
|
|
|
|
loop = LOOP_CACHING_WAIT;
|
|
|
|
search_start = ideal_cache_offset;
|
|
|
|
ideal_cache_percent = 0;
|
|
|
|
goto ideal_cache;
|
|
|
|
} else if (loop == LOOP_FIND_IDEAL) {
|
|
|
|
/*
|
|
|
|
* Didn't find a uncached bg, wait on anything we find
|
|
|
|
* next.
|
|
|
|
*/
|
|
|
|
loop = LOOP_CACHING_WAIT;
|
|
|
|
goto search;
|
|
|
|
}
|
|
|
|
|
2011-05-28 04:11:38 +08:00
|
|
|
loop++;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
|
|
|
if (loop == LOOP_ALLOC_CHUNK) {
|
2011-05-28 04:11:38 +08:00
|
|
|
if (allowed_chunk_alloc) {
|
|
|
|
ret = do_chunk_alloc(trans, root, num_bytes +
|
|
|
|
2 * 1024 * 1024, data,
|
|
|
|
CHUNK_ALLOC_LIMITED);
|
|
|
|
allowed_chunk_alloc = 0;
|
|
|
|
if (ret == 1)
|
|
|
|
done_chunk_alloc = 1;
|
|
|
|
} else if (!done_chunk_alloc &&
|
|
|
|
space_info->force_alloc ==
|
|
|
|
CHUNK_ALLOC_NO_FORCE) {
|
|
|
|
space_info->force_alloc = CHUNK_ALLOC_LIMITED;
|
|
|
|
}
|
2009-04-03 22:14:19 +08:00
|
|
|
|
2011-05-28 04:11:38 +08:00
|
|
|
/*
|
|
|
|
* We didn't allocate a chunk, go ahead and drop the
|
|
|
|
* empty size and loop again.
|
|
|
|
*/
|
|
|
|
if (!done_chunk_alloc)
|
|
|
|
loop = LOOP_NO_EMPTY_SIZE;
|
2009-04-03 22:14:19 +08:00
|
|
|
}
|
|
|
|
|
2011-05-28 04:11:38 +08:00
|
|
|
if (loop == LOOP_NO_EMPTY_SIZE) {
|
|
|
|
empty_size = 0;
|
|
|
|
empty_cluster = 0;
|
2009-04-03 21:47:43 +08:00
|
|
|
}
|
2011-05-28 04:11:38 +08:00
|
|
|
|
|
|
|
goto search;
|
2009-04-03 22:14:19 +08:00
|
|
|
} else if (!ins->objectid) {
|
|
|
|
ret = -ENOSPC;
|
2011-05-12 03:26:06 +08:00
|
|
|
} else if (ins->objectid) {
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
ret = 0;
|
2007-05-06 22:15:01 +08:00
|
|
|
}
|
|
|
|
|
2007-03-01 05:46:22 +08:00
|
|
|
return ret;
|
2007-02-26 23:40:21 +08:00
|
|
|
}
|
2008-04-29 03:29:52 +08:00
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
|
|
|
|
int dump_block_groups)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2010-05-16 22:46:24 +08:00
|
|
|
int index = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_lock(&info->lock);
|
2011-07-27 05:00:46 +08:00
|
|
|
printk(KERN_INFO "space_info %llu has %llu free, is %sfull\n",
|
|
|
|
(unsigned long long)info->flags,
|
2009-01-06 10:25:51 +08:00
|
|
|
(unsigned long long)(info->total_bytes - info->bytes_used -
|
2009-09-12 04:12:44 +08:00
|
|
|
info->bytes_pinned - info->bytes_reserved -
|
2010-05-16 22:49:58 +08:00
|
|
|
info->bytes_readonly),
|
2009-01-06 10:25:51 +08:00
|
|
|
(info->full) ? "" : "not ");
|
2010-05-16 22:49:58 +08:00
|
|
|
printk(KERN_INFO "space_info total=%llu, used=%llu, pinned=%llu, "
|
|
|
|
"reserved=%llu, may_use=%llu, readonly=%llu\n",
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->total_bytes,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_used,
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->bytes_pinned,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_reserved,
|
2009-04-22 03:38:29 +08:00
|
|
|
(unsigned long long)info->bytes_may_use,
|
2010-05-16 22:49:58 +08:00
|
|
|
(unsigned long long)info->bytes_readonly);
|
2009-09-12 04:12:44 +08:00
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
|
|
|
if (!dump_block_groups)
|
|
|
|
return;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_read(&info->groups_sem);
|
2010-05-16 22:46:24 +08:00
|
|
|
again:
|
|
|
|
list_for_each_entry(cache, &info->block_groups[index], list) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
spin_lock(&cache->lock);
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_INFO "block group %llu has %llu bytes, %llu used "
|
|
|
|
"%llu pinned %llu reserved\n",
|
|
|
|
(unsigned long long)cache->key.objectid,
|
|
|
|
(unsigned long long)cache->key.offset,
|
|
|
|
(unsigned long long)btrfs_block_group_used(&cache->item),
|
|
|
|
(unsigned long long)cache->pinned,
|
|
|
|
(unsigned long long)cache->reserved);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
btrfs_dump_free_space(cache, bytes);
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
}
|
2010-05-16 22:46:24 +08:00
|
|
|
if (++index < BTRFS_NR_RAID_TYPES)
|
|
|
|
goto again;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_read(&info->groups_sem);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
}
|
2008-09-26 22:05:48 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 num_bytes, u64 min_alloc_size,
|
|
|
|
u64 empty_size, u64 hint_byte,
|
|
|
|
u64 search_end, struct btrfs_key *ins,
|
|
|
|
u64 data)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
|
|
|
int ret;
|
2007-05-30 22:22:12 +08:00
|
|
|
u64 search_start = 0;
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2009-02-21 00:00:09 +08:00
|
|
|
data = btrfs_get_alloc_profile(root, data);
|
2008-04-14 21:46:10 +08:00
|
|
|
again:
|
2008-05-25 02:04:53 +08:00
|
|
|
/*
|
|
|
|
* the only place that sets empty_size is btrfs_realloc_node, which
|
|
|
|
* is not called recursively on allocations
|
|
|
|
*/
|
2009-12-08 05:45:59 +08:00
|
|
|
if (empty_size || root->ref_cows)
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = do_chunk_alloc(trans, root->fs_info->extent_root,
|
2011-04-16 04:05:44 +08:00
|
|
|
num_bytes + 2 * 1024 * 1024, data,
|
|
|
|
CHUNK_ALLOC_NO_FORCE);
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2007-10-16 04:15:53 +08:00
|
|
|
WARN_ON(num_bytes < root->sectorsize);
|
|
|
|
ret = find_free_extent(trans, root, num_bytes, empty_size,
|
2010-05-16 22:46:25 +08:00
|
|
|
search_start, search_end, hint_byte,
|
|
|
|
ins, data);
|
2008-04-17 23:29:12 +08:00
|
|
|
|
2008-04-14 21:46:10 +08:00
|
|
|
if (ret == -ENOSPC && num_bytes > min_alloc_size) {
|
|
|
|
num_bytes = num_bytes >> 1;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
num_bytes = num_bytes & ~(root->sectorsize - 1);
|
2008-04-14 21:46:10 +08:00
|
|
|
num_bytes = max(num_bytes, min_alloc_size);
|
2008-05-25 02:04:53 +08:00
|
|
|
do_chunk_alloc(trans, root->fs_info->extent_root,
|
2011-04-16 04:05:44 +08:00
|
|
|
num_bytes, data, CHUNK_ALLOC_FORCE);
|
2008-04-14 21:46:10 +08:00
|
|
|
goto again;
|
|
|
|
}
|
2011-02-17 02:10:41 +08:00
|
|
|
if (ret == -ENOSPC && btrfs_test_opt(root, ENOSPC_DEBUG)) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_space_info *sinfo;
|
|
|
|
|
|
|
|
sinfo = __find_space_info(root->fs_info, data);
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "btrfs allocation failed flags %llu, "
|
|
|
|
"wanted %llu\n", (unsigned long long)data,
|
|
|
|
(unsigned long long)num_bytes);
|
2009-09-12 04:12:44 +08:00
|
|
|
dump_space_info(sinfo, num_bytes, 1);
|
2008-06-26 04:01:30 +08:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
trace_btrfs_reserved_extent_alloc(root, ins->objectid, ins->offset);
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return ret;
|
2008-07-18 00:53:50 +08:00
|
|
|
}
|
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
static int __btrfs_free_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len, int pin)
|
2008-08-02 03:11:20 +08:00
|
|
|
{
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
struct btrfs_block_group_cache *cache;
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
int ret = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(root->fs_info, start);
|
|
|
|
if (!cache) {
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "Unable to find block group for %llu\n",
|
|
|
|
(unsigned long long)start);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-03-24 18:24:27 +08:00
|
|
|
if (btrfs_test_opt(root, DISCARD))
|
|
|
|
ret = btrfs_discard_extent(root, start, len, NULL);
|
Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
2009-01-06 04:57:51 +08:00
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
if (pin)
|
|
|
|
pin_down_extent(root, cache, start, len, 1);
|
|
|
|
else {
|
|
|
|
btrfs_add_free_space(cache, start, len);
|
|
|
|
btrfs_update_reserved_bytes(cache, len, RESERVE_FREE);
|
|
|
|
}
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 19:18:59 +08:00
|
|
|
trace_btrfs_reserved_extent_free(root, start, len);
|
|
|
|
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-11-01 08:52:39 +08:00
|
|
|
int btrfs_free_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
|
|
|
return __btrfs_free_reserved_extent(root, start, len, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_free_and_pin_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len)
|
|
|
|
{
|
|
|
|
return __btrfs_free_reserved_extent(root, start, len, 1);
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins, int ref_mod)
|
2008-07-18 00:53:50 +08:00
|
|
|
{
|
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_extent_item *extent_item;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_extent_inline_ref *iref;
|
2008-07-18 00:53:50 +08:00
|
|
|
struct btrfs_path *path;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int type;
|
|
|
|
u32 size;
|
2007-08-09 08:17:12 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (parent > 0)
|
|
|
|
type = BTRFS_SHARED_DATA_REF_KEY;
|
|
|
|
else
|
|
|
|
type = BTRFS_EXTENT_DATA_REF_KEY;
|
2007-08-30 03:47:34 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
|
2007-12-11 22:25:06 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-02-02 03:51:59 +08:00
|
|
|
|
2009-03-13 23:00:37 +08:00
|
|
|
path->leave_spinning = 1;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
|
|
|
|
ins, size);
|
2007-06-29 03:57:36 +08:00
|
|
|
BUG_ON(ret);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent_item = btrfs_item_ptr(leaf, path->slots[0],
|
2008-02-02 03:51:59 +08:00
|
|
|
struct btrfs_extent_item);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_set_extent_refs(leaf, extent_item, ref_mod);
|
|
|
|
btrfs_set_extent_generation(leaf, extent_item, trans->transid);
|
|
|
|
btrfs_set_extent_flags(leaf, extent_item,
|
|
|
|
flags | BTRFS_EXTENT_FLAG_DATA);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref, type);
|
|
|
|
if (parent > 0) {
|
|
|
|
struct btrfs_shared_data_ref *ref;
|
|
|
|
ref = (struct btrfs_shared_data_ref *)(iref + 1);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
btrfs_set_shared_data_ref_count(leaf, ref, ref_mod);
|
|
|
|
} else {
|
|
|
|
struct btrfs_extent_data_ref *ref;
|
|
|
|
ref = (struct btrfs_extent_data_ref *)(&iref->offset);
|
|
|
|
btrfs_set_extent_data_ref_root(leaf, ref, root_objectid);
|
|
|
|
btrfs_set_extent_data_ref_objectid(leaf, ref, owner);
|
|
|
|
btrfs_set_extent_data_ref_offset(leaf, ref, offset);
|
|
|
|
btrfs_set_extent_data_ref_count(leaf, ref, ref_mod);
|
|
|
|
}
|
2008-02-02 03:51:59 +08:00
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(path->nodes[0]);
|
2007-12-11 22:25:06 +08:00
|
|
|
btrfs_free_path(path);
|
2007-10-16 04:14:48 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
|
2008-02-04 23:10:13 +08:00
|
|
|
if (ret) {
|
2009-01-06 10:25:51 +08:00
|
|
|
printk(KERN_ERR "btrfs update block group failed for %llu "
|
|
|
|
"%llu\n", (unsigned long long)ins->objectid,
|
|
|
|
(unsigned long long)ins->offset);
|
2008-02-04 23:10:13 +08:00
|
|
|
BUG();
|
|
|
|
}
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
u64 flags, struct btrfs_disk_key *key,
|
|
|
|
int level, struct btrfs_key *ins)
|
2008-07-18 00:53:50 +08:00
|
|
|
{
|
|
|
|
int ret;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_extent_item *extent_item;
|
|
|
|
struct btrfs_tree_block_info *block_info;
|
|
|
|
struct btrfs_extent_inline_ref *iref;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u32 size = sizeof(*extent_item) + sizeof(*block_info) + sizeof(*iref);
|
2008-09-24 01:14:13 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->leave_spinning = 1;
|
|
|
|
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
|
|
|
|
ins, size);
|
2009-03-13 22:10:06 +08:00
|
|
|
BUG_ON(ret);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
extent_item = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_extent_item);
|
|
|
|
btrfs_set_extent_refs(leaf, extent_item, 1);
|
|
|
|
btrfs_set_extent_generation(leaf, extent_item, trans->transid);
|
|
|
|
btrfs_set_extent_flags(leaf, extent_item,
|
|
|
|
flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
|
|
|
|
block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
|
|
|
|
|
|
|
|
btrfs_set_tree_block_key(leaf, block_info, key);
|
|
|
|
btrfs_set_tree_block_level(leaf, block_info, level);
|
|
|
|
|
|
|
|
iref = (struct btrfs_extent_inline_ref *)(block_info + 1);
|
|
|
|
if (parent > 0) {
|
|
|
|
BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref,
|
|
|
|
BTRFS_SHARED_BLOCK_REF_KEY);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, parent);
|
|
|
|
} else {
|
|
|
|
btrfs_set_extent_inline_ref_type(leaf, iref,
|
|
|
|
BTRFS_TREE_BLOCK_REF_KEY);
|
|
|
|
btrfs_set_extent_inline_ref_offset(leaf, iref, root_objectid);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
btrfs_free_path(path);
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "btrfs update block group failed for %llu "
|
|
|
|
"%llu\n", (unsigned long long)ins->objectid,
|
|
|
|
(unsigned long long)ins->offset);
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, struct btrfs_key *ins)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
|
|
|
|
|
|
|
|
ret = btrfs_add_delayed_data_ref(trans, ins->objectid, ins->offset,
|
|
|
|
0, root_objectid, owner, offset,
|
|
|
|
BTRFS_ADD_DELAYED_EXTENT, NULL);
|
2008-07-18 00:53:50 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-09-06 04:13:11 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* this is used by the tree logging recovery code. It records that
|
|
|
|
* an extent has been allocated and makes sure to clear the free
|
|
|
|
* space cache bits as well
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins)
|
2008-09-06 04:13:11 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
u64 start = ins->objectid;
|
|
|
|
u64 num_bytes = ins->offset;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid);
|
2010-12-08 22:15:11 +08:00
|
|
|
cache_block_group(block_group, trans, NULL, 0);
|
2009-09-12 04:11:19 +08:00
|
|
|
caching_ctl = get_caching_control(block_group);
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
if (!caching_ctl) {
|
|
|
|
BUG_ON(!block_group_cache_done(block_group));
|
|
|
|
ret = btrfs_remove_free_space(block_group, start, num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
} else {
|
|
|
|
mutex_lock(&caching_ctl->mutex);
|
|
|
|
|
|
|
|
if (start >= caching_ctl->progress) {
|
|
|
|
ret = add_excluded_extent(root, start, num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
} else if (start + num_bytes <= caching_ctl->progress) {
|
|
|
|
ret = btrfs_remove_free_space(block_group,
|
|
|
|
start, num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
} else {
|
|
|
|
num_bytes = caching_ctl->progress - start;
|
|
|
|
ret = btrfs_remove_free_space(block_group,
|
|
|
|
start, num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
|
|
|
start = caching_ctl->progress;
|
|
|
|
num_bytes = ins->objectid + ins->offset -
|
|
|
|
caching_ctl->progress;
|
|
|
|
ret = add_excluded_extent(root, start, num_bytes);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_unlock(&caching_ctl->mutex);
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
}
|
|
|
|
|
2011-07-27 05:00:46 +08:00
|
|
|
ret = btrfs_update_reserved_bytes(block_group, ins->offset,
|
|
|
|
RESERVE_ALLOC_NO_ACCOUNT);
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(ret);
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
|
|
|
|
0, owner, offset, ins, 1);
|
2008-09-06 04:13:11 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-08-02 03:11:20 +08:00
|
|
|
struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2009-02-13 03:09:45 +08:00
|
|
|
u64 bytenr, u32 blocksize,
|
|
|
|
int level)
|
2008-08-02 03:11:20 +08:00
|
|
|
{
|
|
|
|
struct extent_buffer *buf;
|
|
|
|
|
|
|
|
buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
|
|
|
|
if (!buf)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
btrfs_set_header_generation(buf, trans->transid);
|
2011-07-27 04:11:19 +08:00
|
|
|
btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
|
2008-08-02 03:11:20 +08:00
|
|
|
btrfs_tree_lock(buf);
|
|
|
|
clean_tree_block(trans, root, buf);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
|
|
|
|
btrfs_set_lock_blocking(buf);
|
2008-08-02 03:11:20 +08:00
|
|
|
btrfs_set_buffer_uptodate(buf);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
|
2008-09-12 04:17:57 +08:00
|
|
|
if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
|
2009-11-12 17:33:26 +08:00
|
|
|
/*
|
|
|
|
* we allow two log transactions at a time, use different
|
|
|
|
* EXENT bit to differentiate dirty pages.
|
|
|
|
*/
|
|
|
|
if (root->log_transid % 2 == 0)
|
|
|
|
set_extent_dirty(&root->dirty_log_pages, buf->start,
|
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
|
|
|
else
|
|
|
|
set_extent_new(&root->dirty_log_pages, buf->start,
|
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
2008-09-12 04:17:57 +08:00
|
|
|
} else {
|
|
|
|
set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
|
2008-08-02 03:11:20 +08:00
|
|
|
buf->start + buf->len - 1, GFP_NOFS);
|
2008-09-12 04:17:57 +08:00
|
|
|
}
|
2008-08-02 03:11:20 +08:00
|
|
|
trans->blocks_used++;
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 22:25:08 +08:00
|
|
|
/* this returns a buffer locked for blocking */
|
2008-08-02 03:11:20 +08:00
|
|
|
return buf;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
static struct btrfs_block_rsv *
|
|
|
|
use_block_rsv(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u32 blocksize)
|
|
|
|
{
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2011-01-25 05:43:20 +08:00
|
|
|
struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
block_rsv = get_block_rsv(trans, root);
|
|
|
|
|
|
|
|
if (block_rsv->size == 0) {
|
2011-10-19 00:15:48 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, blocksize, 0);
|
2011-01-25 05:43:20 +08:00
|
|
|
/*
|
|
|
|
* If we couldn't reserve metadata bytes try and use some from
|
|
|
|
* the global reserve.
|
|
|
|
*/
|
|
|
|
if (ret && block_rsv != global_rsv) {
|
|
|
|
ret = block_rsv_use_bytes(global_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return global_rsv;
|
2010-05-16 22:46:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2011-01-25 05:43:20 +08:00
|
|
|
} else if (ret) {
|
2010-05-16 22:46:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2011-01-25 05:43:20 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
return block_rsv;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = block_rsv_use_bytes(block_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return block_rsv;
|
2011-01-25 05:43:20 +08:00
|
|
|
if (ret) {
|
2011-06-14 18:52:17 +08:00
|
|
|
static DEFINE_RATELIMIT_STATE(_rs,
|
|
|
|
DEFAULT_RATELIMIT_INTERVAL,
|
|
|
|
/*DEFAULT_RATELIMIT_BURST*/ 2);
|
|
|
|
if (__ratelimit(&_rs)) {
|
|
|
|
printk(KERN_DEBUG "btrfs: block rsv returned %d\n", ret);
|
|
|
|
WARN_ON(1);
|
|
|
|
}
|
2011-10-19 00:15:48 +08:00
|
|
|
ret = reserve_metadata_bytes(root, block_rsv, blocksize, 0);
|
2011-01-25 05:43:20 +08:00
|
|
|
if (!ret) {
|
|
|
|
return block_rsv;
|
|
|
|
} else if (ret && block_rsv != global_rsv) {
|
|
|
|
ret = block_rsv_use_bytes(global_rsv, blocksize);
|
|
|
|
if (!ret)
|
|
|
|
return global_rsv;
|
|
|
|
}
|
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
return ERR_PTR(-ENOSPC);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void unuse_block_rsv(struct btrfs_block_rsv *block_rsv, u32 blocksize)
|
|
|
|
{
|
|
|
|
block_rsv_add_bytes(block_rsv, blocksize, 0);
|
|
|
|
block_rsv_release_bytes(block_rsv, NULL, 0);
|
|
|
|
}
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
/*
|
2010-05-16 22:46:25 +08:00
|
|
|
* finds a free extent and does all the dirty work required for allocation
|
|
|
|
* returns the key for the extent through ins, and a tree buffer for
|
|
|
|
* the first block of the extent through buf.
|
|
|
|
*
|
2007-02-26 23:40:21 +08:00
|
|
|
* returns the tree buffer or NULL.
|
|
|
|
*/
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_root *root, u32 blocksize,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
struct btrfs_disk_key *key, int level,
|
|
|
|
u64 hint, u64 empty_size)
|
2007-02-26 23:40:21 +08:00
|
|
|
{
|
2007-03-13 04:22:34 +08:00
|
|
|
struct btrfs_key ins;
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *buf;
|
2010-05-16 22:46:25 +08:00
|
|
|
u64 flags = 0;
|
|
|
|
int ret;
|
|
|
|
|
2007-02-26 23:40:21 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
block_rsv = use_block_rsv(trans, root, blocksize);
|
|
|
|
if (IS_ERR(block_rsv))
|
|
|
|
return ERR_CAST(block_rsv);
|
|
|
|
|
|
|
|
ret = btrfs_reserve_extent(trans, root, blocksize, blocksize,
|
|
|
|
empty_size, hint, (u64)-1, &ins, 0);
|
2007-02-26 23:40:21 +08:00
|
|
|
if (ret) {
|
2010-05-16 22:46:25 +08:00
|
|
|
unuse_block_rsv(block_rsv, blocksize);
|
2007-06-23 02:16:25 +08:00
|
|
|
return ERR_PTR(ret);
|
2007-02-26 23:40:21 +08:00
|
|
|
}
|
2008-01-10 04:55:33 +08:00
|
|
|
|
2009-02-13 03:09:45 +08:00
|
|
|
buf = btrfs_init_new_buffer(trans, root, ins.objectid,
|
|
|
|
blocksize, level);
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(IS_ERR(buf));
|
|
|
|
|
|
|
|
if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
|
|
|
|
if (parent == 0)
|
|
|
|
parent = ins.objectid;
|
|
|
|
flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
} else
|
|
|
|
BUG_ON(parent > 0);
|
|
|
|
|
|
|
|
if (root_objectid != BTRFS_TREE_LOG_OBJECTID) {
|
|
|
|
struct btrfs_delayed_extent_op *extent_op;
|
|
|
|
extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
|
|
|
|
BUG_ON(!extent_op);
|
|
|
|
if (key)
|
|
|
|
memcpy(&extent_op->key, key, sizeof(extent_op->key));
|
|
|
|
else
|
|
|
|
memset(&extent_op->key, 0, sizeof(extent_op->key));
|
|
|
|
extent_op->flags_to_set = flags;
|
|
|
|
extent_op->update_key = 1;
|
|
|
|
extent_op->update_flags = 1;
|
|
|
|
extent_op->is_data = 0;
|
|
|
|
|
|
|
|
ret = btrfs_add_delayed_tree_ref(trans, ins.objectid,
|
|
|
|
ins.offset, parent, root_objectid,
|
|
|
|
level, BTRFS_ADD_DELAYED_EXTENT,
|
|
|
|
extent_op);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
2007-02-26 23:40:21 +08:00
|
|
|
return buf;
|
|
|
|
}
|
2007-03-07 09:08:01 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control {
|
|
|
|
u64 refs[BTRFS_MAX_LEVEL];
|
|
|
|
u64 flags[BTRFS_MAX_LEVEL];
|
|
|
|
struct btrfs_key update_progress;
|
|
|
|
int stage;
|
|
|
|
int level;
|
|
|
|
int shared_level;
|
|
|
|
int update_ref;
|
|
|
|
int keep_locks;
|
2009-09-22 03:55:59 +08:00
|
|
|
int reada_slot;
|
|
|
|
int reada_count;
|
2009-06-28 09:07:35 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
#define DROP_REFERENCE 1
|
|
|
|
#define UPDATE_BACKREF 2
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct walk_control *wc,
|
|
|
|
struct btrfs_path *path)
|
2007-03-27 18:33:00 +08:00
|
|
|
{
|
2009-09-22 03:55:59 +08:00
|
|
|
u64 bytenr;
|
|
|
|
u64 generation;
|
|
|
|
u64 refs;
|
2009-10-09 21:25:16 +08:00
|
|
|
u64 flags;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
u32 nritems;
|
2009-09-22 03:55:59 +08:00
|
|
|
u32 blocksize;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *eb;
|
2007-03-27 18:33:00 +08:00
|
|
|
int ret;
|
2009-09-22 03:55:59 +08:00
|
|
|
int slot;
|
|
|
|
int nread = 0;
|
2007-03-27 18:33:00 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (path->slots[wc->level] < wc->reada_slot) {
|
|
|
|
wc->reada_count = wc->reada_count * 2 / 3;
|
|
|
|
wc->reada_count = max(wc->reada_count, 2);
|
|
|
|
} else {
|
|
|
|
wc->reada_count = wc->reada_count * 3 / 2;
|
|
|
|
wc->reada_count = min_t(int, wc->reada_count,
|
|
|
|
BTRFS_NODEPTRS_PER_BLOCK(root));
|
|
|
|
}
|
2007-12-11 22:25:06 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
eb = path->nodes[wc->level];
|
|
|
|
nritems = btrfs_header_nritems(eb);
|
|
|
|
blocksize = btrfs_level_size(root, wc->level - 1);
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
for (slot = path->slots[wc->level]; slot < nritems; slot++) {
|
|
|
|
if (nread >= wc->reada_count)
|
|
|
|
break;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2008-08-04 20:20:15 +08:00
|
|
|
cond_resched();
|
2009-09-22 03:55:59 +08:00
|
|
|
bytenr = btrfs_node_blockptr(eb, slot);
|
|
|
|
generation = btrfs_node_ptr_generation(eb, slot);
|
2008-08-04 20:20:15 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (slot == path->slots[wc->level])
|
|
|
|
goto reada;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
|
|
|
generation <= root->root_key.offset)
|
2009-02-04 22:27:02 +08:00
|
|
|
continue;
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
/* We don't lock the tree block, it's OK to be racy here */
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root, bytenr, blocksize,
|
|
|
|
&refs, &flags);
|
|
|
|
BUG_ON(ret);
|
|
|
|
BUG_ON(refs == 0);
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (refs == 1)
|
|
|
|
goto reada;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->level == 1 &&
|
|
|
|
(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
continue;
|
2009-09-22 03:55:59 +08:00
|
|
|
if (!wc->update_ref ||
|
|
|
|
generation <= root->root_key.offset)
|
|
|
|
continue;
|
|
|
|
btrfs_node_key_to_cpu(eb, &key, slot);
|
|
|
|
ret = btrfs_comp_cpu_keys(&key,
|
|
|
|
&wc->update_progress);
|
|
|
|
if (ret < 0)
|
|
|
|
continue;
|
2009-10-09 21:25:16 +08:00
|
|
|
} else {
|
|
|
|
if (wc->level == 1 &&
|
|
|
|
(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
continue;
|
2007-03-27 18:33:00 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
reada:
|
|
|
|
ret = readahead_tree_block(root, bytenr, blocksize,
|
|
|
|
generation);
|
|
|
|
if (ret)
|
2009-02-04 22:27:02 +08:00
|
|
|
break;
|
2009-09-22 03:55:59 +08:00
|
|
|
nread++;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_slot = slot;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2008-10-30 02:49:05 +08:00
|
|
|
/*
|
2009-06-28 09:07:35 +08:00
|
|
|
* hepler to process tree block while walking down the tree.
|
|
|
|
*
|
|
|
|
* when wc->stage == UPDATE_BACKREF, this function updates
|
|
|
|
* back refs for pointers in the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking down.
|
2008-10-30 02:49:05 +08:00
|
|
|
*/
|
2009-06-28 09:07:35 +08:00
|
|
|
static noinline int walk_down_proc(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
struct btrfs_root *root,
|
2009-06-28 09:07:35 +08:00
|
|
|
struct btrfs_path *path,
|
2009-10-09 21:25:16 +08:00
|
|
|
struct walk_control *wc, int lookup_info)
|
2008-10-30 02:49:05 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
|
|
|
struct extent_buffer *eb = path->nodes[level];
|
|
|
|
u64 flag = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
2008-10-30 02:49:05 +08:00
|
|
|
int ret;
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
|
|
|
btrfs_header_owner(eb) != root->root_key.objectid)
|
|
|
|
return 1;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* when reference count of tree block is 1, it won't increase
|
|
|
|
* again. once full backref flag is set, we never clear it.
|
|
|
|
*/
|
2009-10-09 21:25:16 +08:00
|
|
|
if (lookup_info &&
|
|
|
|
((wc->stage == DROP_REFERENCE && wc->refs[level] != 1) ||
|
|
|
|
(wc->stage == UPDATE_BACKREF && !(wc->flags[level] & flag)))) {
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(!path->locks[level]);
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
eb->start, eb->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
|
|
|
BUG_ON(ret);
|
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (wc->refs[level] > 1)
|
|
|
|
return 1;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (path->locks[level] && !wc->keep_locks) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/* wc->stage == UPDATE_BACKREF */
|
|
|
|
if (!(wc->flags[level] & flag)) {
|
|
|
|
BUG_ON(!path->locks[level]);
|
|
|
|
ret = btrfs_inc_ref(trans, root, eb, 1);
|
2008-10-30 02:49:05 +08:00
|
|
|
BUG_ON(ret);
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 0);
|
|
|
|
BUG_ON(ret);
|
|
|
|
ret = btrfs_set_disk_extent_flags(trans, root, eb->start,
|
|
|
|
eb->len, flag, 0);
|
|
|
|
BUG_ON(ret);
|
|
|
|
wc->flags[level] |= flag;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the block is shared by multiple trees, so it's not good to
|
|
|
|
* keep the tree lock
|
|
|
|
*/
|
|
|
|
if (path->locks[level] && level > 0) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
/*
|
|
|
|
* hepler to process tree block pointer.
|
|
|
|
*
|
|
|
|
* when wc->stage == DROP_REFERENCE, this function checks
|
|
|
|
* reference count of the block pointed to. if the block
|
|
|
|
* is shared and we need update back refs for the subtree
|
|
|
|
* rooted at the block, this function changes wc->stage to
|
|
|
|
* UPDATE_BACKREF. if the block is shared and there is no
|
|
|
|
* need to update back, this function drops the reference
|
|
|
|
* to the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking down.
|
|
|
|
*/
|
|
|
|
static noinline int do_walk_down(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2009-10-09 21:25:16 +08:00
|
|
|
struct walk_control *wc, int *lookup_info)
|
2009-09-22 03:55:59 +08:00
|
|
|
{
|
|
|
|
u64 bytenr;
|
|
|
|
u64 generation;
|
|
|
|
u64 parent;
|
|
|
|
u32 blocksize;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *next;
|
|
|
|
int level = wc->level;
|
|
|
|
int reada = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
generation = btrfs_node_ptr_generation(path->nodes[level],
|
|
|
|
path->slots[level]);
|
|
|
|
/*
|
|
|
|
* if the lower level block was created before the snapshot
|
|
|
|
* was created, we know there is no need to update back refs
|
|
|
|
* for the subtree
|
|
|
|
*/
|
|
|
|
if (wc->stage == UPDATE_BACKREF &&
|
2009-10-09 21:25:16 +08:00
|
|
|
generation <= root->root_key.offset) {
|
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
return 1;
|
2009-10-09 21:25:16 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
|
|
|
|
bytenr = btrfs_node_blockptr(path->nodes[level], path->slots[level]);
|
|
|
|
blocksize = btrfs_level_size(root, level - 1);
|
|
|
|
|
|
|
|
next = btrfs_find_tree_block(root, bytenr, blocksize);
|
|
|
|
if (!next) {
|
|
|
|
next = btrfs_find_create_tree_block(root, bytenr, blocksize);
|
2010-03-25 20:37:12 +08:00
|
|
|
if (!next)
|
|
|
|
return -ENOMEM;
|
2009-09-22 03:55:59 +08:00
|
|
|
reada = 1;
|
|
|
|
}
|
|
|
|
btrfs_tree_lock(next);
|
|
|
|
btrfs_set_lock_blocking(next);
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = btrfs_lookup_extent_info(trans, root, bytenr, blocksize,
|
|
|
|
&wc->refs[level - 1],
|
|
|
|
&wc->flags[level - 1]);
|
|
|
|
BUG_ON(ret);
|
|
|
|
BUG_ON(wc->refs[level - 1] == 0);
|
|
|
|
*lookup_info = 0;
|
2009-09-22 03:55:59 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
2009-09-22 03:55:59 +08:00
|
|
|
if (wc->refs[level - 1] > 1) {
|
2009-10-09 21:25:16 +08:00
|
|
|
if (level == 1 &&
|
|
|
|
(wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
goto skip;
|
|
|
|
|
2009-09-22 03:55:59 +08:00
|
|
|
if (!wc->update_ref ||
|
|
|
|
generation <= root->root_key.offset)
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
btrfs_node_key_to_cpu(path->nodes[level], &key,
|
|
|
|
path->slots[level]);
|
|
|
|
ret = btrfs_comp_cpu_keys(&key, &wc->update_progress);
|
|
|
|
if (ret < 0)
|
|
|
|
goto skip;
|
|
|
|
|
|
|
|
wc->stage = UPDATE_BACKREF;
|
|
|
|
wc->shared_level = level - 1;
|
|
|
|
}
|
2009-10-09 21:25:16 +08:00
|
|
|
} else {
|
|
|
|
if (level == 1 &&
|
|
|
|
(wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
|
|
|
|
goto skip;
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!btrfs_buffer_uptodate(next, generation)) {
|
|
|
|
btrfs_tree_unlock(next);
|
|
|
|
free_extent_buffer(next);
|
|
|
|
next = NULL;
|
2009-10-09 21:25:16 +08:00
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!next) {
|
|
|
|
if (reada && level == 1)
|
|
|
|
reada_walk_down(trans, root, wc, path);
|
|
|
|
next = read_tree_block(root, bytenr, blocksize, generation);
|
2011-03-24 14:33:21 +08:00
|
|
|
if (!next)
|
|
|
|
return -EIO;
|
2009-09-22 03:55:59 +08:00
|
|
|
btrfs_tree_lock(next);
|
|
|
|
btrfs_set_lock_blocking(next);
|
|
|
|
}
|
|
|
|
|
|
|
|
level--;
|
|
|
|
BUG_ON(level != btrfs_header_level(next));
|
|
|
|
path->nodes[level] = next;
|
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->level = level;
|
|
|
|
if (wc->level == 1)
|
|
|
|
wc->reada_slot = 0;
|
|
|
|
return 0;
|
|
|
|
skip:
|
|
|
|
wc->refs[level - 1] = 0;
|
|
|
|
wc->flags[level - 1] = 0;
|
2009-10-09 21:25:16 +08:00
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
|
|
|
|
parent = path->nodes[level]->start;
|
|
|
|
} else {
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(path->nodes[level]));
|
|
|
|
parent = 0;
|
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
|
|
|
|
root->root_key.objectid, level - 1, 0);
|
|
|
|
BUG_ON(ret);
|
2009-09-22 03:55:59 +08:00
|
|
|
}
|
|
|
|
btrfs_tree_unlock(next);
|
|
|
|
free_extent_buffer(next);
|
2009-10-09 21:25:16 +08:00
|
|
|
*lookup_info = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* hepler to process tree block while walking up the tree.
|
|
|
|
*
|
|
|
|
* when wc->stage == DROP_REFERENCE, this function drops
|
|
|
|
* reference count on the block.
|
|
|
|
*
|
|
|
|
* when wc->stage == UPDATE_BACKREF, this function changes
|
|
|
|
* wc->stage back to DROP_REFERENCE if we changed wc->stage
|
|
|
|
* to UPDATE_BACKREF previously while processing the block.
|
|
|
|
*
|
|
|
|
* NOTE: return value 1 means we should stop walking up.
|
|
|
|
*/
|
|
|
|
static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct walk_control *wc)
|
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret;
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
|
|
|
struct extent_buffer *eb = path->nodes[level];
|
|
|
|
u64 parent = 0;
|
|
|
|
|
|
|
|
if (wc->stage == UPDATE_BACKREF) {
|
|
|
|
BUG_ON(wc->shared_level < level);
|
|
|
|
if (level < wc->shared_level)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = find_next_key(path, level + 1, &wc->update_progress);
|
|
|
|
if (ret > 0)
|
|
|
|
wc->update_ref = 0;
|
|
|
|
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
path->slots[level] = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check reference count again if the block isn't locked.
|
|
|
|
* we should start walking down the tree again if reference
|
|
|
|
* count is one.
|
|
|
|
*/
|
|
|
|
if (!path->locks[level]) {
|
|
|
|
BUG_ON(level == 0);
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
btrfs_set_lock_blocking(eb);
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
eb->start, eb->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
2008-10-30 02:49:05 +08:00
|
|
|
BUG_ON(ret);
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
if (wc->refs[level] == 1) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(eb, path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
return 1;
|
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/* wc->stage == DROP_REFERENCE */
|
|
|
|
BUG_ON(wc->refs[level] > 1 && !path->locks[level]);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (wc->refs[level] == 1) {
|
|
|
|
if (level == 0) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 1);
|
|
|
|
else
|
|
|
|
ret = btrfs_dec_ref(trans, root, eb, 0);
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
/* make block locked assertion in clean_tree_block happy */
|
|
|
|
if (!path->locks[level] &&
|
|
|
|
btrfs_header_generation(eb) == trans->transid) {
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
btrfs_set_lock_blocking(eb);
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
clean_tree_block(trans, root, eb);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (eb == root->node) {
|
|
|
|
if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
|
|
|
parent = eb->start;
|
|
|
|
else
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(eb));
|
|
|
|
} else {
|
|
|
|
if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
|
|
|
|
parent = path->nodes[level + 1]->start;
|
|
|
|
else
|
|
|
|
BUG_ON(root->root_key.objectid !=
|
|
|
|
btrfs_header_owner(path->nodes[level + 1]));
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
btrfs_free_tree_block(trans, root, eb, parent, wc->refs[level] == 1);
|
2009-06-28 09:07:35 +08:00
|
|
|
out:
|
|
|
|
wc->refs[level] = 0;
|
|
|
|
wc->flags[level] = 0;
|
2010-05-16 22:46:25 +08:00
|
|
|
return 0;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int walk_down_tree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct walk_control *wc)
|
|
|
|
{
|
|
|
|
int level = wc->level;
|
2009-10-09 21:25:16 +08:00
|
|
|
int lookup_info = 1;
|
2009-06-28 09:07:35 +08:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (level >= 0) {
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = walk_down_proc(trans, root, path, wc, lookup_info);
|
2009-06-28 09:07:35 +08:00
|
|
|
if (ret > 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (level == 0)
|
|
|
|
break;
|
|
|
|
|
2010-02-01 10:41:17 +08:00
|
|
|
if (path->slots[level] >=
|
|
|
|
btrfs_header_nritems(path->nodes[level]))
|
|
|
|
break;
|
|
|
|
|
2009-10-09 21:25:16 +08:00
|
|
|
ret = do_walk_down(trans, root, path, wc, &lookup_info);
|
2009-09-22 03:55:59 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
path->slots[level]++;
|
|
|
|
continue;
|
2010-03-25 20:37:12 +08:00
|
|
|
} else if (ret < 0)
|
|
|
|
return ret;
|
2009-09-22 03:55:59 +08:00
|
|
|
level = wc->level;
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
|
2008-01-03 23:01:48 +08:00
|
|
|
struct btrfs_root *root,
|
2008-10-30 02:49:05 +08:00
|
|
|
struct btrfs_path *path,
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc, int max_level)
|
2007-03-10 19:35:47 +08:00
|
|
|
{
|
2009-06-28 09:07:35 +08:00
|
|
|
int level = wc->level;
|
2007-03-10 19:35:47 +08:00
|
|
|
int ret;
|
2007-08-08 03:52:19 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
path->slots[level] = btrfs_header_nritems(path->nodes[level]);
|
|
|
|
while (level < max_level && path->nodes[level]) {
|
|
|
|
wc->level = level;
|
|
|
|
if (path->slots[level] + 1 <
|
|
|
|
btrfs_header_nritems(path->nodes[level])) {
|
|
|
|
path->slots[level]++;
|
2007-03-10 19:35:47 +08:00
|
|
|
return 0;
|
|
|
|
} else {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_up_proc(trans, root, path, wc);
|
|
|
|
if (ret > 0)
|
|
|
|
return 0;
|
2009-02-04 22:27:02 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
if (path->locks[level]) {
|
2011-07-17 03:23:14 +08:00
|
|
|
btrfs_tree_unlock_rw(path->nodes[level],
|
|
|
|
path->locks[level]);
|
2009-06-28 09:07:35 +08:00
|
|
|
path->locks[level] = 0;
|
2008-10-30 02:49:05 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
free_extent_buffer(path->nodes[level]);
|
|
|
|
path->nodes[level] = NULL;
|
|
|
|
level++;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2007-03-13 23:09:37 +08:00
|
|
|
/*
|
2009-06-28 09:07:35 +08:00
|
|
|
* drop a subvolume tree.
|
|
|
|
*
|
|
|
|
* this function traverses the tree freeing any blocks that only
|
|
|
|
* referenced by the tree.
|
|
|
|
*
|
|
|
|
* when a shared tree block is found. this function decreases its
|
|
|
|
* reference count by one. if update_ref is true, this function
|
|
|
|
* also make sure backrefs for the shared block and all lower level
|
|
|
|
* blocks are properly updated.
|
2007-03-13 23:09:37 +08:00
|
|
|
*/
|
2011-08-09 15:11:13 +08:00
|
|
|
void btrfs_drop_snapshot(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv, int update_ref)
|
2007-03-10 19:35:47 +08:00
|
|
|
{
|
2007-04-02 23:20:42 +08:00
|
|
|
struct btrfs_path *path;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_root *tree_root = root->fs_info->tree_root;
|
2007-08-08 03:52:19 +08:00
|
|
|
struct btrfs_root_item *root_item = &root->root_item;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc;
|
|
|
|
struct btrfs_key key;
|
|
|
|
int err = 0;
|
|
|
|
int ret;
|
|
|
|
int level;
|
2007-03-10 19:35:47 +08:00
|
|
|
|
2007-04-02 23:20:42 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-08-09 15:11:13 +08:00
|
|
|
if (!path) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2007-03-10 19:35:47 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wc = kzalloc(sizeof(*wc), GFP_NOFS);
|
2011-07-14 01:59:59 +08:00
|
|
|
if (!wc) {
|
|
|
|
btrfs_free_path(path);
|
2011-08-09 15:11:13 +08:00
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
2011-07-14 01:59:59 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(tree_root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
BUG_ON(IS_ERR(trans));
|
|
|
|
|
2010-05-16 22:49:59 +08:00
|
|
|
if (block_rsv)
|
|
|
|
trans->block_rsv = block_rsv;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2007-08-08 03:52:19 +08:00
|
|
|
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
|
2009-06-28 09:07:35 +08:00
|
|
|
level = btrfs_header_level(root->node);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
path->nodes[level] = btrfs_lock_root_node(root);
|
|
|
|
btrfs_set_lock_blocking(path->nodes[level]);
|
2007-08-08 03:52:19 +08:00
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
memset(&wc->update_progress, 0,
|
|
|
|
sizeof(wc->update_progress));
|
2007-08-08 03:52:19 +08:00
|
|
|
} else {
|
|
|
|
btrfs_disk_key_to_cpu(&key, &root_item->drop_progress);
|
2009-06-28 09:07:35 +08:00
|
|
|
memcpy(&wc->update_progress, &key,
|
|
|
|
sizeof(wc->update_progress));
|
|
|
|
|
2007-08-08 04:15:09 +08:00
|
|
|
level = root_item->drop_level;
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(level == 0);
|
2007-08-08 04:15:09 +08:00
|
|
|
path->lowest_level = level;
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
path->lowest_level = 0;
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2011-08-09 15:11:13 +08:00
|
|
|
goto out_free;
|
2007-08-08 03:52:19 +08:00
|
|
|
}
|
2009-09-22 03:55:59 +08:00
|
|
|
WARN_ON(ret > 0);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2008-07-09 02:19:17 +08:00
|
|
|
/*
|
|
|
|
* unlock our path, this is safe because only this
|
|
|
|
* function is allowed to delete this snapshot
|
|
|
|
*/
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
btrfs_unlock_up_safe(path, 0);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
level = btrfs_header_level(root->node);
|
|
|
|
while (1) {
|
|
|
|
btrfs_tree_lock(path->nodes[level]);
|
|
|
|
btrfs_set_lock_blocking(path->nodes[level]);
|
|
|
|
|
|
|
|
ret = btrfs_lookup_extent_info(trans, root,
|
|
|
|
path->nodes[level]->start,
|
|
|
|
path->nodes[level]->len,
|
|
|
|
&wc->refs[level],
|
|
|
|
&wc->flags[level]);
|
|
|
|
BUG_ON(ret);
|
|
|
|
BUG_ON(wc->refs[level] == 0);
|
|
|
|
|
|
|
|
if (level == root_item->drop_level)
|
|
|
|
break;
|
|
|
|
|
|
|
|
btrfs_tree_unlock(path->nodes[level]);
|
|
|
|
WARN_ON(wc->refs[level] != 1);
|
|
|
|
level--;
|
|
|
|
}
|
2007-08-08 03:52:19 +08:00
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
wc->level = level;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->update_ref = update_ref;
|
|
|
|
wc->keep_locks = 0;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_down_tree(trans, root, path, wc);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2007-03-10 19:35:47 +08:00
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2007-03-13 23:09:37 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = walk_up_tree(trans, root, path, wc, BTRFS_MAX_LEVEL);
|
|
|
|
if (ret < 0) {
|
|
|
|
err = ret;
|
2007-03-10 19:35:47 +08:00
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (ret > 0) {
|
|
|
|
BUG_ON(wc->stage != DROP_REFERENCE);
|
2008-06-26 04:01:31 +08:00
|
|
|
break;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
if (wc->stage == DROP_REFERENCE) {
|
|
|
|
level = wc->level;
|
|
|
|
btrfs_node_key(path->nodes[level],
|
|
|
|
&root_item->drop_progress,
|
|
|
|
path->slots[level]);
|
|
|
|
root_item->drop_level = level;
|
|
|
|
}
|
|
|
|
|
|
|
|
BUG_ON(wc->level == 0);
|
2010-05-16 22:49:59 +08:00
|
|
|
if (btrfs_should_end_transaction(trans, tree_root)) {
|
2009-06-28 09:07:35 +08:00
|
|
|
ret = btrfs_update_root(trans, tree_root,
|
|
|
|
&root->root_key,
|
|
|
|
root_item);
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
2010-05-16 22:49:59 +08:00
|
|
|
btrfs_end_transaction_throttle(trans, tree_root);
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(tree_root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
BUG_ON(IS_ERR(trans));
|
2010-05-16 22:49:59 +08:00
|
|
|
if (block_rsv)
|
|
|
|
trans->block_rsv = block_rsv;
|
2009-03-13 22:17:05 +08:00
|
|
|
}
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(err);
|
|
|
|
|
|
|
|
ret = btrfs_del_root(trans, tree_root, &root->root_key);
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
2009-09-22 04:00:26 +08:00
|
|
|
if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
|
|
|
|
ret = btrfs_find_last_root(tree_root, root->root_key.objectid,
|
|
|
|
NULL, NULL);
|
|
|
|
BUG_ON(ret < 0);
|
|
|
|
if (ret > 0) {
|
2010-12-09 01:24:01 +08:00
|
|
|
/* if we fail to delete the orphan item this time
|
|
|
|
* around, it'll get picked up the next time.
|
|
|
|
*
|
|
|
|
* The most common failure here is just -ENOENT.
|
|
|
|
*/
|
|
|
|
btrfs_del_orphan_item(trans, tree_root,
|
|
|
|
root->root_key.objectid);
|
2009-09-22 04:00:26 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (root->in_radix) {
|
|
|
|
btrfs_free_fs_root(tree_root->fs_info, root);
|
|
|
|
} else {
|
|
|
|
free_extent_buffer(root->node);
|
|
|
|
free_extent_buffer(root->commit_root);
|
|
|
|
kfree(root);
|
|
|
|
}
|
2011-08-09 15:11:13 +08:00
|
|
|
out_free:
|
2010-05-16 22:49:59 +08:00
|
|
|
btrfs_end_transaction_throttle(trans, tree_root);
|
2009-06-28 09:07:35 +08:00
|
|
|
kfree(wc);
|
2007-04-02 23:20:42 +08:00
|
|
|
btrfs_free_path(path);
|
2011-08-09 15:11:13 +08:00
|
|
|
out:
|
|
|
|
if (err)
|
|
|
|
btrfs_std_error(root->fs_info, err);
|
|
|
|
return;
|
2007-03-10 19:35:47 +08:00
|
|
|
}
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
/*
|
|
|
|
* drop subtree rooted at tree block 'node'.
|
|
|
|
*
|
|
|
|
* NOTE: this function will unlock and release tree block 'node'
|
|
|
|
*/
|
2008-10-30 02:49:05 +08:00
|
|
|
int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *node,
|
|
|
|
struct extent_buffer *parent)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
2009-06-28 09:07:35 +08:00
|
|
|
struct walk_control *wc;
|
2008-10-30 02:49:05 +08:00
|
|
|
int level;
|
|
|
|
int parent_level;
|
|
|
|
int ret = 0;
|
|
|
|
int wret;
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
BUG_ON(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID);
|
|
|
|
|
2008-10-30 02:49:05 +08:00
|
|
|
path = btrfs_alloc_path();
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wc = kzalloc(sizeof(*wc), GFP_NOFS);
|
2011-03-23 16:14:16 +08:00
|
|
|
if (!wc) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2009-06-28 09:07:35 +08:00
|
|
|
|
2009-03-09 23:45:38 +08:00
|
|
|
btrfs_assert_tree_locked(parent);
|
2008-10-30 02:49:05 +08:00
|
|
|
parent_level = btrfs_header_level(parent);
|
|
|
|
extent_buffer_get(parent);
|
|
|
|
path->nodes[parent_level] = parent;
|
|
|
|
path->slots[parent_level] = btrfs_header_nritems(parent);
|
|
|
|
|
2009-03-09 23:45:38 +08:00
|
|
|
btrfs_assert_tree_locked(node);
|
2008-10-30 02:49:05 +08:00
|
|
|
level = btrfs_header_level(node);
|
|
|
|
path->nodes[level] = node;
|
|
|
|
path->slots[level] = 0;
|
2011-07-17 03:23:14 +08:00
|
|
|
path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
|
2009-06-28 09:07:35 +08:00
|
|
|
|
|
|
|
wc->refs[parent_level] = 1;
|
|
|
|
wc->flags[parent_level] = BTRFS_BLOCK_FLAG_FULL_BACKREF;
|
|
|
|
wc->level = level;
|
|
|
|
wc->shared_level = -1;
|
|
|
|
wc->stage = DROP_REFERENCE;
|
|
|
|
wc->update_ref = 0;
|
|
|
|
wc->keep_locks = 1;
|
2009-09-22 03:55:59 +08:00
|
|
|
wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
|
2008-10-30 02:49:05 +08:00
|
|
|
|
|
|
|
while (1) {
|
2009-06-28 09:07:35 +08:00
|
|
|
wret = walk_down_tree(trans, root, path, wc);
|
|
|
|
if (wret < 0) {
|
2008-10-30 02:49:05 +08:00
|
|
|
ret = wret;
|
|
|
|
break;
|
2009-06-28 09:07:35 +08:00
|
|
|
}
|
2008-10-30 02:49:05 +08:00
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
wret = walk_up_tree(trans, root, path, wc, parent_level);
|
2008-10-30 02:49:05 +08:00
|
|
|
if (wret < 0)
|
|
|
|
ret = wret;
|
|
|
|
if (wret != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2009-06-28 09:07:35 +08:00
|
|
|
kfree(wc);
|
2008-10-30 02:49:05 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
|
|
|
|
{
|
|
|
|
u64 num_devices;
|
|
|
|
u64 stripped = BTRFS_BLOCK_GROUP_RAID0 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
|
|
|
|
|
2010-12-14 03:56:23 +08:00
|
|
|
/*
|
|
|
|
* we add in the count of missing devices because we want
|
|
|
|
* to make sure that any RAID levels on a degraded FS
|
|
|
|
* continue to be honored.
|
|
|
|
*/
|
|
|
|
num_devices = root->fs_info->fs_devices->rw_devices +
|
|
|
|
root->fs_info->fs_devices->missing_devices;
|
|
|
|
|
2008-04-29 03:29:52 +08:00
|
|
|
if (num_devices == 1) {
|
|
|
|
stripped |= BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
stripped = flags & ~stripped;
|
|
|
|
|
|
|
|
/* turn raid0 into single device chunks */
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_RAID0)
|
|
|
|
return stripped;
|
|
|
|
|
|
|
|
/* turn mirroring into duplication */
|
|
|
|
if (flags & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
return stripped | BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
return flags;
|
|
|
|
} else {
|
|
|
|
/* they already had raid on here, just return */
|
|
|
|
if (flags & stripped)
|
|
|
|
return flags;
|
|
|
|
|
|
|
|
stripped |= BTRFS_BLOCK_GROUP_DUP;
|
|
|
|
stripped = flags & ~stripped;
|
|
|
|
|
|
|
|
/* switch duplicated blocks with raid1 */
|
|
|
|
if (flags & BTRFS_BLOCK_GROUP_DUP)
|
|
|
|
return stripped | BTRFS_BLOCK_GROUP_RAID1;
|
|
|
|
|
|
|
|
/* turn single device chunks into raid0 */
|
|
|
|
return stripped | BTRFS_BLOCK_GROUP_RAID0;
|
|
|
|
}
|
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
|
2008-05-25 02:04:53 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
2011-07-15 18:34:36 +08:00
|
|
|
u64 min_allocable_bytes;
|
2010-05-16 22:46:25 +08:00
|
|
|
int ret = -ENOSPC;
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
/*
|
|
|
|
* We need some metadata space and system metadata space for
|
|
|
|
* allocating chunks in some corner cases until we force to set
|
|
|
|
* it to be readonly.
|
|
|
|
*/
|
|
|
|
if ((sinfo->flags &
|
|
|
|
(BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
|
|
|
|
!force)
|
|
|
|
min_allocable_bytes = 1 * 1024 * 1024;
|
|
|
|
else
|
|
|
|
min_allocable_bytes = 0;
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
2011-07-26 11:30:11 +08:00
|
|
|
|
|
|
|
if (cache->ro) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
num_bytes = cache->key.offset - cache->reserved - cache->pinned -
|
|
|
|
cache->bytes_super - btrfs_block_group_used(&cache->item);
|
|
|
|
|
|
|
|
if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
|
2011-08-05 22:25:38 +08:00
|
|
|
sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
|
|
|
|
min_allocable_bytes <= sinfo->total_bytes) {
|
2010-05-16 22:46:25 +08:00
|
|
|
sinfo->bytes_readonly += num_bytes;
|
|
|
|
cache->ro = 1;
|
|
|
|
ret = 0;
|
|
|
|
}
|
2011-07-26 11:30:11 +08:00
|
|
|
out:
|
2010-05-16 22:46:25 +08:00
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
int btrfs_set_block_group_ro(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
{
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 alloc_flags;
|
|
|
|
int ret;
|
2008-07-09 02:19:17 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(cache->ro);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2011-05-28 19:00:39 +08:00
|
|
|
trans = btrfs_join_transaction(root);
|
2010-05-16 22:46:25 +08:00
|
|
|
BUG_ON(IS_ERR(trans));
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
alloc_flags = update_block_group_flags(root, cache->flags);
|
|
|
|
if (alloc_flags != cache->flags)
|
2011-04-16 04:05:44 +08:00
|
|
|
do_chunk_alloc(trans, root, 2 * 1024 * 1024, alloc_flags,
|
|
|
|
CHUNK_ALLOC_FORCE);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-07-15 18:34:36 +08:00
|
|
|
ret = set_block_group_ro(cache, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (!ret)
|
|
|
|
goto out;
|
|
|
|
alloc_flags = get_alloc_profile(root, cache->space_info->flags);
|
2011-04-16 04:05:44 +08:00
|
|
|
ret = do_chunk_alloc(trans, root, 2 * 1024 * 1024, alloc_flags,
|
|
|
|
CHUNK_ALLOC_FORCE);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2011-07-15 18:34:36 +08:00
|
|
|
ret = set_block_group_ro(cache, 0);
|
2010-05-16 22:46:25 +08:00
|
|
|
out:
|
|
|
|
btrfs_end_transaction(trans, root);
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
|
2011-02-17 02:57:04 +08:00
|
|
|
int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 type)
|
|
|
|
{
|
|
|
|
u64 alloc_flags = get_alloc_profile(root, type);
|
2011-04-16 04:05:44 +08:00
|
|
|
return do_chunk_alloc(trans, root, 2 * 1024 * 1024, alloc_flags,
|
|
|
|
CHUNK_ALLOC_FORCE);
|
2011-02-17 02:57:04 +08:00
|
|
|
}
|
|
|
|
|
btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 18:07:31 +08:00
|
|
|
/*
|
|
|
|
* helper to account the unused space of all the readonly block group in the
|
|
|
|
* list. takes mirrors into account.
|
|
|
|
*/
|
|
|
|
static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 free_bytes = 0;
|
|
|
|
int factor;
|
|
|
|
|
|
|
|
list_for_each_entry(block_group, groups_list, list) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
|
|
|
|
if (!block_group->ro) {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
|
|
|
|
|
|
|
free_bytes += (block_group->key.offset -
|
|
|
|
btrfs_block_group_used(&block_group->item)) *
|
|
|
|
factor;
|
|
|
|
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
return free_bytes;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper to account the unused space of all the readonly block group in the
|
|
|
|
* space_info. takes mirrors into account.
|
|
|
|
*/
|
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
u64 free_bytes = 0;
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
|
|
|
|
for(i = 0; i < BTRFS_NR_RAID_TYPES; i++)
|
|
|
|
if (!list_empty(&sinfo->block_groups[i]))
|
|
|
|
free_bytes += __btrfs_get_ro_block_group_free_space(
|
|
|
|
&sinfo->block_groups[i]);
|
|
|
|
|
|
|
|
spin_unlock(&sinfo->lock);
|
|
|
|
|
|
|
|
return free_bytes;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:25 +08:00
|
|
|
int btrfs_set_block_group_rw(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
{
|
2010-05-16 22:46:25 +08:00
|
|
|
struct btrfs_space_info *sinfo = cache->space_info;
|
|
|
|
u64 num_bytes;
|
|
|
|
|
|
|
|
BUG_ON(!cache->ro);
|
|
|
|
|
|
|
|
spin_lock(&sinfo->lock);
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
num_bytes = cache->key.offset - cache->reserved - cache->pinned -
|
|
|
|
cache->bytes_super - btrfs_block_group_used(&cache->item);
|
|
|
|
sinfo->bytes_readonly -= num_bytes;
|
|
|
|
cache->ro = 0;
|
|
|
|
spin_unlock(&cache->lock);
|
|
|
|
spin_unlock(&sinfo->lock);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* checks to see if its even possible to relocate this block group.
|
|
|
|
*
|
|
|
|
* @return - -1 if it's not a good idea to relocate this block group, 0 if its
|
|
|
|
* ok to go ahead and try.
|
|
|
|
*/
|
|
|
|
int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
{
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
|
|
|
|
struct btrfs_device *device;
|
2011-08-03 18:15:25 +08:00
|
|
|
u64 min_free;
|
2011-08-20 20:29:51 +08:00
|
|
|
u64 dev_min = 1;
|
|
|
|
u64 dev_nr = 0;
|
2011-08-03 18:15:25 +08:00
|
|
|
int index;
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
int full = 0;
|
|
|
|
int ret = 0;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/* odd, couldn't find the block group, leave it alone */
|
|
|
|
if (!block_group)
|
|
|
|
return -1;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-08-03 18:15:25 +08:00
|
|
|
min_free = btrfs_block_group_used(&block_group->item);
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/* no bytes used, we're good */
|
2011-08-03 18:15:25 +08:00
|
|
|
if (!min_free)
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
goto out;
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
space_info = block_group->space_info;
|
|
|
|
spin_lock(&space_info->lock);
|
2008-12-12 23:03:38 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
full = space_info->full;
|
2008-12-12 23:03:38 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* if this is the last block group we have in this space, we can't
|
2009-09-23 02:48:44 +08:00
|
|
|
* relocate it unless we're able to allocate a new chunk below.
|
|
|
|
*
|
|
|
|
* Otherwise, we need to make sure we have room in the space to handle
|
|
|
|
* all of the extents from this block group. If we can, we're good
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
*/
|
2009-09-23 02:48:44 +08:00
|
|
|
if ((space_info->total_bytes != block_group->key.offset) &&
|
2011-08-03 18:15:25 +08:00
|
|
|
(space_info->bytes_used + space_info->bytes_reserved +
|
|
|
|
space_info->bytes_pinned + space_info->bytes_readonly +
|
|
|
|
min_free < space_info->total_bytes)) {
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
goto out;
|
2008-12-12 23:03:38 +08:00
|
|
|
}
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
spin_unlock(&space_info->lock);
|
2008-08-05 11:17:27 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* ok we don't have enough space, but maybe we have free space on our
|
|
|
|
* devices to allocate new chunks for relocation, so loop through our
|
|
|
|
* alloc devices and guess if we have enough space. However, if we
|
|
|
|
* were marked as full, then we know there aren't enough chunks, and we
|
|
|
|
* can just return.
|
|
|
|
*/
|
|
|
|
ret = -1;
|
|
|
|
if (full)
|
|
|
|
goto out;
|
2008-08-05 11:17:27 +08:00
|
|
|
|
2011-08-03 18:15:25 +08:00
|
|
|
/*
|
|
|
|
* index:
|
|
|
|
* 0: raid10
|
|
|
|
* 1: raid1
|
|
|
|
* 2: dup
|
|
|
|
* 3: raid0
|
|
|
|
* 4: single
|
|
|
|
*/
|
|
|
|
index = get_block_group_index(block_group);
|
|
|
|
if (index == 0) {
|
|
|
|
dev_min = 4;
|
2011-08-20 20:29:51 +08:00
|
|
|
/* Divide by 2 */
|
|
|
|
min_free >>= 1;
|
2011-08-03 18:15:25 +08:00
|
|
|
} else if (index == 1) {
|
|
|
|
dev_min = 2;
|
|
|
|
} else if (index == 2) {
|
2011-08-20 20:29:51 +08:00
|
|
|
/* Multiply by 2 */
|
|
|
|
min_free <<= 1;
|
2011-08-03 18:15:25 +08:00
|
|
|
} else if (index == 3) {
|
|
|
|
dev_min = fs_devices->rw_devices;
|
2011-08-20 20:29:51 +08:00
|
|
|
do_div(min_free, dev_min);
|
2011-08-03 18:15:25 +08:00
|
|
|
}
|
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
mutex_lock(&root->fs_info->chunk_mutex);
|
|
|
|
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
|
2011-01-05 18:07:26 +08:00
|
|
|
u64 dev_offset;
|
2009-03-13 22:10:06 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
/*
|
|
|
|
* check to make sure we can actually find a chunk with enough
|
|
|
|
* space to fit our block group in.
|
|
|
|
*/
|
|
|
|
if (device->total_bytes > device->bytes_used + min_free) {
|
2011-12-08 15:07:24 +08:00
|
|
|
ret = find_free_dev_extent(device, min_free,
|
2011-01-05 18:07:26 +08:00
|
|
|
&dev_offset, NULL);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
if (!ret)
|
2011-08-03 18:15:25 +08:00
|
|
|
dev_nr++;
|
|
|
|
|
|
|
|
if (dev_nr >= dev_min)
|
2008-01-04 03:14:39 +08:00
|
|
|
break;
|
2011-08-03 18:15:25 +08:00
|
|
|
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
ret = -1;
|
2008-01-05 05:47:16 +08:00
|
|
|
}
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
mutex_unlock(&root->fs_info->chunk_mutex);
|
2007-12-22 05:27:24 +08:00
|
|
|
out:
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-12 04:11:19 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2007-12-22 05:27:24 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-12-02 22:54:17 +08:00
|
|
|
static int find_first_block_group(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, struct btrfs_key *key)
|
2008-03-25 03:01:56 +08:00
|
|
|
{
|
2008-06-26 04:01:30 +08:00
|
|
|
int ret = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
struct btrfs_key found_key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
int slot;
|
2007-12-22 05:27:24 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
2008-06-26 04:01:30 +08:00
|
|
|
goto out;
|
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-03-25 03:01:56 +08:00
|
|
|
slot = path->slots[0];
|
2007-12-22 05:27:24 +08:00
|
|
|
leaf = path->nodes[0];
|
2008-03-25 03:01:56 +08:00
|
|
|
if (slot >= btrfs_header_nritems(leaf)) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret == 0)
|
|
|
|
continue;
|
|
|
|
if (ret < 0)
|
2008-06-26 04:01:30 +08:00
|
|
|
goto out;
|
2008-03-25 03:01:56 +08:00
|
|
|
break;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, slot);
|
2007-12-22 05:27:24 +08:00
|
|
|
|
2008-03-25 03:01:56 +08:00
|
|
|
if (found_key.objectid >= key->objectid &&
|
2008-06-26 04:01:30 +08:00
|
|
|
found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2008-03-25 03:01:56 +08:00
|
|
|
path->slots[0]++;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
2008-06-26 04:01:30 +08:00
|
|
|
out:
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
2007-12-22 05:27:24 +08:00
|
|
|
}
|
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 last = 0;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_first_block_group(info, last);
|
|
|
|
while (block_group) {
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->iref)
|
|
|
|
break;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
block_group = next_block_group(info->tree_root,
|
|
|
|
block_group);
|
|
|
|
}
|
|
|
|
if (!block_group) {
|
|
|
|
if (last == 0)
|
|
|
|
break;
|
|
|
|
last = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
inode = block_group->inode;
|
|
|
|
block_group->iref = 0;
|
|
|
|
block_group->inode = NULL;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
iput(inode);
|
|
|
|
last = block_group->key.objectid + block_group->key.offset;
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
int btrfs_free_block_groups(struct btrfs_fs_info *info)
|
|
|
|
{
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-03-11 00:39:20 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2009-09-12 04:11:19 +08:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
struct rb_node *n;
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
down_write(&info->extent_commit_sem);
|
|
|
|
while (!list_empty(&info->caching_block_groups)) {
|
|
|
|
caching_ctl = list_entry(info->caching_block_groups.next,
|
|
|
|
struct btrfs_caching_control, list);
|
|
|
|
list_del(&caching_ctl->list);
|
|
|
|
put_caching_control(caching_ctl);
|
|
|
|
}
|
|
|
|
up_write(&info->extent_commit_sem);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
spin_lock(&info->block_group_cache_lock);
|
|
|
|
while ((n = rb_last(&info->block_group_cache_tree)) != NULL) {
|
|
|
|
block_group = rb_entry(n, struct btrfs_block_group_cache,
|
|
|
|
cache_node);
|
|
|
|
rb_erase(&block_group->cache_node,
|
|
|
|
&info->block_group_cache_tree);
|
2008-10-31 02:25:28 +08:00
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_write(&block_group->space_info->groups_sem);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
list_del(&block_group->list);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_write(&block_group->space_info->groups_sem);
|
2008-12-12 05:30:39 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (block_group->cached == BTRFS_CACHE_STARTED)
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_block_group_cache_done(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2011-02-02 23:53:47 +08:00
|
|
|
/*
|
|
|
|
* We haven't cached this block group, which means we could
|
|
|
|
* possibly have excluded extents on this block group.
|
|
|
|
*/
|
|
|
|
if (block_group->cached == BTRFS_CACHE_NO)
|
|
|
|
free_excluded_extents(info->extent_root, block_group);
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
btrfs_remove_free_space_cache(block_group);
|
2009-11-14 04:12:59 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
2008-10-31 02:25:28 +08:00
|
|
|
|
|
|
|
spin_lock(&info->block_group_cache_lock);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
}
|
|
|
|
spin_unlock(&info->block_group_cache_lock);
|
2009-03-11 00:39:20 +08:00
|
|
|
|
|
|
|
/* now that all the block groups are freed, go through and
|
|
|
|
* free all the space_info structs. This is only called during
|
|
|
|
* the final stages of unmount, and so we know nobody is
|
|
|
|
* using them. We call synchronize_rcu() once before we start,
|
|
|
|
* just to be on the safe side.
|
|
|
|
*/
|
|
|
|
synchronize_rcu();
|
|
|
|
|
2010-05-16 22:49:58 +08:00
|
|
|
release_global_block_rsv(info);
|
|
|
|
|
2009-03-11 00:39:20 +08:00
|
|
|
while(!list_empty(&info->space_info)) {
|
|
|
|
space_info = list_entry(info->space_info.next,
|
|
|
|
struct btrfs_space_info,
|
|
|
|
list);
|
2010-05-16 22:46:25 +08:00
|
|
|
if (space_info->bytes_pinned > 0 ||
|
2011-07-27 05:00:46 +08:00
|
|
|
space_info->bytes_reserved > 0 ||
|
|
|
|
space_info->bytes_may_use > 0) {
|
2010-05-16 22:46:25 +08:00
|
|
|
WARN_ON(1);
|
|
|
|
dump_space_info(space_info, 0, 0);
|
|
|
|
}
|
2009-03-11 00:39:20 +08:00
|
|
|
list_del(&space_info->list);
|
|
|
|
kfree(space_info);
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
static void __link_block_group(struct btrfs_space_info *space_info,
|
|
|
|
struct btrfs_block_group_cache *cache)
|
|
|
|
{
|
|
|
|
int index = get_block_group_index(cache);
|
|
|
|
|
|
|
|
down_write(&space_info->groups_sem);
|
|
|
|
list_add_tail(&cache->list, &space_info->block_groups[index]);
|
|
|
|
up_write(&space_info->groups_sem);
|
|
|
|
}
|
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
int btrfs_read_block_groups(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
2007-05-06 22:15:01 +08:00
|
|
|
struct btrfs_fs_info *info = root->fs_info;
|
2008-03-25 03:01:59 +08:00
|
|
|
struct btrfs_space_info *space_info;
|
2007-04-27 04:46:15 +08:00
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key found_key;
|
2007-10-16 04:14:19 +08:00
|
|
|
struct extent_buffer *leaf;
|
2010-06-22 02:48:16 +08:00
|
|
|
int need_clear = 0;
|
|
|
|
u64 cache_gen;
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2007-05-06 22:15:01 +08:00
|
|
|
root = info->extent_root;
|
2007-04-27 04:46:15 +08:00
|
|
|
key.objectid = 0;
|
2008-03-25 03:01:56 +08:00
|
|
|
key.offset = 0;
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_set_key_type(&key, BTRFS_BLOCK_GROUP_ITEM_KEY);
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
2011-05-13 22:32:11 +08:00
|
|
|
path->reada = 1;
|
2007-04-27 04:46:15 +08:00
|
|
|
|
2011-04-13 21:41:04 +08:00
|
|
|
cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
|
2011-10-04 02:07:49 +08:00
|
|
|
if (btrfs_test_opt(root, SPACE_CACHE) &&
|
2011-04-13 21:41:04 +08:00
|
|
|
btrfs_super_generation(root->fs_info->super_copy) != cache_gen)
|
2010-06-22 02:48:16 +08:00
|
|
|
need_clear = 1;
|
2010-09-22 02:21:34 +08:00
|
|
|
if (btrfs_test_opt(root, CLEAR_CACHE))
|
|
|
|
need_clear = 1;
|
2010-06-22 02:48:16 +08:00
|
|
|
|
2009-01-06 10:25:51 +08:00
|
|
|
while (1) {
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = find_first_block_group(root, path, &key);
|
2010-05-16 22:46:24 +08:00
|
|
|
if (ret > 0)
|
|
|
|
break;
|
2008-03-25 03:01:56 +08:00
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
2007-10-16 04:14:19 +08:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
2008-04-26 04:53:30 +08:00
|
|
|
cache = kzalloc(sizeof(*cache), GFP_NOFS);
|
2007-04-27 04:46:15 +08:00
|
|
|
if (!cache) {
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = -ENOMEM;
|
2010-05-16 22:46:25 +08:00
|
|
|
goto error;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2011-03-29 13:46:06 +08:00
|
|
|
cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!cache->free_space_ctl) {
|
|
|
|
kfree(cache);
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto error;
|
|
|
|
}
|
2007-05-08 08:03:49 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
atomic_set(&cache->count, 1);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock_init(&cache->lock);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->fs_info = info;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
INIT_LIST_HEAD(&cache->list);
|
2009-04-03 21:47:43 +08:00
|
|
|
INIT_LIST_HEAD(&cache->cluster_list);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
if (need_clear)
|
|
|
|
cache->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
|
2007-10-16 04:14:19 +08:00
|
|
|
read_extent_buffer(leaf, &cache->item,
|
|
|
|
btrfs_item_ptr_offset(leaf, path->slots[0]),
|
|
|
|
sizeof(cache->item));
|
2007-04-27 04:46:15 +08:00
|
|
|
memcpy(&cache->key, &found_key, sizeof(found_key));
|
2008-03-25 03:01:56 +08:00
|
|
|
|
2007-04-27 04:46:15 +08:00
|
|
|
key.objectid = found_key.objectid + found_key.offset;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
cache->flags = btrfs_block_group_flags(&cache->item);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->sectorsize = root->sectorsize;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
btrfs_init_free_space_ctl(cache);
|
|
|
|
|
2011-02-02 23:53:47 +08:00
|
|
|
/*
|
|
|
|
* We need to exclude the super stripes now so that the space
|
|
|
|
* info has super bytes accounted for, otherwise we'll think
|
|
|
|
* we have more space than we actually do.
|
|
|
|
*/
|
|
|
|
exclude_super_stripes(root, cache);
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
/*
|
|
|
|
* check for two cases, either we are full, and therefore
|
|
|
|
* don't need to bother with the caching work since we won't
|
|
|
|
* find any space, or we are empty, and we can just add all
|
|
|
|
* the space in and be done with it. This saves us _alot_ of
|
|
|
|
* time, particularly in the full case.
|
|
|
|
*/
|
|
|
|
if (found_key.offset == btrfs_block_group_used(&cache->item)) {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2009-09-12 04:11:20 +08:00
|
|
|
free_excluded_extents(root, cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
} else if (btrfs_block_group_used(&cache->item) == 0) {
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
|
|
|
add_new_free_space(cache, root->fs_info,
|
|
|
|
found_key.objectid,
|
|
|
|
found_key.objectid +
|
|
|
|
found_key.offset);
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(root, cache);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
}
|
2007-10-16 04:15:19 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = update_space_info(info, cache->flags, found_key.offset,
|
|
|
|
btrfs_block_group_used(&cache->item),
|
|
|
|
&space_info);
|
|
|
|
BUG_ON(ret);
|
|
|
|
cache->space_info = space_info;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_lock(&cache->space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->space_info->bytes_readonly += cache->bytes_super;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
__link_block_group(space_info, cache);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
|
|
|
ret = btrfs_add_block_group_cache(root->fs_info, cache);
|
|
|
|
BUG_ON(ret);
|
2008-10-01 07:24:06 +08:00
|
|
|
|
|
|
|
set_avail_alloc_bits(root->fs_info, cache->flags);
|
2008-11-18 10:11:30 +08:00
|
|
|
if (btrfs_chunk_readonly(root, cache->key.objectid))
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2010-05-16 22:46:24 +08:00
|
|
|
|
|
|
|
list_for_each_entry_rcu(space_info, &root->fs_info->space_info, list) {
|
|
|
|
if (!(get_alloc_profile(root, space_info->flags) &
|
|
|
|
(BTRFS_BLOCK_GROUP_RAID10 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_DUP)))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* avoid allocating from un-mirrored block group if there are
|
|
|
|
* mirrored block groups.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(cache, &space_info->block_groups[3], list)
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2010-05-16 22:46:24 +08:00
|
|
|
list_for_each_entry(cache, &space_info->block_groups[4], list)
|
2011-07-15 18:34:36 +08:00
|
|
|
set_block_group_ro(cache, 1);
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2010-05-16 22:46:25 +08:00
|
|
|
|
|
|
|
init_global_block_rsv(info);
|
2008-03-25 03:01:56 +08:00
|
|
|
ret = 0;
|
|
|
|
error:
|
2007-04-27 04:46:15 +08:00
|
|
|
btrfs_free_path(path);
|
2008-03-25 03:01:56 +08:00
|
|
|
return ret;
|
2007-04-27 04:46:15 +08:00
|
|
|
}
|
2008-03-25 03:01:59 +08:00
|
|
|
|
|
|
|
int btrfs_make_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytes_used,
|
2008-04-16 03:41:47 +08:00
|
|
|
u64 type, u64 chunk_objectid, u64 chunk_offset,
|
2008-03-25 03:01:59 +08:00
|
|
|
u64 size)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct btrfs_root *extent_root;
|
|
|
|
struct btrfs_block_group_cache *cache;
|
|
|
|
|
|
|
|
extent_root = root->fs_info->extent_root;
|
|
|
|
|
2009-03-24 22:24:20 +08:00
|
|
|
root->fs_info->last_trans_log_full_commit = trans->transid;
|
2008-09-06 04:13:11 +08:00
|
|
|
|
2008-04-26 04:53:30 +08:00
|
|
|
cache = kzalloc(sizeof(*cache), GFP_NOFS);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
if (!cache)
|
|
|
|
return -ENOMEM;
|
2011-03-29 13:46:06 +08:00
|
|
|
cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!cache->free_space_ctl) {
|
|
|
|
kfree(cache);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
|
2008-04-16 03:41:47 +08:00
|
|
|
cache->key.objectid = chunk_offset;
|
2008-03-25 03:01:59 +08:00
|
|
|
cache->key.offset = size;
|
2008-12-12 05:30:39 +08:00
|
|
|
cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->sectorsize = root->sectorsize;
|
2010-06-22 02:48:16 +08:00
|
|
|
cache->fs_info = root->fs_info;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
2008-12-12 05:30:39 +08:00
|
|
|
atomic_set(&cache->count, 1);
|
2008-07-23 11:06:41 +08:00
|
|
|
spin_lock_init(&cache->lock);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
INIT_LIST_HEAD(&cache->list);
|
2009-04-03 21:47:43 +08:00
|
|
|
INIT_LIST_HEAD(&cache->cluster_list);
|
2008-05-25 02:04:53 +08:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
btrfs_init_free_space_ctl(cache);
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
btrfs_set_block_group_used(&cache->item, bytes_used);
|
|
|
|
btrfs_set_block_group_chunk_objectid(&cache->item, chunk_objectid);
|
|
|
|
cache->flags = type;
|
|
|
|
btrfs_set_block_group_flags(&cache->item, type);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
cache->last_byte_to_unpin = (u64)-1;
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
cache->cached = BTRFS_CACHE_FINISHED;
|
2009-09-12 04:11:19 +08:00
|
|
|
exclude_super_stripes(root, cache);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
add_new_free_space(cache, root->fs_info, chunk_offset,
|
|
|
|
chunk_offset + size);
|
|
|
|
|
2009-09-12 04:11:19 +08:00
|
|
|
free_excluded_extents(root, cache);
|
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
|
|
|
|
&cache->space_info);
|
|
|
|
BUG_ON(ret);
|
2009-09-12 04:11:20 +08:00
|
|
|
|
|
|
|
spin_lock(&cache->space_info->lock);
|
2010-05-16 22:46:25 +08:00
|
|
|
cache->space_info->bytes_readonly += cache->bytes_super;
|
2009-09-12 04:11:20 +08:00
|
|
|
spin_unlock(&cache->space_info->lock);
|
|
|
|
|
2010-05-16 22:46:24 +08:00
|
|
|
__link_block_group(cache->space_info, cache);
|
2008-03-25 03:01:59 +08:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-24 01:14:11 +08:00
|
|
|
ret = btrfs_add_block_group_cache(root->fs_info, cache);
|
|
|
|
BUG_ON(ret);
|
2008-07-23 11:06:41 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
ret = btrfs_insert_item(trans, extent_root, &cache->key, &cache->item,
|
|
|
|
sizeof(cache->item));
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
2008-04-05 03:40:00 +08:00
|
|
|
set_avail_alloc_bits(extent_root->fs_info, type);
|
2008-06-26 04:01:30 +08:00
|
|
|
|
2008-03-25 03:01:59 +08:00
|
|
|
return 0;
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
|
|
|
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 group_start)
|
|
|
|
{
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
2009-06-05 03:34:51 +08:00
|
|
|
struct btrfs_free_cluster *cluster;
|
2010-06-22 02:48:16 +08:00
|
|
|
struct btrfs_root *tree_root = root->fs_info->tree_root;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
struct btrfs_key key;
|
2010-06-22 02:48:16 +08:00
|
|
|
struct inode *inode;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
int ret;
|
2010-10-15 02:52:27 +08:00
|
|
|
int factor;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
|
|
|
root = root->fs_info->extent_root;
|
|
|
|
|
|
|
|
block_group = btrfs_lookup_block_group(root->fs_info, group_start);
|
|
|
|
BUG_ON(!block_group);
|
2008-11-13 03:34:12 +08:00
|
|
|
BUG_ON(!block_group->ro);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-03-07 10:13:33 +08:00
|
|
|
/*
|
|
|
|
* Free the reserved super bytes from this block group before
|
|
|
|
* remove it.
|
|
|
|
*/
|
|
|
|
free_excluded_extents(root, block_group);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
memcpy(&key, &block_group->key, sizeof(key));
|
2010-10-15 02:52:27 +08:00
|
|
|
if (block_group->flags & (BTRFS_BLOCK_GROUP_DUP |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID1 |
|
|
|
|
BTRFS_BLOCK_GROUP_RAID10))
|
|
|
|
factor = 2;
|
|
|
|
else
|
|
|
|
factor = 1;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2009-06-05 03:34:51 +08:00
|
|
|
/* make sure this block group isn't part of an allocation cluster */
|
|
|
|
cluster = &root->fs_info->data_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* make sure this block group isn't part of a metadata
|
|
|
|
* allocation cluster
|
|
|
|
*/
|
|
|
|
cluster = &root->fs_info->meta_alloc_cluster;
|
|
|
|
spin_lock(&cluster->refill_lock);
|
|
|
|
btrfs_return_cluster_to_free_space(block_group, cluster);
|
|
|
|
spin_unlock(&cluster->refill_lock);
|
|
|
|
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
path = btrfs_alloc_path();
|
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 01:38:47 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
2011-10-02 18:56:53 +08:00
|
|
|
inode = lookup_free_space_inode(tree_root, block_group, path);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (!IS_ERR(inode)) {
|
2011-07-19 15:27:20 +08:00
|
|
|
ret = btrfs_orphan_add(trans, inode);
|
|
|
|
BUG_ON(ret);
|
2010-06-22 02:48:16 +08:00
|
|
|
clear_nlink(inode);
|
|
|
|
/* One for the block groups ref */
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->iref) {
|
|
|
|
block_group->iref = 0;
|
|
|
|
block_group->inode = NULL;
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
iput(inode);
|
|
|
|
} else {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
/* One for our lookup ref */
|
2011-09-20 00:26:24 +08:00
|
|
|
btrfs_add_delayed_iput(inode);
|
2010-06-22 02:48:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
|
|
|
key.offset = block_group->key.objectid;
|
|
|
|
key.type = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
if (ret > 0)
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
if (ret == 0) {
|
|
|
|
ret = btrfs_del_item(trans, tree_root, path);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2010-06-22 02:48:16 +08:00
|
|
|
}
|
|
|
|
|
2009-01-21 23:49:16 +08:00
|
|
|
spin_lock(&root->fs_info->block_group_cache_lock);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
rb_erase(&block_group->cache_node,
|
|
|
|
&root->fs_info->block_group_cache_tree);
|
2009-01-21 23:49:16 +08:00
|
|
|
spin_unlock(&root->fs_info->block_group_cache_lock);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
down_write(&block_group->space_info->groups_sem);
|
2009-06-05 03:34:51 +08:00
|
|
|
/*
|
|
|
|
* we must use list_del_init so people can check to see if they
|
|
|
|
* are still on the list after taking the semaphore
|
|
|
|
*/
|
|
|
|
list_del_init(&block_group->list);
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-30 02:49:05 +08:00
|
|
|
up_write(&block_group->space_info->groups_sem);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
if (block_group->cached == BTRFS_CACHE_STARTED)
|
2009-09-12 04:11:19 +08:00
|
|
|
wait_block_group_cache_done(block_group);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-14 09:29:25 +08:00
|
|
|
|
|
|
|
btrfs_remove_free_space_cache(block_group);
|
|
|
|
|
2008-11-13 03:34:12 +08:00
|
|
|
spin_lock(&block_group->space_info->lock);
|
|
|
|
block_group->space_info->total_bytes -= block_group->key.offset;
|
|
|
|
block_group->space_info->bytes_readonly -= block_group->key.offset;
|
2010-10-15 02:52:27 +08:00
|
|
|
block_group->space_info->disk_total -= block_group->key.offset * factor;
|
2008-11-13 03:34:12 +08:00
|
|
|
spin_unlock(&block_group->space_info->lock);
|
2009-07-25 04:30:55 +08:00
|
|
|
|
2010-06-22 02:48:16 +08:00
|
|
|
memcpy(&key, &block_group->key, sizeof(key));
|
|
|
|
|
2009-07-25 04:30:55 +08:00
|
|
|
btrfs_clear_space_info_full(root->fs_info);
|
2008-11-13 03:34:12 +08:00
|
|
|
|
2009-04-03 21:47:43 +08:00
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
btrfs_put_block_group(block_group);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 22:09:34 +08:00
|
|
|
|
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -EIO;
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = btrfs_del_item(trans, root, path);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-07 10:13:14 +08:00
|
|
|
int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *space_info;
|
2011-04-08 16:44:37 +08:00
|
|
|
struct btrfs_super_block *disk_super;
|
|
|
|
u64 features;
|
|
|
|
u64 flags;
|
|
|
|
int mixed = 0;
|
2011-03-07 10:13:14 +08:00
|
|
|
int ret;
|
|
|
|
|
2011-04-13 21:41:04 +08:00
|
|
|
disk_super = fs_info->super_copy;
|
2011-04-08 16:44:37 +08:00
|
|
|
if (!btrfs_super_root(disk_super))
|
|
|
|
return 1;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
features = btrfs_super_incompat_flags(disk_super);
|
|
|
|
if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
|
|
|
|
mixed = 1;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
flags = BTRFS_BLOCK_GROUP_SYSTEM;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
2011-03-07 10:13:14 +08:00
|
|
|
if (ret)
|
2011-04-08 16:44:37 +08:00
|
|
|
goto out;
|
2011-03-07 10:13:14 +08:00
|
|
|
|
2011-04-08 16:44:37 +08:00
|
|
|
if (mixed) {
|
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
} else {
|
|
|
|
flags = BTRFS_BLOCK_GROUP_METADATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
flags = BTRFS_BLOCK_GROUP_DATA;
|
|
|
|
ret = update_space_info(fs_info, flags, 0, 0, &space_info);
|
|
|
|
}
|
|
|
|
out:
|
2011-03-07 10:13:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-01-06 19:30:25 +08:00
|
|
|
int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end)
|
|
|
|
{
|
|
|
|
return unpin_extent_range(root, start, end);
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
|
2011-03-24 18:24:27 +08:00
|
|
|
u64 num_bytes, u64 *actual_bytes)
|
2011-01-06 19:30:25 +08:00
|
|
|
{
|
2011-03-24 18:24:27 +08:00
|
|
|
return btrfs_discard_extent(root, bytenr, num_bytes, actual_bytes);
|
2011-01-06 19:30:25 +08:00
|
|
|
}
|
2011-03-24 18:24:28 +08:00
|
|
|
|
|
|
|
int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_block_group_cache *cache = NULL;
|
|
|
|
u64 group_trimmed;
|
|
|
|
u64 start;
|
|
|
|
u64 end;
|
|
|
|
u64 trimmed = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
cache = btrfs_lookup_block_group(fs_info, range->start);
|
|
|
|
|
|
|
|
while (cache) {
|
|
|
|
if (cache->key.objectid >= (range->start + range->len)) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
start = max(range->start, cache->key.objectid);
|
|
|
|
end = min(range->start + range->len,
|
|
|
|
cache->key.objectid + cache->key.offset);
|
|
|
|
|
|
|
|
if (end - start >= range->minlen) {
|
|
|
|
if (!block_group_cache_done(cache)) {
|
|
|
|
ret = cache_block_group(cache, NULL, root, 0);
|
|
|
|
if (!ret)
|
|
|
|
wait_block_group_cache_done(cache);
|
|
|
|
}
|
|
|
|
ret = btrfs_trim_block_group(cache,
|
|
|
|
&group_trimmed,
|
|
|
|
start,
|
|
|
|
end,
|
|
|
|
range->minlen);
|
|
|
|
|
|
|
|
trimmed += group_trimmed;
|
|
|
|
if (ret) {
|
|
|
|
btrfs_put_block_group(cache);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
cache = next_block_group(fs_info->tree_root, cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
range->len = trimmed;
|
|
|
|
return ret;
|
|
|
|
}
|