In addition to properly handling allocation failure from btrfs_alloc_path, I
also fixed up the kzalloc error handling code immediately below it.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.
do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error. All callers of posix_acl_create_masq()
switched.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified
clone or with NULL if that has failed; returns 0 or -ve on error. All
callers of posix_acl_chmod_masq() switched to that - they'd been doing
exactly the same thing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This moves logic for checking the cached ACL values from low-level
filesystems into generic code. The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.
The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We used to store the checksums of the space cache directly in the space cache,
however that doesn't work out too well if we have more space than we can fit the
checksums into the first page. So instead use the normal checksumming
infrastructure. There were problems with doing this originally but those
problems don't exist now so this works out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We keep having problems with early enospc, and that's because our method of
making space is inherently racy. The problem is we can have one guy trying to
make space for himself, and in the meantime people come in and steal his
reservation. In order to stop this we make a waitqueue and put anybody who
comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
make more space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have to do weird things when handling enospc in the transaction joining code.
Because we've already joined the transaction we cannot commit the transaction
within the reservation code since it will deadlock, so we have to return EAGAIN
and then make sure we don't retry too many times. Instead of doing this, just
do the reservation the normal way before we join the transaction, that way we
can do whatever we want to try and reclaim space, and then if it fails we know
for sure we are out of space and we can return ENOSPC. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been watching how many btrfs_search_slot()'s we do and I noticed that when
we create a file with selinux enabled we were doing 2 each time we initialize
the security context. That's because we lookup the xattr first so we can delete
it if we're setting a new value to an existing xattr. But in the create case we
don't have any xattrs, so it is completely useless to have the extra lookup. So
re-arrange things so that we only lookup first if we specifically have
XATTR_REPLACE. That way in the basic case we only do 1 search, and in the more
complicated case we do the normal 2 lookups. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix oops when doing space balance
Btrfs: don't panic if we get an error while balancing V2
btrfs: add missing options displayed in mount output
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up. This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks,
Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are three missed mount options settable by user which are not
currently displayed in mount output.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix inconsonant inode information
Btrfs: make sure to update total_bitmaps when freeing cache V3
Btrfs: fix type mismatch in find_free_extent()
Btrfs: make sure to record the transid in new inodes
When iputting the inode, We may leave the delayed nodes if they have some
delayed items that have not been dealt with. So when the inode is read again,
we must look up the relative delayed node, and use the information in it to
initialize the inode. Or we will get inconsonant inode information, it may
cause that the same directory index number is allocated again, and hit the
following oops:
[ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
[ 5447.569766] ------------[ cut here ]------------
[ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
[SNIP]
[ 5447.790721] Call Trace:
[ 5447.793191] [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
[ 5447.800156] [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
[ 5447.806517] [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
[ 5447.812876] [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
[ 5447.818961] [<ffffffff8111f840>] vfs_create+0x72/0x92
[ 5447.824090] [<ffffffff8111fa8c>] do_last+0x22c/0x40b
[ 5447.829133] [<ffffffff8112076a>] path_openat+0xc0/0x2ef
[ 5447.834438] [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
[ 5447.841216] [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
[ 5447.847846] [<ffffffff81121a79>] do_filp_open+0x3d/0x87
[ 5447.853156] [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
[ 5447.859072] [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
[ 5447.864636] [<ffffffff8111f179>] ? do_getname+0x14b/0x173
[ 5447.870112] [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
[ 5447.875682] [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
[ 5447.880882] [<ffffffff81112d39>] do_sys_open+0x69/0xae
[ 5447.886153] [<ffffffff81112db1>] sys_open+0x20/0x22
[ 5447.891114] [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
Fix it by reusing the old delayed node.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported this bug again where we have more bitmaps than we are supposed
to. This is because we failed to load the free space cache, but don't update
the ctl->total_bitmaps counter when we remove entries from the tree. This patch
fixes this problem and we should be good to go again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
data parameter should be u64 because a full-sized chunk flags field is
passed instead of 0/1 for distinguishing data from metadata. All
underlying functions expect u64.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we create a new inode, we aren't filling in the
field that records the transaction that last changed this
inode.
If we then go to fsync that inode, it will be skipped because the field
isn't filled in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was pointed out by 'make versioncheck' that some includes of
linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and
fs/omfs/file.c).
This patch removes them.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: avoid delayed metadata items during commits
btrfs: fix uninitialized return value
btrfs: fix wrong reservation when doing delayed inode operations
btrfs: Remove unused sysfs code
btrfs: fix dereference of ERR_PTR value
Btrfs: fix relocation races
Btrfs: set no_trans_join after trying to expand the transaction
Btrfs: protect the pending_snapshots list with trans_lock
Btrfs: fix path leakage on subvol deletion
Btrfs: drop the delalloc_bytes check in shrink_delalloc
Btrfs: check the return value from set_anon_super
Snapshot creation has two phases. One is the initial snapshot setup,
and the second is done during commit, while nobody is allowed to modify
the root we are snapshotting.
The delayed metadata insertion code can break that rule, it does a
delayed inode update on the inode of the parent of the snapshot,
and delayed directory item insertion.
This makes sure to run the pending delayed operations before we
record the snapshot root, which avoids corruptions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When allocation fails in btrfs_read_fs_root_no_name, ret is not set
although it is returned, holding a garbage value.
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.
This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can lockup if we try to allow new writers join the transaction and we have
flushoncommit set or have a pending snapshot. This is because we set
no_trans_join and then loop around and try to wait for ordered extents again.
The problem is the ordered endio stuff needs to join the transaction, which it
can't do because no_trans_join is set. So instead wait until after this loop to
set no_trans_join and then make sure to wait for num_writers == 1 in case
anybody got started in between us exiting the loop and setting no_trans_join.
This could easily be reproduced by mounting -o flushoncommit and running xfstest
13. It cannot be reproduced with this patch. Thanks,
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently there is nothing protecting the pending_snapshots list on the
transaction. We only hold the directory mutex that we are snapshotting and a
read lock on the subvol_sem, so we could race with somebody else creating a
snapshot in a different directory and end up with list corruption. So protect
this list with the trans_lock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The delayed ref patch accidently removed the btrfs_free_path in
btrfs_unlink_subvol, this puts it back and means we don't leak a path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: use join_transaction in btrfs_evict_inode()
Btrfs - use %pU to print fsid
Btrfs: fix extent state leak on failed nodatasum reads
btrfs: fix unlocked access of delalloc_inodes
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
Btrfs: clear current->journal_info on async transaction commit
Btrfs: make sure to recheck for bitmaps in clusters
btrfs: remove unneeded includes from scrub.c
btrfs: reinitialize scrub workers
btrfs: scrub: errors in tree enumeration
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: unlock the trans lock properly
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: fix duplicate checking logic
Btrfs: fix the allocator loop logic
Btrfs: fix bitmap regression
Btrfs: don't commit the transaction if we dont have enough pinned bytes
Btrfs: noinline the cluster searching functions
Btrfs: cache bitmaps when searching for a cluster
The WARN_ON() in start_transaction() was triggered while balancing.
The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Get rid of FIXME comment. Uuids from dmesg are now the same as uuids
given by btrfs-progs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about
BUG extent_state: Objects remaining on kmem_cache_close()
on rmmod.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around, this may produce corrupted list head, accessed in many placess,
The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
builds. This shrinks its size to 128 bytes allowing it to fit into one
fewer cache lines and allows more objects per slab in its kmem_cache.
slabinfo extent_buffer reports :-
before:-
Sizes (bytes) Slabs
----------------------------------
Object : 136 Total : 123
SlabObj: 136 Full : 121
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 30
after :-
Object : 128 Total : 4
SlabObj: 128 Full : 2
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 32
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Normally current->jouranl_info is cleared by commit_transaction. For an
async snap or subvol creation, though, it runs in a work queue. Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace. When the actual commit runs in
the other thread it won't care that it's current->journal_info is already
NULL.
Signed-off-by: Sage Weil <sage@newdream.net>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef recently changed the free extent cache to look in
the block group cluster for any bitmaps before trying to
add a new bitmap for the same offset. This avoids BUG_ON()s due
covering duplicate ranges.
But it didn't go quite far enough. A given free range might span
between one or more bitmaps or free space entries. The code has
looping to cover this, but it doesn't check for clustered bitmaps
every time.
This shuffles our gotos to check for a bitmap in the cluster
for every new bitmap entry we try to add.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_wait_for_commit if we came upon a transaction that had committed we
just exited, but that's bad since we are holding the trans_lock. So break
instead so that the lock is dropped. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
When merging my code into the integration test the second check for duplicate
entries got screwed up. This patch fixes it by dropping ret2 and just using ret
for the return value, and checking if we got an error before adding the bitmap
to the local list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics. This was because our loop logic was a little
confused. So this is what I did
1) Make the loop variable the ultimate decider on wether we should loop again
isntead of checking to see if we had an uncached bg, empty size or empty
cluster.
2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.
3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk. If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.
This kept me from hitting panics while trying to reproduce the other problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In cleaning up the clustering code I accidently introduced a regression by
adding bitmap entries to the cluster rb tree. The problem is if we've maxed out
the number of bitmaps we can have for the block group we can only add free space
to the bitmaps, but since the bitmap is on the cluster we can't find it and we
try to create another one. This would result in a panic because the total
bitmaps was bigger than the max bitmaps that were allowed. This patch fixes
this by checking to see if we have a cluster, and then looking at the cluster rb
tree to see if it has a bitmap entry and if it does and that space belongs to
that bitmap, go ahead and add it to that bitmap.
I could hit this panic every time with an fs_mark test within a couple of
minutes. With this patch I no longer hit the panic and fs_mark goes to
completion. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When profiling the find cluster code it's hard to tell where we are spending our
time because the bitmap and non-bitmap functions get inlined by the compiler, so
make that not happen. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we are looking for a cluster in a particularly sparse or fragmented block
group, we will do a lot of looping through the free space tree looking for
various things, and if we need to look at bitmaps we will endup doing the whole
dance twice. So instead add the bitmap entries to a temporary list so if we
have to do the bitmap search we can just look through the list of entries we've
found quickly instead of having to loop through the entire tree again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
With Linus' tree, today's linux-next build (powercp ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the returned value from
btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not
updated and several call paths may hit a BUG_ON or fail with strange
code.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening just return 0 if refs is 0 (and we're not the tree root since tree
root always has refs of 0). With this fix 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Maosn <chris.mason@oracle.com>
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
Btrfs: use the device_list_mutex during write_dev_supers
Btrfs: setup free ino caching in a more asynchronous way
btrfs scrub: don't coalesce pages that are logically discontiguous
Btrfs: return -ENOMEM in clear_extent_bit
Btrfs: add mount -o auto_defrag
Btrfs: using rcu lock in the reader side of devices list
Btrfs: drop unnecessary device lock
Btrfs: fix the race between remove dev and alloc chunk
Btrfs: fix the race between reading and updating devices
Btrfs: fix bh leak on __btrfs_open_devices path
Btrfs: fix unsafe usage of merge_state
Btrfs: allocate extent state and check the result properly
fs/btrfs: Add missing btrfs_free_path
Btrfs: check return value of btrfs_inc_extent_ref()
Btrfs: return error to caller if read_one_inode() fails
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Btrfs: return error code to caller when btrfs_del_item fails
Btrfs: return error code to caller when btrfs_previous_item fails
btrfs: fix typo 'testeing' -> 'testing'
btrfs: typo: 'btrfS' -> 'btrfs'
...
write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.
This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.
Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For a filesystem that has lots of files in it, the first time we mount
it with free ino caching support, it can take quite a long time to
setup the caching before we can create new files.
Here we fill the cache with [highest_ino, BTRFS_LAST_FREE_OBJECTID]
before we start the caching thread to search through the extent tree.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will detect small random writes into files and
queue the up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
xen: cleancache shim to Xen Transcendent Memory
ocfs2: add cleancache support
ext4: add cleancache support
btrfs: add cleancache support
ext3: add cleancache support
mm/fs: add hooks to support cleancache
mm: cleancache core ops functions and config
fs: add field to superblock to support cleancache
mm/fs: cleancache documentation
Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>