Commit Graph

3054 Commits

Author SHA1 Message Date
Josef Bacik 96f1bb5777 Btrfs: do not overcommit if we don't have enough space for global rsv
Because of how little we allocate chunks now we can get really tight on
metadata space before we will allocate a new chunk.  This resulted in being
unable to add device extents when allocating a new metadata chunk as we did
not have enough space.  This is because we were allowed to overcommit too
much metadata without actually making sure we had enough space to make
allocations.  The idea behind overcommit is that we are allowed to say "sure
you can have that reservation" when most of the free space is occupied by
reservations, not actual allocations.  But in this case where a majority of
the total space is in use by actual allocations we can screw ourselves by
not being able to make real allocations when it matters.  So make sure we
have enough real space for our global reserve, and if not then don't allow
overcommitting.  Thanks,

Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:13 -05:00
Josef Bacik 0f5d42b287 Btrfs: remove extent mapping if we fail to add chunk
I got a double free error when unmounting a file system that failed to add a
chunk during its operation.  This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks.  So to fix
this just remove the extent_map when we error out so we don't run into this
problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:11 -05:00
Josef Bacik 0448748849 Btrfs: fix chunk allocation error handling
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists.  This will
cause BUG_ON()'s in the bio submission path.  This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit.  Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:10 -05:00
Miao Xie 87533c4751 Btrfs: use bit operation for ->fs_state
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
	Task0 - CPU0		Task1 - CPU1
	mov %fs_state rax
	or $0x1 rax
				mov %fs_state rax
				or $0x2 rax
	mov rax %fs_state
				mov rax %fs_state
The expected value is 3, but in fact, it is 2.

Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.

Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:09 -05:00
Miao Xie de98ced9e7 Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits
There is no lock to protect
  fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:08 -05:00
Miao Xie df0af1a57f Btrfs: use the inode own lock to protect its delalloc_bytes
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:06 -05:00
Miao Xie 963d678b0f Btrfs: use percpu counter for fs_info->delalloc_bytes
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.

This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:05 -05:00
Miao Xie e2d845211e Btrfs: use percpu counter for dirty metadata count
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.

This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:04 -05:00
Miao Xie c018daecea Btrfs: protect fs_info->alloc_start
fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)

For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
	Task0 			Task1
				load high 32 bits
	set high 32 bits
	set low 32 bits
				load low 32 bits

Task1 will get 0.

This patch fixes this problem by using two locks to protect it
	fs_info->chunk_mutex
	sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:02 -05:00
Miao Xie 8c6a3ee6db Btrfs: add a comment for fs_info->max_inline
Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:01 -05:00
Filipe Brandenburger 55e301fd57 Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:28 -05:00
Kusanagi Kouichi 82b22ac8f6 Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS
CAP_DAC_READ_SEARCH overrides read and search permission check on
file and directory. It seems fit for BTRFS_IOC_INO_PATHS.

Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:27 -05:00
Josef Bacik fe5fafbebd Revert "Btrfs: fix permissions of empty files not affected by umask"
This reverts commit 2794ed013b.

Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:26 -05:00
Miao Xie 5b947f1ba9 Btrfs: don't traverse the ordered operation list repeatedly
btrfs_run_ordered_operations() needn't traverse the ordered operation list
repeatedly, it is because the transaction commiter will invoke it again when
there is no other writer in this transaction, it can ensure that no one can
add new objects into the ordered operation list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:24 -05:00
Miao Xie 63607cc86a Btrfs: traverse and flush the delalloc inodes once
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:23 -05:00
Miao Xie eebc608406 Btrfs: check the return value of btrfs_run_ordered_operations()
We forget to check the return value of btrfs_run_ordered_operations() when
flushing all the pending stuffs, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:22 -05:00
Miao Xie 3edb2a68cb Btrfs: check the return value of btrfs_start_delalloc_inodes()
We forget to check the return value of btrfs_start_delalloc_inodes(), fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:21 -05:00
Miao Xie e6ec716f0d Btrfs: make raid attr array more readable
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:19 -05:00
Liu Bo a1897fddd2 Btrfs: record first logical byte in memory
This'd save us a rbtree search which may become expensive in large filesystem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:18 -05:00
Liu Bo 39f9d028c9 Btrfs: save us a read_lock
This does not change the logic of code, but can save us a read_lock.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:17 -05:00
Liu Bo 51fab69347 Btrfs: use token to avoid times mapping extent buffer
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.

function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:15 -05:00
Liu Bo dcfac4156f Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay
Argument 'trans' is not used any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:14 -05:00
Liu Bo c53d613e52 Btrfs: kill unused argument of update_block_group
Argument 'trans' is not used any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:13 -05:00
Liu Bo f6373bf3dc Btrfs: kill unused arguments of cache_block_group
Argument 'trans' and 'root' are not used any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:11 -05:00
Liu Bo 17b85495cf Btrfs: remove deprecated comments
commit d53ba47484
(Btrfs: use commit root when loading free space cache) has remove
the deadlock check, and the related comments can be removed as well.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:10 -05:00
Josef Bacik c6b305a89b Btrfs: don't re-enter when allocating a chunk
If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry.  The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry.  Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:09 -05:00
Josef Bacik 2ab28f322f Btrfs: wait on ordered extents at the last possible moment
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed.  So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:04 -05:00
Miao Xie dfd79829b7 Btrfs: fix trivial error in btrfs_ioctl_resize()
This patch fixes the following problem:
- improper return value
- unnecessary read-only check

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:44 -05:00
Miao Xie 4eee4fa4f8 Btrfs: use wrapper page_offset
Use wrapper page_offset to get byte-offset into filesystem object for page.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:43 -05:00
Miao Xie da633a4217 Btrfs: flush all dirty inodes if writeback can not start
We may try to flush some dirty pages when there is no enough space to reserve.
But it is possible that this operation fails, in order to get enough space to
reserve successfully, we will sync all the delalloc file. This operation is
safe, we needn't worry about the case that the filesystem goes from r/w to r/o.
because the filesystem should guarantee all the dirty pages have been written
into the disk after it becomes readonly, so the sync operation will do nothing
if the filesystem is already readonly. Though it may waste lots of time,
as a corner case, we needn't care.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:42 -05:00
Miao Xie 093486c453 Btrfs: make delayed ref lock logic more readable
Locking and unlocking delayed ref mutex are in the different functions,
and the name of lock functions is not uniform, so the readability is not
so good, this patch optimizes the lock logic and makes it more readable.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:41 -05:00
Miao Xie 0e8c36a9fd Btrfs: fix lots of orphan inodes when the space is not enough
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.

Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:39 -05:00
Miao Xie 4ea41ce07d Btrfs: cleanup similar code in delayed inode
The delayed item commit code in several functions is similar, so
cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:36:38 -05:00
Miao Xie 7892b5afe4 Btrfs: use common work instead of delayed work
Since we do not want to delay the async transaction commit, we should
use common work, not delayed work.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-02-20 09:36:37 -05:00
Miao Xie 7b5a1c5310 Btrfs: cleanup unnecessary clear when freeing a transaction or a trans handle
We clear the transaction object and the trans handle when they are about to be
freed, it is unnecessary, cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-02-20 09:36:35 -05:00
Miao Xie 78a6184a3f Btrfs: use slabs for delayed reference allocation
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.

And besides that, it can do check for leaked objects when the module is removed.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-02-20 09:36:34 -05:00
David Sterba 6f60cbd3ae btrfs: access superblock via pagecache in scan_one_device
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages.  It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.

Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too.  When this
happens, we drop some of the important blocks written by mkfs.

This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread.  This way we don't have to care
about the blocksize of the device.

This also drops the invalidate_bdev() call.  It wasn't very polite to
invalidate during the scan either.  mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-15 16:57:47 -05:00
Arne Jansen 2a745b14bc Btrfs: fix crash in log replay with qgroups enabled
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-14 20:47:41 -05:00
Linus Torvalds 8d19514fad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've got corner cases for updating i_size that ceph was hitting,
  error handling for quotas when we run out of space, a very subtle
  snapshot deletion race, a crash while removing devices, and one
  deadlock between subvolume creation and the sb_internal code (thanks
  lockdep)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: move d_instantiate outside the transaction during mksubvol
  Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
  Btrfs: fix possible stale data exposure
  Btrfs: fix missing i_size update
  Btrfs: fix race between snapshot deletion and getting inode
  Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
  Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
  Btrfs: do not merge logged extents if we've removed them from the tree
  btrfs: don't try to notify udev about missing devices
2013-02-08 12:06:46 +11:00
Chris Mason 1a65e24b0b Btrfs: move d_instantiate outside the transaction during mksubvol
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.

btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.

This commit moves the d_instantiate after the transaction closes.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 12:11:10 -05:00
Jan Schmidt eb6b88d92c Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.

Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.

Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 09:24:40 -05:00
Chris Mason 24f8ebe918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus 2013-02-05 19:24:44 -05:00
Josef Bacik 59fe4f4197 Btrfs: fix possible stale data exposure
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data.  The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size.  Fix this by
checking if the extent ends past the disk_i_size.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:16 -05:00
Josef Bacik 5d1f40202b Btrfs: fix missing i_size update
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data.  The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update.  So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:14 -05:00
Liu Bo 6f1c36055f Btrfs: fix race between snapshot deletion and getting inode
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().

And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.

Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.

(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0).  So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.

Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.

So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:13 -05:00
Miao Xie 843fcf3573 Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:11 -05:00
Miao Xie 0a3404dcff Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:10 -05:00
Josef Bacik 222c81dc38 Btrfs: do not merge logged extents if we've removed them from the tree
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it.  The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:03 -05:00
Chris Mason 0e4e026366 Merge branch 'for-linus' into raid56-experimental
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:04:03 -05:00
Chris Mason 1f0905ec15 Btrfs: remove conflicting check for minimum number of devices in raid56
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:01:42 -05:00
Tomasz Torcz 10e78e3a8a Btrfs: select XOR_BLOCKS in Kconfig
The Btrfs raid56 uses the generic xor helpers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 09:55:30 -05:00
Chris Mason bb721703aa Btrfs: reduce CPU contention while waiting for delayed extent operations
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.

It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.

The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.

This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself.  All the extra contention just slows
down the operations and doesn't get things done any faster.

This commit changes things to use a wait queue instead.  As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:25 -05:00
Chris Mason 242e18c7c1 Btrfs: reduce lock contention on extent buffer locks
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree.  On tree roots and
other extent buffers that very cache hot, this can be highly contended.

These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:25 -05:00
Chris Mason 8de972b4fa Btrfs: fix cluster alignment for mount -o ssd
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:24 -05:00
Chris Mason 6ac0f4884e Btrfs: add a plugging callback to raid56 writes
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.

This adds a plugging callback to gather those smaller writes full stripe
writes.

Example on flash:

fio job to do 64K writes in batches of 3 (which makes a full stripe):

With plugging: 450MB/s
Without plugging: 220MB/s

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:24 -05:00
Chris Mason 4ae10b3a13 Btrfs: Add a stripe cache to raid56
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk.  Pages are cached when:

* They are read in during a read/modify/write cycle

* They are written during a read/modify/write cycle

* They are involved in a parity rebuild

Pages are not cached if we're doing a full stripe write.  We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.

This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.

The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.

Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct

Without the stripe cache  -- 2.1MB/s
With the stripe cache 21MB/s

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
David Woodhouse 53b381b3ab Btrfs: RAID5 and RAID6
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio.  This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle.  This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk.  This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
David Woodhouse 64a167011b Btrfs: add rw argument to merge_bio_hook()
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:49:47 -05:00
Eric Sandeen 3c91160808 btrfs: don't try to notify udev about missing devices
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:47:37 -05:00
Linus Torvalds d7df025eb4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "It turns out that we had two crc bugs when running fsx-linux in a
  loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
  all down.  Miao also has a new OOM fix in this v2 pull as well.

  Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
  and resuming a running balance across drives.

  Josef's orphan truncate patch fixes an obscure corruption we'd see
  during xfstests.

  Arne's patches address problems with subvolume quotas.  If the user
  destroys quota groups incorrectly the FS will refuse to mount.

  The rest are smaller fixes and plugs for memory leaks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
  Btrfs: fix repeated delalloc work allocation
  Btrfs: fix wrong max device number for single profile
  Btrfs: fix missed transaction->aborted check
  Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
  Btrfs: put csums on the right ordered extent
  Btrfs: use right range to find checksum for compressed extents
  Btrfs: fix panic when recovering tree log
  Btrfs: do not allow logged extents to be merged or removed
  Btrfs: fix a regression in balance usage filter
  Btrfs: prevent qgroup destroy when there are still relations
  Btrfs: ignore orphan qgroup relations
  Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
  Btrfs: fix unlock order in btrfs_ioctl_rm_dev
  Btrfs: fix unlock order in btrfs_ioctl_resize
  Btrfs: fix "mutually exclusive op is running" error code
  Btrfs: bring back balance pause/resume logic
  btrfs: update timestamps on truncate()
  btrfs: fix btrfs_cont_expand() freeing IS_ERR em
  Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
  Btrfs: fix off-by-one in lseek
  ...
2013-01-25 10:55:21 -08:00
Miao Xie 1eafa6c737 Btrfs: fix repeated delalloc work allocation
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.

Fix this problem by splice this delalloc list.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:27 -05:00
Miao Xie c9f01bfe0c Btrfs: fix wrong max device number for single profile
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:26 -05:00
Miao Xie 2cba30f172 Btrfs: fix missed transaction->aborted check
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.

Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.

So we need more check for ->aborted when committing the transaction. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:25 -05:00
Miao Xie 8d25a086eb Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:23 -05:00
Josef Bacik e58dd74bcc Btrfs: put csums on the right ordered extent
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent.  This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent.  Without this we could end up with csums for
bytenrs that don't have extents to cover them yet.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:22 -05:00
Liu Bo 192000dda2 Btrfs: use right range to find checksum for compressed extents
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:17 -05:00
Josef Bacik b0175117b9 Btrfs: fix panic when recovering tree log
A user reported a BUG_ON(ret) that occured during tree log replay.  Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry.  We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON().  The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility.  Thanks,

Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:49 -05:00
Josef Bacik 201a903894 Btrfs: do not allow logged extents to be merged or removed
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:48 -05:00
Ilya Dryomov a105bb88f4 Btrfs: fix a regression in balance usage filter
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input.  Bring it back.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:40:27 -05:00
Chris Mason 83bfccb5c0 Merge branch 'mutex-ops@next-for-chris' of git://github.com/idryomov/btrfs-unstable into linus 2013-01-21 20:39:06 -05:00
Chris Mason daf2c08911 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus 2013-01-21 20:26:55 -05:00
Arne Jansen 2cf6870396 Btrfs: prevent qgroup destroy when there are still relations
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Arne Jansen ff24858c65 Btrfs: ignore orphan qgroup relations
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Ilya Dryomov 25122d15e2 Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:22 +02:00
Ilya Dryomov 4ac20c70b0 Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Fix unlock order in btrfs_ioctl_rm_dev().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:21 +02:00
Ilya Dryomov 18f39c416d Btrfs: fix unlock order in btrfs_ioctl_resize
Fix unlock order in btrfs_ioctl_resize().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:20 +02:00
Ilya Dryomov 2c0c9da02a Btrfs: fix "mutually exclusive op is running" error code
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add.  Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading.  Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:18 +02:00
Ilya Dryomov ed0fb78fb6 Btrfs: bring back balance pause/resume logic
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge).  Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away.  Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:17 +02:00
Eric Sandeen 3972f2603d btrfs: update timestamps on truncate()
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:37 -05:00
Zach Brown f276795627 btrfs: fix btrfs_cont_expand() freeing IS_ERR em
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.

An instance of -EEXIST was reported in the wild:

  https://bugzilla.redhat.com/show_bug.cgi?id=874407

I have no idea if that -EEXIST is surprising, or not.  Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).

This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:23 -05:00
Liu Bo f9e4fb5393 Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
xfstests case 285 complains.

It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo 1214b53f90 Btrfs: fix off-by-one in lseek
Lock end is inclusive.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo 3268a2468e Btrfs: reset path lock state to zero
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.

Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:53 -05:00
Liu Bo ac5c93005b Btrfs: let allocation start from the right raid type
This'd avoid us empty looping.

Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik f3fe820c20 Btrfs: add orphan before truncating pagecache
Running xfstests 83 in a loop would sometimes fail the fsck.  This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up.  The problem with this is there is plenty of time
for the truncate to fail after we've done this work.  So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe.  This
fixes the btrfsck failures I was seeing while running 83 in a loop.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik 72bcd99d45 Btrfs: set flushing if we're limited flushing
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie 9754767657 Btrfs: fix missing write access release in btrfs_ioctl_resize()
We forget to give up the write access after we find some device operation
is going on. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie dba60f3f5d Btrfs: fix resize a readonly device
We should not resize a readonly device, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:49 -05:00
Miao Xie 5c39da5b6c Btrfs: do not delete a subvolume which is in a R/O subvolume
Step to reproduce:
 # mkfs.btrfs <disk>
 # mount <disk> <mnt>
 # btrfs sub create <mnt>/subv0
 # btrfs sub snap <mnt> <mnt>/subv0/snap0
 # change <mnt>/subv0 from R/W to R/O
 # btrfs sub del <mnt>/subv0/snap0

We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:32 -05:00
Miao Xie d86e56cf7d Btrfs: disable qgroup id 0
Qgroup id 0 is a special number, we should set the id of a qgroup to 0.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:31 -05:00
Lukas Czerner cc975eb460 btrfs: get the device in write mode when deleting it
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-01-14 13:52:31 -05:00
Tsutomu Itoh cfa7a9ccda Btrfs: fix memory leak in name_cache_insert()
We should free name_cache_entry before returning from the
error handling code.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2013-01-14 13:52:30 -05:00
Linus Torvalds 1f0377ff08 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS update from Al Viro:
 "fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
  misc stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
  vfs: make lremovexattr retry once on ESTALE error
  vfs: make removexattr retry once on ESTALE
  vfs: make llistxattr retry once on ESTALE error
  vfs: make listxattr retry once on ESTALE error
  vfs: make lgetxattr retry once on ESTALE
  vfs: make getxattr retry once on an ESTALE error
  vfs: allow lsetxattr() to retry once on ESTALE errors
  vfs: allow setxattr to retry once on ESTALE errors
  vfs: allow utimensat() calls to retry once on an ESTALE error
  vfs: fix user_statfs to retry once on ESTALE errors
  vfs: make fchownat retry once on ESTALE errors
  vfs: make fchmodat retry once on ESTALE errors
  vfs: have chroot retry once on ESTALE error
  vfs: have chdir retry lookup and call once on ESTALE error
  vfs: have faccessat retry once on an ESTALE error
  vfs: have do_sys_truncate retry once on an ESTALE error
  vfs: fix renameat to retry on ESTALE errors
  vfs: make do_unlinkat retry once on ESTALE errors
  vfs: make do_rmdir retry once on ESTALE errors
  vfs: add a flags argument to user_path_parent
  ...
2012-12-20 18:14:31 -08:00
Linus Torvalds 1ca22254b3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull two btrfs reverts from Chris Mason:
 "I had missed that for two of the patches in my last pull, we had
  included different fixes during 3.7."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
  Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
2012-12-20 13:57:09 -08:00
Jeff Layton 39e3c9553f vfs: remove DCACHE_NEED_LOOKUP
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.

Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:36 -05:00
Chris Mason 57ba86c00f Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
This reverts commit 6a7a665d78.

This was bug was fixed differently in 3.6, so this commit
isn't needed.

Conflicts:
	fs/btrfs/ctree.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-18 19:35:32 -05:00
Chris Mason 4c3e696981 Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
This reverts commit 95c80bb1f6.

The bug addressed by this commit was fixed differently back in 3.6

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-18 15:43:18 -05:00
Linus Torvalds a22180d266 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "A big set of fixes and features.

  In terms of line count, most of the code comes from Stefan, who added
  the ability to replace a single drive in place.  This is different
  from how btrfs normally replaces drives, and is much much much faster.

  Josef is plowing through our synchronous write performance.  This pull
  request does not include the DIO_OWN_WAITING patch that was discussed
  on the list, but it has a number of other improvements to cut down our
  latencies and CPU time during fsync/O_DIRECT writes.

  Miao Xie has a big series of fixes and is spreading out ordered
  operations over more CPUs.  This improves performance and reduces
  contention.

  I've put in fixes for error handling around hash collisions.  These
  are going back to individual stable kernels as I test against them.

  Otherwise we have a lot of fixes and cleanups, thanks everyone!
  raid5/6 is being rebased against the device replacement code.  I'll
  have it posted this Friday along with a nice series of benchmarks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
  Btrfs: fix a bug of per-file nocow
  Btrfs: fix hash overflow handling
  Btrfs: don't take inode delalloc mutex if we're a free space inode
  Btrfs: fix autodefrag and umount lockup
  Btrfs: fix permissions of empty files not affected by umask
  Btrfs: put raid properties into global table
  Btrfs: fix BUG() in scrub when first superblock reading gives EIO
  Btrfs: do not call file_update_time in aio_write
  Btrfs: only unlock and relock if we have to
  Btrfs: use tokens where we can in the tree log
  Btrfs: optimize leaf_space_used
  Btrfs: don't memset new tokens
  Btrfs: only clear dirty on the buffer if it is marked as dirty
  Btrfs: move checks in set_page_dirty under DEBUG
  Btrfs: log changed inodes based on the extent map tree
  Btrfs: add path->really_keep_locks
  Btrfs: do not mark ems as prealloc if we are writing to them
  Btrfs: keep track of the extents original block length
  Btrfs: inline csums if we're fsyncing
  Btrfs: don't bother copying if we're only logging the inode
  ...
2012-12-18 09:42:05 -08:00
Andrew Morton 965c8e59cf lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:12 -08:00
Liu Bo 213490b301 Btrfs: fix a bug of per-file nocow
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found    ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts

The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-17 14:48:21 -05:00