Because of how little we allocate chunks now we can get really tight on
metadata space before we will allocate a new chunk. This resulted in being
unable to add device extents when allocating a new metadata chunk as we did
not have enough space. This is because we were allowed to overcommit too
much metadata without actually making sure we had enough space to make
allocations. The idea behind overcommit is that we are allowed to say "sure
you can have that reservation" when most of the free space is occupied by
reservations, not actual allocations. But in this case where a majority of
the total space is in use by actual allocations we can screw ourselves by
not being able to make real allocations when it matters. So make sure we
have enough real space for our global reserve, and if not then don't allow
overcommitting. Thanks,
Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I got a double free error when unmounting a file system that failed to add a
chunk during its operation. This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks. So to fix
this just remove the extent_map when we error out so we don't run into this
problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists. This will
cause BUG_ON()'s in the bio submission path. This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit. Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.
This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)
For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
Task0 Task1
load high 32 bits
set high 32 bits
set low 32 bits
load low 32 bits
Task1 will get 0.
This patch fixes this problem by using two locks to protect it
fs_info->chunk_mutex
sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
CAP_DAC_READ_SEARCH overrides read and search permission check on
file and directory. It seems fit for BTRFS_IOC_INO_PATHS.
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This reverts commit 2794ed013b.
Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_run_ordered_operations() needn't traverse the ordered operation list
repeatedly, it is because the transaction commiter will invoke it again when
there is no other writer in this transaction, it can ensure that no one can
add new objects into the ordered operation list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_run_ordered_operations() when
flushing all the pending stuffs, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_start_delalloc_inodes(), fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd save us a rbtree search which may become expensive in large filesystem.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This does not change the logic of code, but can save us a read_lock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.
function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit d53ba47484
(Btrfs: use commit root when loading free space cache) has remove
the deadlock check, and the related comments can be removed as well.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry. The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry. Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch fixes the following problem:
- improper return value
- unnecessary read-only check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may try to flush some dirty pages when there is no enough space to reserve.
But it is possible that this operation fails, in order to get enough space to
reserve successfully, we will sync all the delalloc file. This operation is
safe, we needn't worry about the case that the filesystem goes from r/w to r/o.
because the filesystem should guarantee all the dirty pages have been written
into the disk after it becomes readonly, so the sync operation will do nothing
if the filesystem is already readonly. Though it may waste lots of time,
as a corner case, we needn't care.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Locking and unlocking delayed ref mutex are in the different functions,
and the name of lock functions is not uniform, so the readability is not
so good, this patch optimizes the lock logic and makes it more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.
Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The delayed item commit code in several functions is similar, so
cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we do not want to delay the async transaction commit, we should
use common work, not delayed work.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We clear the transaction object and the trans handle when they are about to be
freed, it is unnecessary, cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages. It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.
Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too. When this
happens, we drop some of the important blocks written by mkfs.
This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread. This way we don't have to care
about the blocksize of the device.
This also drops the invalidate_bdev() call. It wasn't very polite to
invalidate during the scan either. mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.
btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.
This commit moves the d_instantiate after the transaction closes.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.
Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.
Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data. The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size. Fix this by
checking if the extent ends past the disk_i_size. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data. The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update. So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it. The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:
* They are read in during a read/modify/write cycle
* They are written during a read/modify/write cycle
* They are involved in a parity rebuild
Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.
This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.
The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.
Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.
Fix this problem by splice this delalloc list.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.
Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.
So we need more check for ->aborted when committing the transaction. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent. This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent. Without this we could end up with csums for
bytenrs that don't have extents to cover them yet. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a BUG_ON(ret) that occured during tree log replay. Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry. We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility. Thanks,
Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input. Bring it back.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add. Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading. Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge). Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away. Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.
An instance of -EEXIST was reported in the wild:
https://bugzilla.redhat.com/show_bug.cgi?id=874407
I have no idea if that -EEXIST is surprising, or not. Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).
This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
xfstests case 285 complains.
It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.
Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd avoid us empty looping.
Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Running xfstests 83 in a loop would sometimes fail the fsck. This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up. The problem with this is there is plenty of time
for the truncate to fail after we've done this work. So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe. This
fixes the btrfsck failures I was seeing while running 83 in a loop. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to give up the write access after we find some device operation
is going on. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# btrfs sub create <mnt>/subv0
# btrfs sub snap <mnt> <mnt>/subv0/snap0
# change <mnt>/subv0 from R/W to R/O
# btrfs sub del <mnt>/subv0/snap0
We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Pull VFS update from Al Viro:
"fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
misc stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
vfs: make lremovexattr retry once on ESTALE error
vfs: make removexattr retry once on ESTALE
vfs: make llistxattr retry once on ESTALE error
vfs: make listxattr retry once on ESTALE error
vfs: make lgetxattr retry once on ESTALE
vfs: make getxattr retry once on an ESTALE error
vfs: allow lsetxattr() to retry once on ESTALE errors
vfs: allow setxattr to retry once on ESTALE errors
vfs: allow utimensat() calls to retry once on an ESTALE error
vfs: fix user_statfs to retry once on ESTALE errors
vfs: make fchownat retry once on ESTALE errors
vfs: make fchmodat retry once on ESTALE errors
vfs: have chroot retry once on ESTALE error
vfs: have chdir retry lookup and call once on ESTALE error
vfs: have faccessat retry once on an ESTALE error
vfs: have do_sys_truncate retry once on an ESTALE error
vfs: fix renameat to retry on ESTALE errors
vfs: make do_unlinkat retry once on ESTALE errors
vfs: make do_rmdir retry once on ESTALE errors
vfs: add a flags argument to user_path_parent
...
Pull two btrfs reverts from Chris Mason:
"I had missed that for two of the patches in my last pull, we had
included different fixes during 3.7."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 6a7a665d78.
This was bug was fixed differently in 3.6, so this commit
isn't needed.
Conflicts:
fs/btrfs/ctree.c
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 95c80bb1f6.
The bug addressed by this commit was fixed differently back in 3.6
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts
The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>