If we are heavily fragmented we will continually try to prealloc the largest
extent size we can every time we call btrfs_reserve_extent. This can be very
expensive when we are heavily fragmented, burning lots of CPU cycles and loops
through the allocator. So instead notice when we get a smaller chunk from the
allocator than what we specified and use this as the new maximum size we try to
allocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.
So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock io_tree, so it can't be called
with io_tree lock hold.
This patch will add a new function
btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease
the lockdep warning.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For btrfs_invalidatepage() and its variant evict_inode_truncate_page(),
there will be pages don't reach disk.
In that case, their reserved space won't be release nor freed by
finish_ordered_io() nor delayed_ref handler.
So we must free their qgroup reserved space, or we will leaking reserved
space again.
So this will patch will call btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_page().
And due to the nature of new btrfs_qgroup_reserve/free_data() reserved
space will only be reserved or freed once, so for pages which are
already flushed to disk, their reserved space will be released and freed
by delayed_ref handler.
Double free won't be a problem.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For NOCOW and inline case, there will be no delayed_ref created for
them, so we should free their reserved data space at proper
time(finish_ordered_io for NOCOW and cow_file_inline for inline).
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.
Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.
Now, the whole btrfs qgroup is swithed to use the new reserve facilities.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qgroup reserved space needs to be released from inode dirty map and get
freed at different timing:
1) Release when the metadata is written into tree
After corresponding metadata is written into tree, any newer write will
be COWed(don't include NOCOW case yet).
So we must release its range from inode dirty range map, or we will
forget to reserve needed range, causing accounting exceeding the limit.
2) Free reserved bytes when delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are all done at
commit_transaction() time, we are safe to free reserved space in
run_delayed_ref time().
With these timing to release/free reserved space, we should be able to
resolve the long existing qgroup reserve space leak problem.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can safely iterate whole list items, without using list_del macro.
So remove the list_del call.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When truncating a file to a smaller size which consists of an inline
extent that is compressed, we did not discard (or made unusable) the
data between the new file size and the old file size, wasting metadata
space and allowing for the truncated data to be leaked and the data
corruption/loss mentioned below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.
Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:
1) User A creates a file which consists of an inline and compressed
extent with a size of 2000 bytes - the file is not accessible to
any other users (no read, write or execution permission for anyone
else);
2) The user truncates the file to a size of 1000 bytes;
3) User A makes the file world readable;
4) User B creates a file consisting of an inline extent of 2000 bytes;
5) User B issues a clone operation from user A's file into its own
file (using a length argument of 0, clone the whole range);
6) User B now gets to see the 1000 bytes that user A truncated from
its file before it made its file world readbale. User B also lost
the bytes in the range [1000, 2000[ bytes from its own file, but
that might be ok if his/her intention was reading stale data from
user A that was never supposed to be public.
Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.
This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").
So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing, if the data isn't already
in the page cache, compressing the truncated version of the extent, writing
the compressed content into the inline extent and then truncate it.
The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"
# Create our test files. File foo is going to be the source of a clone operation
# and consists of a single inline extent with an uncompressed size of 512 bytes,
# while file bar consists of a single inline extent with an uncompressed size of
# 256 bytes. For our test's purpose, it's important that file bar has an inline
# extent with a size smaller than foo's inline extent.
$XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \
-c "pwrite -S 0x2a 128 384" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
# Now durably persist all metadata and data. We do this to make sure that we get
# on disk an inline extent with a size of 512 bytes for file foo.
sync
# Now truncate our file foo to a smaller size. Because it consists of a
# compressed and inline extent, btrfs did not shrink the inline extent to the
# new size (if the extent was not compressed, btrfs would shrink it to 128
# bytes), it only updates the inode's i_size to 128 bytes.
$XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
# Now clone foo's inline extent into bar.
# This clone operation should fail with errno EOPNOTSUPP because the source
# file consists only of an inline extent and the file's size is smaller than
# the inline extent of the destination (128 bytes < 256 bytes). However the
# clone ioctl was not prepared to deal with a file that has a size smaller
# than the size of its inline extent (something that happens only for compressed
# inline extents), resulting in copying the full inline extent from the source
# file into the destination file.
#
# Note that btrfs' clone operation for inline extents consists of removing the
# inline extent from the destination inode and copy the inline extent from the
# source inode into the destination inode, meaning that if the destination
# inode's inline extent is larger (N bytes) than the source inode's inline
# extent (M bytes), some bytes (N - M bytes) will be lost from the destination
# file. Btrfs could copy the source inline extent's data into the destination's
# inline extent so that we would not lose any data, but that's currently not
# done due to the complexity that would be needed to deal with such cases
# (specially when one or both extents are compressed), returning EOPNOTSUPP, as
# it's normally not a very common case to clone very small files (only case
# where we get inline extents) and copying inline extents does not save any
# space (unlike for normal, non-inlined extents).
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Now because the above clone operation used to succeed, and due to foo's inline
# extent not being shinked by the truncate operation, our file bar got the whole
# inline extent copied from foo, making us lose the last 128 bytes from bar
# which got replaced by the bytes in range [128, 256[ from foo before foo was
# truncated - in other words, data loss from bar and being able to read old and
# stale data from foo that should not be possible to read anymore through normal
# filesystem operations. Contrast with the case where we truncate a file from a
# size N to a smaller size M, truncate it back to size N and then read the range
# [M, N[, we should always get the value 0x00 for all the bytes in that range.
# We expected the clone operation to fail with errno EOPNOTSUPP and therefore
# not modify our file's bar data/metadata. So its content should be 256 bytes
# long with all bytes having the value 0xbb.
#
# Without the btrfs bug fix, the clone operation succeeded and resulted in
# leaking truncated data from foo, the bytes that belonged to its range
# [128, 256[, and losing data from bar in that same range. So reading the
# file gave us the following content:
#
# 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
# *
# 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
# *
# 0000400
echo "File bar's content after the clone operation:"
od -t x1 $SCRATCH_MNT/bar
# Also because the foo's inline extent was not shrunk by the truncate
# operation, btrfs' fsck, which is run by the fstests framework everytime a
# test completes, failed reporting the following error:
#
# root 5 inode 257 errors 400, nbytes wrong
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs fixes from Chris Mason:
"This is an assorted set I've been queuing up:
Jeff Mahoney tracked down a tricky one where we ended up starting IO
on the wrong mapping for special files in btrfs_evict_inode. A few
people reported this one on the list.
Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion. Chandan killed off an accounting bug during DIO
that lead to WARN_ONs as we freed inodes"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: keep dropped roots in cache until transaction commit
Btrfs: Direct I/O: Fix space accounting
btrfs: skip waiting on ordered range for special files
Btrfs: fix read corruption of compressed and shared extents
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
Btrfs: don't initialize a space info as full to prevent ENOSPC
The following call trace is seen when generic/095 test is executed,
WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
Modules linked in:
CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
Call Trace:
[<ffffffff81984058>] dump_stack+0x45/0x57
[<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
[<ffffffff81050465>] warn_slowpath_null+0x15/0x20
[<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
[<ffffffff8117ce07>] destroy_inode+0x37/0x60
[<ffffffff8117cf39>] evict+0x109/0x170
[<ffffffff8117cfd5>] dispose_list+0x35/0x50
[<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
[<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
[<ffffffff81165951>] kill_anon_super+0x11/0x20
[<ffffffff81302093>] btrfs_kill_super+0x13/0x110
[<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
[<ffffffff811660cf>] deactivate_super+0x5f/0x70
[<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
[<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
[<ffffffff81069c06>] task_work_run+0x96/0xb0
[<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
[<ffffffff8198cbc2>] int_signal+0x12/0x17
This means that the inode had non-zero "outstanding extents" during
eviction. This occurs because, during direct I/O a task which successfully
used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
not clear the bit after finishing the DIO write. A future DIO write could
actually fail and the unused reserve space won't be freed because of the
previously set BTRFS_INODE_DIO_READY bit.
Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
following issue,
|-----------------------------------+-------------------------------------|
| Task A | Task B |
|-----------------------------------+-------------------------------------|
| Start direct i/o write on inode X.| |
| reserve space | |
| Allocate ordered extent | |
| release reserved space | |
| Set BTRFS_INODE_DIO_READY bit. | |
| | splice() |
| | Transfer data from pipe buffer to |
| | destination file. |
| | - kmap(pipe buffer page) |
| | - Start direct i/o write on |
| | inode X. |
| | - reserve space |
| | - dio_refill_pages() |
| | - sdio->blocks_available == 0 |
| | - Since a kernel address is |
| | being passed instead of a |
| | user space address, |
| | iov_iter_get_pages() returns |
| | -EFAULT. |
| | - Since BTRFS_INODE_DIO_READY is |
| | set, we don't release reserved |
| | space. |
| | - Clear BTRFS_INODE_DIO_READY bit.|
| -EIOCBQUEUED is returned. | |
|-----------------------------------+-------------------------------------|
Hence this commit introduces "struct btrfs_dio_data" to track the usage of
reserved data space. The remaining unused "reserve space" can now be freed
reliably.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.
filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
next depends on whether there's an open file handle associated with the
inode. If there is, we write to the block device, which is unexpected
behavior. If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.
Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.
Cc: <stable@vger.kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs cleanups and fixes from Chris Mason:
"These are small cleanups, and also some fixes for our async worker
thread initialization.
I was having some trouble testing these, but it ended up being a
combination of changing around my test servers and a shiny new
schedule while atomic from the new start/finish_plug in
writeback_sb_inodes().
That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
changing a bunch of things in my test setup at once it would have been
really clear. Fix for writeback_sb_inodes() on the way as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
btrfs: async_thread: Fix workqueue 'max_active' value when initializing
btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance
btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
btrfs: Update out-of-date "skip parity stripe" comment
Pull btrfs updates from Chris Mason:
"This has Jeff Mahoney's long standing trim patch that fixes corners
where trims were missing. Omar has some raid5/6 fixes, especially for
using scrub and device replace when devices are missing.
Zhao Lie continues cleaning and fixing things, this series fixes some
really hard to hit corners in xfstests. I had to pull it last merge
window due to some deadlocks, but those are now resolved.
I added support for Tejun's new blkio controllers. It seems to work
well for single devices, we'll expand to multi-device as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
btrfs: fix compile when block cgroups are not enabled
Btrfs: fix file read corruption after extent cloning and fsync
Btrfs: check if previous transaction aborted to avoid fs corruption
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
btrfs: Prevent from early transaction abort
btrfs: Remove unused arguments in tree-log.c
btrfs: Remove useless condition in start_log_trans()
Btrfs: add support for blkio controllers
Btrfs: remove unused mutex from struct 'btrfs_fs_info'
Btrfs: fix parity scrub of RAID 5/6 with missing device
Btrfs: fix device replace of a missing RAID 5/6 device
Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
Btrfs: count devices correctly in readahead during RAID 5/6 replace
Btrfs: remove misleading handling of missing device scrub
btrfs: fix clone / extent-same deadlocks
Btrfs: fix defrag to merge tail file extent
Btrfs: fix warning in backref walking
btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
btrfs: Remove root argument in extent_data_ref_count()
btrfs: Fix wrong comment of btrfs_alloc_tree_block()
...
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.
Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.
Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.
The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.
Signed-off-by: Chris Mason <clm@fb.com>
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.
This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.
The following test case for fstests triggers the problem:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with 2 hard links.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Now remove one of the links, trigger inode eviction and then fsync
# our inode.
unlink $SCRATCH_MNT/testdir/bar
echo 2 > /proc/sys/vm/drop_caches
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
# Silently drop all writes on our scratch device to simulate a power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify our directory entries.
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
# If we remove our inode, its parent should become empty and therefore we should
# be able to remove the parent.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test failed on btrfs with:
generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
--- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100
@@ -1,3 +1,6 @@
QA output created by 098
Entries in testdir:
+bar
foo
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
(Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
$ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
Checking filesystem on /dev/sdc
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.
This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
Example reproducer:
$ mkfs.btrfs -O no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create our test file with some data and durably persist it.
$ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
$ sync
# Append some data to the file, increasing its size, and leave a hole
# between the old size and the start offset if the following write. So
# our file gets a hole in the range [128Kb, 256Kb[.
$ xfs_io -c "truncate 160K" /mnt/foo
# We expect to see our file with a size of 160Kb, with the first 128Kb
# of data all having the value 0xaa and the remaining 32Kb of data all
# having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0500000
# Now cleanly unmount and mount again the filesystem.
$ umount /mnt
$ mount /dev/sdd /mnt
# We expect to get the same result as before, a file with a size of
# 160Kb, with the first 128Kb of data all having the value 0xaa and the
# remaining 32Kb of data all having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000
In the example above the file size/data do not match what they were before
the remount.
Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().
Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().
This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The hang is uncoverd by generic/019.
btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().
This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The comment was not correct about the part where it says the endio
callback of the bio might have not yet been called - update it
to mention that by that time the endio callback execution might
still be in progress only.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:
[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636] [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636] [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
[161779.860636] [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636] [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636] [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636] [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636] [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636] [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636] [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636] [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[161779.860636] [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636] [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)
We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:
unreferenced object 0xffff8803f553bf80 (size 96):
comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
hex dump (first 32 bytes):
88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00 .l..............
00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00 ................
backtrace:
[<ffffffff81161ffe>] create_object+0x172/0x29a
[<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
[<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
[<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
[<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
[<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
[<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
[<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
[<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
[<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
[<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
[<ffffffff8116586a>] __vfs_write+0x7c/0xa5
[<ffffffff81165da9>] vfs_write+0xa0/0xe4
[<ffffffff81166675>] SyS_pwrite64+0x64/0x82
[<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
[<ffffffffffffffff>] 0xffffffffffffffff
For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.
So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:
[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000
[ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:
1) readpages() is called while a reference on the inode is held, so
eviction can not be triggered before readpages() returns. It also
locks one or more ranges in the inode's io_tree (which is done at
extent_io.c:__do_contiguous_readpages());
2) readpages() submits several read bios, all with an end io callback
that runs extent_io.c:end_bio_extent_readpage() and that is executed
by other task when a bio finishes, corresponding to a work queue
(fs_info->end_io_workers) worker kthread. This callback unlocks
the ranges in the inode's io_tree that were previously locked in
step 1;
3) readpages() returns, the reference on the inode is dropped;
4) One or more of the read bios previously submitted are still not
complete (their end io callback was not yet invoked or has not
yet finished execution);
5) Inode eviction is triggered (through an unlink call for example).
The inode reference count was not incremented before submitting
the read bios, therefore this is possible;
6) The eviction handler starts executing and enters the loop that
iterates over all extent states in the inode's io_tree;
7) The loop picks one extent state record and uses its ->start and
->end fields, after releasing the inode's io_tree spinlock, to
call lock_extent_bits() and clear_extent_bit(). The call to lock
the range [state->start, state->end] blocks because the whole
range or a part of it was locked by the previous call to
readpages() and the corresponding end io callback, which unlocks
the range was not yet executed;
8) The end io callback for the read bio is executed and unlocks the
range [state->start, state->end] (or a superset of that range).
And at clear_extent_bit() the extent_state record state is used
as a second argument to split_state(), which sets state->start to
a larger value;
9) The task executing the eviction handler is woken up by the task
executing the bio's end io callback (through clear_state_bit) and
the eviction handler locks the range
[old value for state->start, state->end]. Shortly after, when
calling clear_extent_bit(), it unlocks the range
[new value for state->start, state->end], so it ends up unlocking
only part of the range that it locked, leaving an extent state
record in the io_tree that represents the unlocked subrange;
10) The eviction handler loop, in its next iteration, gets the
extent_state record for the subrange that it did not unlock in the
previous step and then tries to lock it, resulting in an hang.
So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.
Many thanks to Zygo Blaxell for reporting, in the mailing list, the
issue, a set of scripts to trigger it and testing this fix.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"A few more btrfs fixes.
These range from corners Filipe found in the new free space cache
writeback to a grab bag of fixes from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
btrfs: unlock i_mutex after attempting to delete subvolume during send
btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
btrfs: fix race on ENOMEM in alloc_extent_buffer
btrfs: handle ENOMEM in btrfs_alloc_tree_block
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Btrfs: don't check for delalloc_bytes in cache_save_setup
Btrfs: fix deadlock when starting writeback of bg caches
Btrfs: fix race between start dirty bg cache writeout and bg deletion
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
Problem:
(1). test_fd = open(fname, O_RDWR|O_DIRECT)
(2). pwrite(test_fd, buf, 4096, 0)
(3). close(test_fd)
(4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches"
(5). test_fd = open(fname, O_RDWR|O_DIRECT)
(6). fsync(test_fd);
<-------- we did not get the correct log entry for the file
Reason:
When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.
Fix:
This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs updates from Chris Mason:
"I've been running these through a longer set of load tests because my
commits change the free space cache writeout. It fixes commit stalls
on large filesystems (~20T space used and up) that we have been
triggering here. We were seeing new writers blocked for 10 seconds or
more during commits, which is far from good.
Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
more), that are triggered because our metadata reservations were not
properly accounting for crcs and were not replenishing during the
truncate.
Also in this series, a number of qgroup fixes from Fujitsu and Dave
Sterba collected most of the pending cleanups from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
btrfs: quota: Update quota tree after qgroup relationship change.
btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
btrfs: Update btrfs qgroup status item when rescan is done.
btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
btrfs: Check qgroup level in kernel qgroup assign.
btrfs: qgroup: allow to remove qgroup which has parent but no child.
btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
btrfs: qgroup: do a reservation in a higher level.
Btrfs: qgroup, Account data space in more proper timings.
Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
Btrfs: qgroup: free reserved in exceeding quota.
Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
btrfs: qgroup: fix limit args override whole limit struct
btrfs: qgroup: update limit info in function btrfs_run_qgroups().
btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
btrfs: Support busy loop of write and delete
...
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are two problems in qgroup:
a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
is not available to user.
b). When user is writing a inline data, qgroup will not reserve it,
it means this is a window we can exceed the limit of a qgroup.
The main idea of this patch is reserving the data size of write_bytes
rather than the reserve_bytes. It means qgroup will not care about
the data size btrfs will reserve for user, but only care about the
data size user is going to write. Then reserve it when user want to
write and release it in transaction committed.
In this way, qgroup can be released from the complex procedure in
btrfs and only do the reserve when user want to write and account
when the data is written in commit_transaction().
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, for pre_alloc or delay_alloc, the bytes will be accounted
in space_info by the three guys.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
But on the other hand, in qgroup, there are only two counters to account the
bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved is not accounted
in qgroup. If so, there is a window we can exceed the quota limit when
bytes is in space_info->reserved.
Example:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 20987904 20987904 0 10485760 --- ---
qg->excl is 20987904 larger than max_excl 10485760.
This patch introduce a new counter named may_use to qgroup, then
there are three counters in qgroup to account bytes in space_info
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use --- qgroup->reserved --- qgroup->excl
With this patch applied:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 9453568 9453568 0 10485760 --- ---
Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
while true; do
dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
rm /btrfs_dir/file
sync
done
And we'll see dd failed because btrfs return NO_SPACE.
Reason:
Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
in end to free fs space for next write, but sometimes it hadn't
done work on time, because btrfs-cleaner thread get delayed-iputs
from list before, but do iput() after next write.
This is log:
[ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
[ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
[ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
[ 2569.087554] comm=sync func=btrfs_commit_transaction() end
[ 2569.191081] comm=dd begin
[ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
[ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
[ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
...
[ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
[ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
Fix:
Make btrfs_commit_transaction() wait current running btrfs-cleaner's
delayed-iputs() done in end.
Test:
Use script similar to above(more complex),
before patch:
7 failed in 100 * 20 loop.
after patch:
0 failed in 100 * 20 loop.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv. We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.
Signed-off-by: Chris Mason <clm@fb.com>
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.
For very large files, we can easily go through all of that space as we
loop through the extents. This changes truncate to refill the space
reservation as it progresses through the file.
Signed-off-by: Chris Mason <clm@fb.com>
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs. This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.
This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.
Signed-off-by: Chris Mason <clm@fb.com>