2018-04-04 01:23:33 +08:00
// SPDX-License-Identifier: GPL-2.0
2008-06-12 09:53:53 +08:00
/*
 * Copyright (C) 2007 Oracle. All rights reserved.
 */

#include <linux/kernel.h>
#include <linux/bio.h>
#include <linux/file.h>
#include <linux/fs.h>
2008-10-10 01:39:39 +08:00
#include <linux/fsnotify.h>
2008-06-12 09:53:53 +08:00
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/time.h>
#include <linux/string.h>
#include <linux/backing-dev.h>
2008-10-10 01:39:39 +08:00
#include <linux/mount.h>
#include <linux/namei.h>
2008-06-12 09:53:53 +08:00
#include <linux/writeback.h>
#include <linux/compat.h>
2008-10-10 01:39:39 +08:00
#include <linux/security.h>
2008-06-12 09:53:53 +08:00
#include <linux/xattr.h>
2017-06-01 01:32:09 +08:00
#include <linux/mm.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
2011-03-24 18:24:28 +08:00
#include <linux/blkdev.h>
2012-07-25 23:35:53 +08:00
#include <linux/uuid.h>
2013-01-29 14:04:50 +08:00
#include <linux/btrfs.h>
2013-08-07 02:42:51 +08:00
#include <linux/uaccess.h>
2018-01-29 19:41:30 +08:00
#include <linux/iversion.h>
2021-04-07 20:36:43 +08:00
#include <linux/fileattr.h>
2021-07-01 04:01:49 +08:00
#include <linux/fsverity.h>
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite              btrfs_encoded_read
---------------------------------------------------
(enter)                         (enter)
                                btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
                                lock_extent_bits
                                read extent (dirty page hasn't been flushed,
                                             so this is the old data)
                                unlock_extent_cached
                                (exit)
In that case we read the old data from before the page was dirtied. But
that's true even if we were to hold the mmap lock:
btrfs_page_mkwrite               btrfs_encoded_read
-------------------------------------------------------------------
(enter)                          (enter)
                                 btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
                                 btrfs_wait_ordered_range
                                 lock_extent_bits
                                 read extent (page hasn't been dirtied,
                                              so this is the old data)
                                 unlock_extent_cached
                                 btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
#include <linux/sched/xacct.h>
2008-06-12 09:53:53 +08:00
#include "ctree.h"
#include "disk-io.h"
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
@subvol1/
@subvol2/
@subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow to delete subvolume using
subvolid.
Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach to the desired one using the subvolume id, and then
delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.
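
From userspace, the call described above might look like the following minimal sketch. It assumes installed kernel headers that are new enough to define BTRFS_IOC_SNAP_DESTROY_V2 and the subvolid union member; the helper name destroy_subvol_by_id is hypothetical and not part of this patch, and mnt_fd is assumed to be any open descriptor on the mounted filesystem.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Hypothetical helper (illustration only): ask the kernel to delete the
 * subvolume with the given id. mnt_fd is an open descriptor somewhere on
 * the mounted btrfs filesystem. Returns 0 on success, or -1 with errno
 * set on failure.
 */
int destroy_subvol_by_id(int mnt_fd, uint64_t subvolid)
{
	struct btrfs_ioctl_vol_args_v2 args;

	memset(&args, 0, sizeof(args));
	/* Tell the kernel to read the union as a subvolume id, not a name. */
	args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
	args.subvolid = subvolid;

	return ioctl(mnt_fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
}
```

A caller would typically open the mount point with open(2), pass the resulting descriptor together with the subvolume id, and inspect errno when -1 is returned.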
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-07 21:05:46 +08:00
#include "export.h"
2008-06-12 09:53:53 +08:00
#include "transaction.h"
#include "btrfs_inode.h"
#include "print-tree.h"
#include "volumes.h"
2008-06-26 04:01:30 +08:00
#include "locking.h"
2011-07-07 22:48:38 +08:00
#include "backref.h"
2012-06-05 02:03:51 +08:00
#include "rcu-string.h"
2012-07-26 05:19:24 +08:00
#include "send.h"
2012-11-06 22:08:53 +08:00
#include "dev-replace.h"
Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
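
Because properties are stored as xattrs with the "btrfs." prefix, a userspace program can in principle set the compression property through the generic xattr interface. The sketch below assumes that writing the "btrfs.compression" xattr directly has the same effect as setting the property with the btrfs-progs tool; the helper name set_compression_prop is hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Hypothetical helper (illustration only): set the "compression" property
 * on a file or directory by writing its backing xattr. algo is expected to
 * be one of the values named in the commit message, "lzo" or "zlib".
 * Returns 0 on success, or -1 with errno set on failure.
 */
int set_compression_prop(const char *path, const char *algo)
{
	/* The xattr value is the raw string, without a trailing NUL. */
	return setxattr(path, "btrfs.compression", algo, strlen(algo), 0);
}
```

On a filesystem other than btrfs, or on a path that does not exist, the call simply fails and reports the reason through errno.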
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
Then unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)

                       file creation time    ls -lha time
   10 000 files               3.49               0.76
  100 000 files              47.19               8.37
1 000 000 files             518.51             107.06

* With 1 property (compression property set to lzo - test 2)

                       file creation time    ls -lha time
   10 000 files               3.63               0.93
  100 000 files              48.56               9.74
1 000 000 files             537.72             125.11

* With 4 properties (test 3)

                       file creation time    ls -lha time
   10 000 files               3.94               1.20
  100 000 files              52.14              11.48
1 000 000 files             572.70             142.13

* With 10 properties (test 4)

                       file creation time    ls -lha time
   10 000 files               4.61               1.35
  100 000 files              58.86              13.83
1 000 000 files             656.01             177.61
The increased latencies with properties are essentially because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir index items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
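
The byte counts quoted above can be reproduced with a small sketch. The struct definitions below are assumed mirrors of the packed on-disk layouts (a 17-byte key, a 25-byte leaf item header, a 30-byte dir item header); they and the function names are illustrative, not copied from the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed mirror of the on-disk key: 8 + 1 + 8 = 17 bytes when packed. */
struct demo_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((__packed__));

/* Assumed mirror of the per-item header kept in a leaf: 17 + 4 + 4 = 25. */
struct demo_item {
	struct demo_disk_key key;
	uint32_t offset;
	uint32_t size;
} __attribute__((__packed__));

/* Assumed mirror of the dir item header reused for xattrs: 30 bytes. */
struct demo_dir_item {
	struct demo_disk_key location;
	uint64_t transid;
	uint16_t data_len;
	uint16_t name_len;
	uint8_t type;
} __attribute__((__packed__));

/* Leaf space used by one xattr item: both headers plus name and value. */
size_t xattr_item_leaf_bytes(const char *name, const char *value)
{
	return sizeof(struct demo_item) + sizeof(struct demo_dir_item) +
	       strlen(name) + strlen(value);
}

/* Bytes saved by dropping the 'location' and 'type' fields. */
size_t savings_without_location_and_type(void)
{
	return sizeof(struct demo_disk_key) + sizeof(uint8_t);
}

/* Bytes saved when 'transid' is dropped as well. */
size_t savings_without_transid_too(void)
{
	return savings_without_location_and_type() + sizeof(uint64_t);
}
```

Under these assumptions, the name "btrfs.compression" (17 bytes) with the value "lzo" (3 bytes) accounts for 25 + 30 + 17 + 3 = 75 bytes, which matches the figure in the commit message.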
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
    my $p = TEST_DIR . '/file_' . $i;
    open(my $f, '>', $p) or die "Error opening file!";
    $f->autoflush(1);
    for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
        print $f ('A' x 4096) or die "Error writing to file!";
    }
    close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-07 19:47:46 +08:00
#include "props.h"
2013-11-02 01:07:02 +08:00
#include "sysfs.h"
2014-05-14 08:30:47 +08:00
#include "qgroup.h"
Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ btrfs subvolume snapshot /mnt /mnt/testdir/snap
$ btrfs subvolume delete /mnt/testdir/snap
$ xfs_io -c "fsync" /mnt/testdir
< crash / power failure and reboot >
$ mount /dev/sdc /mnt
mount: mount(2) failed: No such file or directory
And in dmesg/syslog we get the following message and trace:
[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[192066.363010] ------------[ cut here ]------------
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
[192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
[192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
[192066.385827] Call Trace:
[192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79
[192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
[192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
[192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266
[192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
[192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2
[192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
[192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1
[192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
[192066.444613] BTRFS: open_ctree failed
This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.
So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
roots, amongst other steps.
A test case for xfstests that triggers the issue follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
    _cleanup_flakey
    cd /
    rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create a snapshot at the root of our filesystem (mount point path), delete it,
# fsync the mount point path, crash and mount to replay the log. This should
# succeed and after the filesystem is mounted the snapshot should not be visible
# anymore.
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/snap1 ] && \
echo "Snapshot snap1 still exists after log replay"
# Similar scenario as above, but this time the snapshot is created inside a
# directory and not directly under the root (mount point path).
mkdir $SCRATCH_MNT/testdir
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/testdir/snap2 ] && \
echo "Snapshot snap2 still exists after log replay"
_unmount_flakey
echo "Silence is golden"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-02-10 18:42:25 +08:00
#include "tree-log.h"
2016-03-10 17:26:59 +08:00
#include "compression.h"
2019-06-19 04:09:16 +08:00
#include "space-info.h"
2019-06-20 03:12:00 +08:00
#include "delalloc-space.h"
2019-06-21 03:37:44 +08:00
#include "block-group.h"
2021-08-06 16:12:37 +08:00
#include "subpage.h"
2008-06-12 09:53:53 +08:00

2014-01-31 04:17:00 +08:00
#ifdef CONFIG_64BIT
/* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
 * structures are incorrect, as the timespec structure from userspace
 * is 4 bytes too small. We define these alternatives here to teach
 * the kernel about the 32-bit struct packing.
 */
struct btrfs_ioctl_timespec_32 {
	__u64 sec;
	__u32 nsec;
} __attribute__ ((__packed__));

struct btrfs_ioctl_received_subvol_args_32 {
	char	uuid[BTRFS_UUID_SIZE];	/* in */
	__u64	stransid;		/* in */
	__u64	rtransid;		/* out */
	struct btrfs_ioctl_timespec_32 stime; /* in */
	struct btrfs_ioctl_timespec_32 rtime; /* out */
	__u64	flags;			/* in */
	__u64	reserved[16];		/* in */
} __attribute__ ((__packed__));

#define BTRFS_IOC_SET_RECEIVED_SUBVOL_32 _IOWR(BTRFS_IOCTL_MAGIC, 37, \
				struct btrfs_ioctl_received_subvol_args_32)
#endif
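
The size mismatch that the comment above describes can be illustrated with a standalone sketch. The demo_ structures are hypothetical userspace mirrors of the definitions above, using a 16-byte uuid; the natural (unpacked) sizes depend on the ABI, so only the packed sizes are asserted in the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Packed layout, as seen by a 32-bit userspace: 8 + 4 = 12 bytes. */
struct demo_timespec_packed {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/* Natural layout: padded to 16 bytes where uint64_t is 8-byte aligned. */
struct demo_timespec_native {
	uint64_t sec;
	uint32_t nsec;
};

/* Packed mirror of the received-subvol arguments. */
struct demo_received_args_packed {
	char uuid[16];
	uint64_t stransid;
	uint64_t rtransid;
	struct demo_timespec_packed stime;
	struct demo_timespec_packed rtime;
	uint64_t flags;
	uint64_t reserved[16];
} __attribute__((__packed__));

_Static_assert(sizeof(struct demo_timespec_packed) == 12,
	       "packed timespec must be 12 bytes");
_Static_assert(sizeof(struct demo_received_args_packed) == 192,
	       "packed args must be 192 bytes");

/* Padding that the native ABI adds to one timespec (4 on typical 64-bit). */
size_t timespec_padding_bytes(void)
{
	return sizeof(struct demo_timespec_native) -
	       sizeof(struct demo_timespec_packed);
}
```

On a typical 64-bit ABI the native timespec is 16 bytes, so each embedded timespec is 4 bytes larger than its 32-bit counterpart, which is why a separate packed definition and ioctl number are needed.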

2017-09-27 22:43:13 +08:00
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
struct btrfs_ioctl_send_args_32 {
	__s64 send_fd;			/* in */
	__u64 clone_sources_count;	/* in */
	compat_uptr_t clone_sources;	/* in */
	__u64 parent_root;		/* in */
	__u64 flags;			/* in */
2021-10-22 22:53:36 +08:00
	__u32 version;			/* in */
	__u8 reserved[28];		/* in */
2017-09-27 22:43:13 +08:00
} __attribute__ ((__packed__));

#define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
			       struct btrfs_ioctl_send_args_32)
2019-10-10 08:59:07 +08:00

struct btrfs_ioctl_encoded_io_args_32 {
	compat_uptr_t iov;
	compat_ulong_t iovcnt;
	__s64 offset;
	__u64 flags;
	__u64 len;
	__u64 unencoded_len;
	__u64 unencoded_offset;
	__u32 compression;
	__u32 encryption;
	__u8 reserved[64];
};

#define BTRFS_IOC_ENCODED_READ_32 _IOR(BTRFS_IOCTL_MAGIC, 64, \
				       struct btrfs_ioctl_encoded_io_args_32)
2019-08-14 07:00:02 +08:00
#define BTRFS_IOC_ENCODED_WRITE_32 _IOW(BTRFS_IOCTL_MAGIC, 64, \
					struct btrfs_ioctl_encoded_io_args_32)
2017-09-27 22:43:13 +08:00
#endif
2014-01-31 04:17:00 +08:00

2009-04-17 16:37:41 +08:00
/* Mask out flags that are inappropriate for the given type of inode. */
2018-03-27 00:52:15 +08:00
static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
						unsigned int flags)
2009-04-17 16:37:41 +08:00
{
2018-03-27 00:52:15 +08:00
	if (S_ISDIR(inode->i_mode))
2009-04-17 16:37:41 +08:00
		return flags;
2018-03-27 00:52:15 +08:00
	else if (S_ISREG(inode->i_mode))
2009-04-17 16:37:41 +08:00
		return flags & ~FS_DIRSYNC_FL;
	else
		return flags & (FS_NODUMP_FL | FS_NOATIME_FL);
}

/*
2018-03-27 01:12:25 +08:00
 * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
 * ioctl.
2009-04-17 16:37:41 +08:00
 */
btrfs: add ro compat flags to inodes
Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.
As for the implementation, it unfortunately gets a little complicated.
The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.
It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:
btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK
The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.
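
The effect can be reproduced outside the kernel with a short sketch. The mask value 0x80000abc is the illustrative one from the text above, not the real BTRFS_INODE_FLAG_MASK, and the function names are hypothetical. The last function sketches the general idea of checking each 32-bit half against its own unsigned mask; it is not the tree-checker code itself.

```c
#include <stdint.h>

/* Illustrative mask built from signed int bits, with bit 31 set. */
static const int32_t demo_signed_mask = INT32_MIN | 0xabc;
/* The same bit pattern held in an unsigned 32-bit type. */
static const uint32_t demo_unsigned_mask = 0x80000abcU;

/* Widening the signed mask sign-extends it: 0xffffffff80000abc. */
uint64_t widened_signed_mask(void)
{
	return (uint64_t)demo_signed_mask;
}

/*
 * The broken check: ~mask is evaluated as an int (0x7ffff543) and only
 * then widened, so the upper 32 bits of the complement are all zero and
 * unknown flags in the upper half are never reported.
 */
uint64_t unknown_bits_signed(uint64_t flags)
{
	return flags & ~demo_signed_mask;
}

/*
 * Sketch of a split check: compare each 32-bit half against its own
 * unsigned mask. ro_mask is a hypothetical mask for the upper half.
 */
int has_unknown_flags_split(uint64_t flags, uint32_t ro_mask)
{
	uint32_t lo = (uint32_t)flags;
	uint32_t hi = (uint32_t)(flags >> 32);

	return (lo & ~demo_unsigned_mask) != 0 || (hi & ~ro_mask) != 0;
}
```

With these definitions, a flag at bit 32 passes unnoticed through unknown_bits_signed() but is caught by has_unknown_flags_split() unless ro_mask explicitly allows it.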
This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted in exactly the matching way).
Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading ones
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.
With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-01 04:01:48 +08:00
static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
2009-04-17 16:37:41 +08:00
{
	unsigned int iflags = 0;
btrfs: add ro compat flags to inodes
Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.
As for the implementation, it unfortunately gets a little complicated.
The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.
It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:
btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK
The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.
This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum being corrupted in exactly the matching way).
Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading ones
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.
With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
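The sign-extension problem described in this commit message can be reproduced in a few lines of userspace C. This is a minimal sketch, not the kernel code: the `ex_` names are hypothetical, and the mask value 0x80000abc is borrowed from the message's "something like" example.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int) == 4, "sketch assumes a 32-bit int");

/* Illustrative mask with bit 31 set, held in a signed 32-bit int. */
static const int32_t ex_signed_mask = INT32_MIN | 0xabc;	/* bits 0x80000abc */

/* Widening the signed mask to 64 bits sign-extends it. */
static inline uint64_t ex_widen_signed_mask(void)
{
	return (uint64_t)ex_signed_mask;
}

/* Broken check: the complement never has any of bits 32..63 set. */
static inline uint64_t ex_unknown_bits_signed(uint64_t on_disk)
{
	return on_disk & ~ex_signed_mask;
}

/*
 * One possible fix (an assumption of this sketch): start from an
 * unsigned 64-bit mask so the complement keeps bits 32..63 set.
 */
static inline uint64_t ex_unknown_bits_unsigned(uint64_t on_disk)
{
	const uint64_t mask = 0x80000abcULL;

	return on_disk & ~mask;
}
```

With the signed mask, a stray bit 40 goes unnoticed, while the unsigned variant reports it. The commit itself takes a different route: it keeps two separate unsigned masks and checks each against its own u32.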
2021-07-01 04:01:48 +08:00
|
|
|
u32 flags = binode->flags;
|
2021-07-01 04:01:49 +08:00
|
|
|
u32 ro_flags = binode->ro_flags;
|
2009-04-17 16:37:41 +08:00
|
|
|
|
|
|
|
if (flags & BTRFS_INODE_SYNC)
|
|
|
|
iflags |= FS_SYNC_FL;
|
|
|
|
if (flags & BTRFS_INODE_IMMUTABLE)
|
|
|
|
iflags |= FS_IMMUTABLE_FL;
|
|
|
|
if (flags & BTRFS_INODE_APPEND)
|
|
|
|
iflags |= FS_APPEND_FL;
|
|
|
|
if (flags & BTRFS_INODE_NODUMP)
|
|
|
|
iflags |= FS_NODUMP_FL;
|
|
|
|
if (flags & BTRFS_INODE_NOATIME)
|
|
|
|
iflags |= FS_NOATIME_FL;
|
|
|
|
if (flags & BTRFS_INODE_DIRSYNC)
|
|
|
|
iflags |= FS_DIRSYNC_FL;
|
2011-04-15 11:03:06 +08:00
|
|
|
if (flags & BTRFS_INODE_NODATACOW)
|
|
|
|
iflags |= FS_NOCOW_FL;
|
2021-07-01 04:01:49 +08:00
|
|
|
if (ro_flags & BTRFS_INODE_RO_VERITY)
|
|
|
|
iflags |= FS_VERITY_FL;
|
2011-04-15 11:03:06 +08:00
|
|
|
|
2016-03-15 08:09:59 +08:00
|
|
|
if (flags & BTRFS_INODE_NOCOMPRESS)
|
2011-04-15 11:03:06 +08:00
|
|
|
iflags |= FS_NOCOMP_FL;
|
2016-03-15 08:09:59 +08:00
|
|
|
else if (flags & BTRFS_INODE_COMPRESS)
|
|
|
|
iflags |= FS_COMPR_FL;
|
2009-04-17 16:37:41 +08:00
|
|
|
|
|
|
|
return iflags;
|
|
|
|
}
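The translation performed by btrfs_inode_flags_to_fsflags() can be sketched as a small standalone function. This is only an illustration under stated assumptions: every `EX_` constant and the `ex_to_fsflags` name are hypothetical, the bit values are not the kernel's, and only one of the simple one-to-one flags is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical internal flag bits (not the real BTRFS_INODE_* values). */
#define EX_INODE_SYNC		(1U << 0)
#define EX_INODE_NOCOMPRESS	(1U << 1)
#define EX_INODE_COMPRESS	(1U << 2)
#define EX_INODE_RO_VERITY	(1U << 0)	/* lives in the separate ro_flags word */

/* Hypothetical user-visible bits (not the real FS_*_FL values). */
#define EX_FL_SYNC	(1U << 0)
#define EX_FL_NOCOMP	(1U << 1)
#define EX_FL_COMPR	(1U << 2)
#define EX_FL_VERITY	(1U << 3)

static_assert((EX_FL_NOCOMP & EX_FL_COMPR) == 0, "flags must be distinct bits");

struct ex_flag_map {
	uint32_t internal;
	unsigned int fsflag;
};

static inline unsigned int ex_to_fsflags(uint32_t flags, uint32_t ro_flags)
{
	/* Flags that translate one-to-one. */
	static const struct ex_flag_map simple[] = {
		{ EX_INODE_SYNC, EX_FL_SYNC },
	};
	unsigned int out = 0;
	size_t i;

	for (i = 0; i < sizeof(simple) / sizeof(simple[0]); i++)
		if (flags & simple[i].internal)
			out |= simple[i].fsflag;

	/* The verity bit is read from ro_flags, not from flags. */
	if (ro_flags & EX_INODE_RO_VERITY)
		out |= EX_FL_VERITY;

	/* NOCOMPRESS takes precedence over COMPRESS. */
	if (flags & EX_INODE_NOCOMPRESS)
		out |= EX_FL_NOCOMP;
	else if (flags & EX_INODE_COMPRESS)
		out |= EX_FL_COMPR;

	return out;
}
```

The `else if` mirrors the function above: if both compression bits were ever set internally, only the NOCOMP flag would be reported.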
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update inode->i_flags based on the btrfs internal flags.
|
|
|
|
*/
|
2018-03-27 00:40:21 +08:00
|
|
|
void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
|
2009-04-17 16:37:41 +08:00
|
|
|
{
|
2018-04-23 21:45:18 +08:00
|
|
|
struct btrfs_inode *binode = BTRFS_I(inode);
|
2014-06-26 05:36:02 +08:00
|
|
|
unsigned int new_fl = 0;
|
2009-04-17 16:37:41 +08:00
|
|
|
|
2018-04-23 21:45:18 +08:00
|
|
|
if (binode->flags & BTRFS_INODE_SYNC)
|
2014-06-26 05:36:02 +08:00
|
|
|
new_fl |= S_SYNC;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (binode->flags & BTRFS_INODE_IMMUTABLE)
|
2014-06-26 05:36:02 +08:00
|
|
|
new_fl |= S_IMMUTABLE;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (binode->flags & BTRFS_INODE_APPEND)
|
2014-06-26 05:36:02 +08:00
|
|
|
new_fl |= S_APPEND;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (binode->flags & BTRFS_INODE_NOATIME)
|
2014-06-26 05:36:02 +08:00
|
|
|
new_fl |= S_NOATIME;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (binode->flags & BTRFS_INODE_DIRSYNC)
|
2014-06-26 05:36:02 +08:00
|
|
|
new_fl |= S_DIRSYNC;
|
2021-07-01 04:01:49 +08:00
|
|
|
if (binode->ro_flags & BTRFS_INODE_RO_VERITY)
|
|
|
|
new_fl |= S_VERITY;
|
2014-06-26 05:36:02 +08:00
|
|
|
|
|
|
|
set_mask_bits(&inode->i_flags,
|
2021-07-01 04:01:49 +08:00
|
|
|
S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
|
|
|
|
S_VERITY, new_fl);
|
2009-04-17 16:37:41 +08:00
|
|
|
}
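The set_mask_bits() call above replaces only the bits named in the mask and leaves the rest of i_flags alone. A rough userspace analogue, assuming C11 atomics in place of the kernel's primitives, might look like the following; `ex_set_mask_bits` is a hypothetical name, and returning the previous value is a choice of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Atomically clear the bits in 'mask', then set 'bits'.
 * Returns the previous value of *word.
 */
static inline unsigned int ex_set_mask_bits(_Atomic unsigned int *word,
					    unsigned int mask,
					    unsigned int bits)
{
	unsigned int old = atomic_load(word);
	unsigned int updated;

	/* Callers are expected to set only bits covered by the mask. */
	assert((bits & ~mask) == 0);

	do {
		updated = (old & ~mask) | bits;
		/* On failure, 'old' is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(word, &old, updated));

	return old;
}
```

Bits outside the mask survive the update, which is why the function above can rebuild new_fl from scratch without disturbing unrelated i_flags bits.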
|
|
|
|
|
2020-07-10 15:49:56 +08:00
|
|
|
/*
|
|
|
|
* Check if @flags are a supported and valid set of FS_*_FL flags and that
|
|
|
|
* the old and new flags are not conflicting
|
|
|
|
*/
|
|
|
|
static int check_fsflags(unsigned int old_flags, unsigned int flags)
|
Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super. However, before doing that, we
should wait until that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in RAM today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control a file's or directory's datacow and compression attributes.
NOTE:
- The compression type is selected by these rules:
If we mount btrfs with a compress option, i.e. zlib/lzo, that type is used.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be overridden by inheritance from the parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
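From userspace, the generic ioctls mentioned in this commit message are typically used as a read-modify-write pair. The helper below is a minimal sketch for Linux; `ex_update_inode_flags` is a hypothetical name, and error handling is reduced to returning -1 with errno set.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Read the current FS_*_FL flags of an open file, set and clear the
 * requested bits, and write the result back.
 * Returns 0 on success, -1 with errno set on failure.
 */
static inline int ex_update_inode_flags(int fd, int set, int clear)
{
	int attr;

	/* Asking to set and clear the same bit is a caller bug. */
	assert((set & clear) == 0);

	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
		return -1;

	attr |= set;
	attr &= ~clear;

	if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
		return -1;

	return 0;
}
```

For example, `ex_update_inode_flags(fd, FS_NOCOW_FL, 0)` corresponds roughly to `chattr +C` on the file behind `fd`. Whether the request is accepted depends on the filesystem's own checks.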
2011-03-22 18:12:20 +08:00
|
|
|
{
|
|
|
|
if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \
|
|
|
|
FS_NOATIME_FL | FS_NODUMP_FL | \
|
|
|
|
FS_SYNC_FL | FS_DIRSYNC_FL | \
|
2011-04-15 11:02:49 +08:00
|
|
|
FS_NOCOMP_FL | FS_COMPR_FL |
|
|
|
|
FS_NOCOW_FL))
|
2011-03-22 18:12:20 +08:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2020-07-10 15:49:56 +08:00
|
|
|
/* COMPR and NOCOMP on new/old are valid */
|
2011-03-22 18:12:20 +08:00
|
|
|
if ((flags & FS_NOCOMP_FL) && (flags & FS_COMPR_FL))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-10 15:49:56 +08:00
|
|
|
if ((flags & FS_COMPR_FL) && (flags & FS_NOCOW_FL))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* NOCOW and compression options are mutually exclusive */
|
|
|
|
if ((old_flags & FS_NOCOW_FL) && (flags & (FS_COMPR_FL | FS_NOCOMP_FL)))
|
|
|
|
return -EINVAL;
|
|
|
|
if ((flags & FS_NOCOW_FL) && (old_flags & (FS_COMPR_FL | FS_NOCOMP_FL)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2011-03-22 18:12:20 +08:00
|
|
|
return 0;
|
|
|
|
}
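The rules enforced by check_fsflags() can be exercised outside the kernel with a standalone copy of the logic. In this sketch the `DEMO_` constants and the `demo_check_fsflags` name are hypothetical, and the bit values are arbitrary rather than the real FS_*_FL values.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the FS_*_FL bits used above. */
#define DEMO_FL_COMPR	(1U << 0)
#define DEMO_FL_NOCOMP	(1U << 1)
#define DEMO_FL_NOCOW	(1U << 2)
#define DEMO_FL_SYNC	(1U << 3)
#define DEMO_FL_SUPPORTED \
	(DEMO_FL_COMPR | DEMO_FL_NOCOMP | DEMO_FL_NOCOW | DEMO_FL_SYNC)

static_assert((DEMO_FL_SUPPORTED & DEMO_FL_NOCOW) != 0,
	      "NOCOW must be part of the supported set");

static inline int demo_check_fsflags(unsigned int old_flags, unsigned int flags)
{
	/* Any bit outside the supported set is rejected outright. */
	if (flags & ~DEMO_FL_SUPPORTED)
		return -EOPNOTSUPP;

	/* The new flags must not request both COMPR and NOCOMP. */
	if ((flags & DEMO_FL_NOCOMP) && (flags & DEMO_FL_COMPR))
		return -EINVAL;

	/* The new flags must not combine COMPR with NOCOW. */
	if ((flags & DEMO_FL_COMPR) && (flags & DEMO_FL_NOCOW))
		return -EINVAL;

	/* NOCOW and compression options are mutually exclusive across old/new. */
	if ((old_flags & DEMO_FL_NOCOW) &&
	    (flags & (DEMO_FL_COMPR | DEMO_FL_NOCOMP)))
		return -EINVAL;
	if ((flags & DEMO_FL_NOCOW) &&
	    (old_flags & (DEMO_FL_COMPR | DEMO_FL_NOCOMP)))
		return -EINVAL;

	return 0;
}
```

Note that switching from COMPR in the old flags to NOCOMP in the new flags is accepted, while requesting both in the same call is not.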
|
|
|
|
|
2020-11-10 19:26:11 +08:00
|
|
|
static int check_fsflags_compatible(struct btrfs_fs_info *fs_info,
|
|
|
|
unsigned int flags)
|
|
|
|
{
|
|
|
|
if (btrfs_is_zoned(fs_info) && (flags & FS_NOCOW_FL))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-04-07 20:36:43 +08:00
|
|
|
/*
|
|
|
|
* Set flags/xflags from the internal inode flags. The remaining items of
|
|
|
|
* fsxattr are zeroed.
|
|
|
|
*/
|
|
|
|
int btrfs_fileattr_get(struct dentry *dentry, struct fileattr *fa)
|
2009-04-17 16:37:41 +08:00
|
|
|
{
|
2021-04-07 20:36:43 +08:00
|
|
|
struct btrfs_inode *binode = BTRFS_I(d_inode(dentry));
|
|
|
|
|
2021-07-01 04:01:48 +08:00
|
|
|
fileattr_fill_flags(fa, btrfs_inode_flags_to_fsflags(binode));
|
2021-04-07 20:36:43 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_fileattr_set(struct user_namespace *mnt_userns,
|
|
|
|
struct dentry *dentry, struct fileattr *fa)
|
|
|
|
{
|
|
|
|
struct inode *inode = d_inode(dentry);
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2018-04-23 21:45:18 +08:00
|
|
|
struct btrfs_inode *binode = BTRFS_I(inode);
|
|
|
|
struct btrfs_root *root = binode->root;
|
2009-04-17 16:37:41 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2019-07-01 23:25:34 +08:00
|
|
|
unsigned int fsflags, old_fsflags;
|
2009-04-17 16:37:41 +08:00
|
|
|
int ret;
|
2019-04-20 19:48:53 +08:00
|
|
|
const char *comp = NULL;
|
2020-07-10 15:49:56 +08:00
|
|
|
u32 binode_flags;
|
2009-04-17 16:37:41 +08:00
|
|
|
|
2010-12-20 16:04:08 +08:00
|
|
|
if (btrfs_root_readonly(root))
|
|
|
|
return -EROFS;
|
|
|
|
|
2021-04-07 20:36:43 +08:00
|
|
|
if (fileattr_has_fsx(fa))
|
|
|
|
return -EOPNOTSUPP;
|
2012-06-12 22:20:32 +08:00
|
|
|
|
2021-04-07 20:36:43 +08:00
|
|
|
fsflags = btrfs_mask_fsflags_for_type(inode, fa->flags);
|
2021-07-01 04:01:48 +08:00
|
|
|
old_fsflags = btrfs_inode_flags_to_fsflags(binode);
|
2020-07-10 15:49:56 +08:00
|
|
|
ret = check_fsflags(old_fsflags, fsflags);
|
|
|
|
if (ret)
|
2021-04-07 20:36:43 +08:00
|
|
|
return ret;
|
2020-07-10 15:49:56 +08:00
|
|
|
|
2020-11-10 19:26:11 +08:00
|
|
|
ret = check_fsflags_compatible(fs_info, fsflags);
|
|
|
|
if (ret)
|
2021-04-07 20:36:43 +08:00
|
|
|
return ret;
|
2020-11-10 19:26:11 +08:00
|
|
|
|
2020-07-10 15:49:56 +08:00
|
|
|
binode_flags = binode->flags;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_SYNC_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_SYNC;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_SYNC;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_IMMUTABLE_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_IMMUTABLE;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_IMMUTABLE;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_APPEND_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_APPEND;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_APPEND;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_NODUMP_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_NODUMP;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_NODUMP;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_NOATIME_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_NOATIME;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_NOATIME;
|
2021-04-07 20:36:43 +08:00
|
|
|
|
|
|
|
/* If coming from FS_IOC_FSSETXATTR then skip unconverted flags */
|
|
|
|
if (!fa->flags_valid) {
|
|
|
|
/* 1 item for the inode */
|
|
|
|
trans = btrfs_start_transaction(root, 1);
|
2021-05-01 00:00:55 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2021-04-07 20:36:43 +08:00
|
|
|
goto update_flags;
|
|
|
|
}
|
|
|
|
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_DIRSYNC_FL)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_DIRSYNC;
|
2009-04-17 16:37:41 +08:00
|
|
|
else
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_DIRSYNC;
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_NOCOW_FL) {
|
2019-04-20 19:48:57 +08:00
|
|
|
if (S_ISREG(inode->i_mode)) {
|
btrfs: allow setting NOCOW for a zero sized file via ioctl
Hi,
the patch is simple, but it has user visible impact and I'm not quite sure how
to resolve it.
In short, $subj says it, chattr -C supports it and we want to use it.
The conditions that actually allow changing the NOCOW flag are clear. What if
I try to set the flag on a file that is not empty? Options:
1) whole ioctl will fail, EINVAL
2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
will stay COW-ed and checksummed
2.2) ioctl will succeed, flag will not be removed and a syslog message will
warn that the COW flag has not been changed
2.2.1) ditto, no syslog message
Man page of chattr states that
"If it is set on a file which already has data blocks, it is undefined when
the blocks assigned to the file will be fully stable."
Yes, it's undefined and with current implementation it'll never happen. So from
this end, the user cannot expect anything. I'm trying to find a reasonable
behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
flags in a deep directory does not fail unnecessarily and does not pollute the
log.
My personal preference is 2.2.1, but my dev's opinion is skewed, not counting
the fact that I know the code and otherwise would look there before consulting
the documentation.
The patch implements 2.2.1.
david
-------------8<-------------------
From: David Sterba <dsterba@suse.cz>
It's safe to turn off checksums for a zero sized file.
http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030
"We cannot switch on NODATASUM for a file that already has extents that
are checksummed. The invariant here is that either all the extents or
none are checksummed.
Theoretically it's possible to add/remove all checksums from a given
file, but it's a potentially longtime operation, the file has to be in
some intermediate state where the checksums partially exist but have to
be ignored (for the csum->nocsum) until the file is fully converted,
this brings more special cases to extent handling, it has to survive
power failure and remain consistent, and probably needs to be restarted
after next mount."
Signed-off-by: David Sterba <dsterba@suse.cz>
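The behaviour chosen in this commit message (option 2.2.1: succeed silently, but only change the flag when it is safe) can be sketched as a pure function. The `EX_` constants and `ex_apply_nocow` are hypothetical names with arbitrary bit values, and the handling of non-regular inodes when clearing the flag is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the internal NODATACOW / NODATASUM bits. */
#define EX_NODATACOW	(1U << 0)
#define EX_NODATASUM	(1U << 1)

static_assert((EX_NODATACOW & EX_NODATASUM) == 0, "flags must be distinct bits");

/*
 * Return the updated internal flags for a NOCOW request.
 * For regular files the flags only change while the file is empty,
 * so a non-empty file keeps its real COW status and no error is raised.
 */
static inline uint32_t ex_apply_nocow(uint32_t flags, bool want_nocow,
				      bool is_regular, uint64_t size)
{
	if (want_nocow) {
		if (is_regular) {
			/* No extents exist yet, so csums can be turned off. */
			if (size == 0)
				flags |= EX_NODATACOW | EX_NODATASUM;
		} else {
			flags |= EX_NODATACOW;
		}
	} else {
		if (is_regular) {
			if (size == 0)
				flags &= ~(EX_NODATACOW | EX_NODATASUM);
		} else {
			/* Assumed symmetric with the set path. */
			flags &= ~EX_NODATACOW;
		}
	}

	return flags;
}
```

This keeps a recursive `chattr +C` over a deep directory from failing on the first non-empty file, at the cost of the request being ignored for such files.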
2012-09-07 19:56:55 +08:00
|
|
|
/*
|
|
|
|
* It's safe to turn csums off here, no extents exist.
|
|
|
|
* Otherwise we want the flag to reflect the real COW
|
|
|
|
* status of the file and will not set it.
|
|
|
|
*/
|
|
|
|
if (inode->i_size == 0)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_NODATACOW |
|
|
|
|
BTRFS_INODE_NODATASUM;
|
2012-09-07 19:56:55 +08:00
|
|
|
} else {
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_NODATACOW;
|
2012-09-07 19:56:55 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
2016-05-20 09:18:45 +08:00
|
|
|
* Revert back under same assumptions as above
|
2012-09-07 19:56:55 +08:00
|
|
|
*/
|
2019-04-20 19:48:57 +08:00
|
|
|
if (S_ISREG(inode->i_mode)) {
|
2012-09-07 19:56:55 +08:00
|
|
|
if (inode->i_size == 0)
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~(BTRFS_INODE_NODATACOW |
|
|
|
|
BTRFS_INODE_NODATASUM);
|
2012-09-07 19:56:55 +08:00
|
|
|
} else {
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_NODATACOW;
|
2012-09-07 19:56:55 +08:00
|
|
|
}
|
|
|
|
}
|
2009-04-17 16:37:41 +08:00
|
|
|
|
Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super block. Before that can happen, we
need to wait until that method is stable enough to be adopted into the super
block. So I list it as a long-term goal, and just store it in RAM today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control the datacow and compression attributes of files and directories.
NOTE:
- The compression type is selected by these rules:
If btrfs is mounted with a compress option, i.e. zlib or lzo, that type is used.
Otherwise, the default compress type is used (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem: when a file is set NOCOW via mount option, that NOCOW setting
was clobbered by inheritance from the parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
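From user space, the generic ioctl mentioned above could be driven roughly as follows. This is a hedged sketch, not part of the patch: `flags_with_compression()` and `set_compression()` are hypothetical helper names, and the flag word is passed as an `int`, which is an assumption based on common practice for `FS_IOC_GETFLAGS`/`FS_IOC_SETFLAGS`.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Pure helper: compute the new flag word for a compression request.
 * The two bits are kept mutually exclusive, since asking for both at once
 * makes no sense.
 */
static int flags_with_compression(int flags, int enable)
{
	if (enable) {
		flags |= FS_COMPR_FL;
		flags &= ~FS_NOCOMP_FL;
	} else {
		flags |= FS_NOCOMP_FL;
		flags &= ~FS_COMPR_FL;
	}
	return flags;
}

/*
 * Read-modify-write the inode flags of @path.
 * Returns 0 on success, -1 on failure with errno set by open() or ioctl().
 */
static int set_compression(const char *path, int enable)
{
	int flags;
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		close(fd);
		return -1;
	}
	flags = flags_with_compression(flags, enable);
	ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
	close(fd);
	return ret;
}
```

Reading the current flags first matters because `FS_IOC_SETFLAGS` replaces the whole flag word; writing only the compression bit would drop any other attributes already set on the file.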
2011-03-22 18:12:20 +08:00
|
|
|
/*
|
|
|
|
* The COMPRESS flag can only be changed by users, while the NOCOMPRESS
|
|
|
|
* flag may be changed automatically if compression code won't make
|
|
|
|
* things smaller.
|
|
|
|
*/
|
2018-04-23 21:45:18 +08:00
|
|
|
if (fsflags & FS_NOCOMP_FL) {
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~BTRFS_INODE_COMPRESS;
|
|
|
|
binode_flags |= BTRFS_INODE_NOCOMPRESS;
|
2018-04-23 21:45:18 +08:00
|
|
|
} else if (fsflags & FS_COMPR_FL) {
|
Btrfs: add support for inode properties
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64 KiB each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64 KiB, and report the time it took.
Then unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated with it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)

                     file creation time   ls -lha time
       10 000 files                3.49           0.76
      100 000 files               47.19           8.37
    1 000 000 files              518.51         107.06

* With 1 property (compression property set to lzo - test 2)

                     file creation time   ls -lha time
       10 000 files                3.63           0.93
      100 000 files               48.56           9.74
    1 000 000 files              537.72         125.11

* With 4 properties (test 3)

                     file creation time   ls -lha time
       10 000 files                3.94           1.20
      100 000 files               52.14          11.48
    1 000 000 files              572.70         142.13

* With 10 properties (test 4)

                     file creation time   ls -lha time
       10 000 files                4.61           1.35
      100 000 files               58.86          13.83
    1 000 000 files              656.01         177.61
The increased latencies with properties are essentially because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir index items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w

use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');

system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);

$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
    my $p = TEST_DIR . '/file_' . $i;
    open(my $f, '>', $p) or die "Error opening file!";
    $f->autoflush(1);
    for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
        print $f ('A' x 4096) or die "Error writing to file!";
    }
    close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-07 19:47:46 +08:00
|
|
|
|
2021-04-07 20:36:43 +08:00
|
|
|
if (IS_SWAPFILE(inode))
|
|
|
|
return -ETXTBSY;
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags |= BTRFS_INODE_COMPRESS;
|
|
|
|
binode_flags &= ~BTRFS_INODE_NOCOMPRESS;
|
2014-01-07 19:47:46 +08:00
|
|
|
|
2017-11-01 00:32:41 +08:00
|
|
|
comp = btrfs_compress_type2str(fs_info->compress_type);
|
|
|
|
if (!comp || comp[0] == 0)
|
|
|
|
comp = btrfs_compress_type2str(BTRFS_COMPRESS_ZLIB);
|
2011-04-15 11:03:17 +08:00
|
|
|
} else {
|
2019-04-20 19:48:55 +08:00
|
|
|
binode_flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
|
2011-03-22 18:12:20 +08:00
|
|
|
}
|
2009-04-17 16:37:41 +08:00
|
|
|
|
2019-04-20 19:48:53 +08:00
|
|
|
/*
|
|
|
|
* 1 for inode item
|
|
|
|
* 2 for properties
|
|
|
|
*/
|
|
|
|
trans = btrfs_start_transaction(root, 3);
|
2021-04-07 20:36:43 +08:00
|
|
|
if (IS_ERR(trans))
|
|
|
|
return PTR_ERR(trans);
|
2009-04-17 16:37:41 +08:00
|
|
|
|
2019-04-20 19:48:53 +08:00
|
|
|
if (comp) {
|
|
|
|
ret = btrfs_set_prop(trans, inode, "btrfs.compression", comp,
|
|
|
|
strlen(comp), 0);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out_end_trans;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ret = btrfs_set_prop(trans, inode, "btrfs.compression", NULL,
|
|
|
|
0, 0);
|
|
|
|
if (ret && ret != -ENODATA) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
goto out_end_trans;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-04-07 20:36:43 +08:00
|
|
|
update_flags:
|
2019-04-20 19:48:55 +08:00
|
|
|
binode->flags = binode_flags;
|
2018-03-27 00:40:21 +08:00
|
|
|
btrfs_sync_inode_flags_to_i_flags(inode);
|
2012-04-06 03:03:02 +08:00
|
|
|
inode_inc_iversion(inode);
|
2016-09-14 22:48:06 +08:00
|
|
|
inode->i_ctime = current_time(inode);
|
2020-11-02 22:48:59 +08:00
|
|
|
ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
|
2009-04-17 16:37:41 +08:00
|
|
|
|
2019-04-20 19:48:53 +08:00
|
|
|
out_end_trans:
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2011-02-24 17:38:16 +08:00
|
|
|
return ret;
|
2009-04-17 16:37:41 +08:00
|
|
|
}
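The compression branch of the function above is easier to follow in isolation. The sketch below is a user-space model only: the `MODEL_*` constants, `model_apply_compression()` and its parameters are hypothetical, and the string `"zlib"` stands in for whatever `btrfs_compress_type2str(BTRFS_COMPRESS_ZLIB)` returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the request bits and the inode bits. */
#define MODEL_FS_COMPR         (1u << 0)
#define MODEL_FS_NOCOMP        (1u << 1)
#define MODEL_INODE_COMPRESS   (1u << 0)
#define MODEL_INODE_NOCOMPRESS (1u << 1)

/*
 * Map a flag request onto inode flags and a compression property value.
 * On success *prop is the algorithm name to store, or NULL when the
 * property should be removed. On error nothing is modified.
 */
static int model_apply_compression(uint32_t fsflags, int is_swapfile,
				   const char *mount_type,
				   uint32_t *inode_flags, const char **prop)
{
	uint32_t flags = *inode_flags;
	const char *comp = NULL;

	if (fsflags & MODEL_FS_NOCOMP) {
		/* Explicit "never compress": no property value is kept. */
		flags &= ~MODEL_INODE_COMPRESS;
		flags |= MODEL_INODE_NOCOMPRESS;
	} else if (fsflags & MODEL_FS_COMPR) {
		/* An active swap file must not have its extents rewritten. */
		if (is_swapfile)
			return -ETXTBSY;
		flags |= MODEL_INODE_COMPRESS;
		flags &= ~MODEL_INODE_NOCOMPRESS;
		/* Use the mount-wide algorithm, falling back to zlib. */
		comp = mount_type;
		if (!comp || comp[0] == '\0')
			comp = "zlib";
	} else {
		/* Neither bit requested: clear both. */
		flags &= ~(MODEL_INODE_COMPRESS | MODEL_INODE_NOCOMPRESS);
	}

	*inode_flags = flags;
	*prop = comp;
	return 0;
}
```

In the function above the real flag word is committed only after the property update has succeeded inside a transaction; the model mirrors that by writing its outputs only on the success path.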
|
|
|
|
|
2021-05-14 23:42:30 +08:00
|
|
|
/*
|
|
|
|
* Start exclusive operation @type, return true on success
|
|
|
|
*/
|
2020-08-25 23:02:32 +08:00
|
|
|
bool btrfs_exclop_start(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation type)
|
|
|
|
{
|
2021-05-14 23:42:30 +08:00
|
|
|
bool ret = false;
|
|
|
|
|
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
if (fs_info->exclusive_operation == BTRFS_EXCLOP_NONE) {
|
|
|
|
fs_info->exclusive_operation = type;
|
|
|
|
ret = true;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
|
|
|
|
|
|
|
return ret;
|
2020-08-25 23:02:32 +08:00
|
|
|
}
|
|
|
|
|
2021-05-19 03:05:52 +08:00
|
|
|
/*
|
|
|
|
* Conditionally allow entering the exclusive operation in case it's compatible
|
|
|
|
* with the running one. This must be paired with btrfs_exclop_start_unlock and
|
|
|
|
* btrfs_exclop_finish.
|
|
|
|
*
|
|
|
|
* Compatibility:
|
|
|
|
* - the same type is already running
|
2021-11-25 17:14:42 +08:00
|
|
|
* - when trying to add a device and balance has been paused
|
2021-05-19 03:05:52 +08:00
|
|
|
* - not BTRFS_EXCLOP_NONE - this is intentionally incompatible and the caller
|
|
|
|
* must check the condition first that would allow none -> @type
|
|
|
|
*/
|
|
|
|
bool btrfs_exclop_start_try_lock(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation type)
|
|
|
|
{
|
|
|
|
spin_lock(&fs_info->super_lock);
|
2021-11-25 17:14:42 +08:00
|
|
|
if (fs_info->exclusive_operation == type ||
|
|
|
|
(fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED &&
|
|
|
|
type == BTRFS_EXCLOP_DEV_ADD))
|
2021-05-19 03:05:52 +08:00
|
|
|
return true;
|
|
|
|
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_exclop_start_unlock(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
|
|
|
}
|
|
|
|
|
2020-08-25 23:02:32 +08:00
|
|
|
void btrfs_exclop_finish(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
2021-05-14 23:42:30 +08:00
|
|
|
spin_lock(&fs_info->super_lock);
|
2020-08-25 23:02:32 +08:00
|
|
|
WRITE_ONCE(fs_info->exclusive_operation, BTRFS_EXCLOP_NONE);
|
2021-05-14 23:42:30 +08:00
|
|
|
spin_unlock(&fs_info->super_lock);
|
2020-08-25 23:02:33 +08:00
|
|
|
sysfs_notify(&fs_info->fs_devices->fsid_kobj, NULL, "exclusive_operation");
|
2020-08-25 23:02:32 +08:00
|
|
|
}
|
|
|
|
|
2021-11-25 17:14:41 +08:00
|
|
|
void btrfs_exclop_balance(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation op)
|
|
|
|
{
|
|
|
|
switch (op) {
|
|
|
|
case BTRFS_EXCLOP_BALANCE_PAUSED:
|
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
ASSERT(fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
|
|
|
|
fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD);
|
|
|
|
fs_info->exclusive_operation = BTRFS_EXCLOP_BALANCE_PAUSED;
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
|
|
|
break;
|
|
|
|
case BTRFS_EXCLOP_BALANCE:
|
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
ASSERT(fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED);
|
|
|
|
fs_info->exclusive_operation = BTRFS_EXCLOP_BALANCE;
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"invalid exclop balance operation %d requested", op);
|
|
|
|
}
|
|
|
|
}
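The exclusive-operation helpers above amount to a small state machine: one slot that is either empty or owned by a single operation. A minimal user-space sketch follows. It deliberately swaps the spinlock for a C11 compare-and-swap, so it is a different technique from the code above, and every `model_*` name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical subset of the operation types used in this sketch. */
enum model_exclop {
	MODEL_EXCLOP_NONE,
	MODEL_EXCLOP_BALANCE,
	MODEL_EXCLOP_BALANCE_PAUSED,
	MODEL_EXCLOP_DEV_ADD,
};

/* The single slot recording which exclusive operation is running. */
static _Atomic int model_current_op = MODEL_EXCLOP_NONE;

/* Claim the slot for @type; succeeds only if nothing else is running. */
static bool model_exclop_start(enum model_exclop type)
{
	int expected = MODEL_EXCLOP_NONE;

	return atomic_compare_exchange_strong(&model_current_op, &expected,
					      (int)type);
}

/* Release the slot. */
static void model_exclop_finish(void)
{
	atomic_store(&model_current_op, MODEL_EXCLOP_NONE);
}

/*
 * Snapshot version of the compatibility rule: the same type is already
 * running, or a device add is requested while balance is paused.
 */
static bool model_exclop_compatible(enum model_exclop type)
{
	int cur = atomic_load(&model_current_op);

	return cur == (int)type ||
	       (cur == MODEL_EXCLOP_BALANCE_PAUSED &&
		type == MODEL_EXCLOP_DEV_ADD);
}
```

Unlike `btrfs_exclop_start_try_lock()`, which returns with the lock still held so the state cannot change until `btrfs_exclop_start_unlock()`, the compatibility check in this model is only a snapshot and gives no such guarantee.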
|
|
|
|
|
2022-01-05 16:30:06 +08:00
|
|
|
static int btrfs_ioctl_getversion(struct inode *inode, int __user *arg)
|
2009-04-17 16:37:41 +08:00
|
|
|
{
|
|
|
|
return put_user(inode->i_generation, arg);
|
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2019-10-11 08:23:11 +08:00
|
|
|
static noinline int btrfs_ioctl_fitrim(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2011-03-24 18:24:28 +08:00
|
|
|
{
|
|
|
|
struct btrfs_device *device;
|
|
|
|
struct fstrim_range range;
|
|
|
|
u64 minlen = ULLONG_MAX;
|
|
|
|
u64 num_devices = 0;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2021-02-04 18:21:46 +08:00
|
|
|
/*
|
|
|
|
* btrfs_trim_block_group() depends on space cache, which is not
|
|
|
|
* available in zoned filesystem. So, disallow fitrim on a zoned
|
|
|
|
* filesystem for now.
|
|
|
|
*/
|
|
|
|
if (btrfs_is_zoned(fs_info))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2019-03-26 18:49:56 +08:00
|
|
|
/*
|
|
|
|
* If the fs is mounted with nologreplay, which requires it to be
|
|
|
|
* mounted in RO mode as well, we can not allow discard on free space
|
|
|
|
* inside block groups, because log trees refer to extents that are not
|
|
|
|
* pinned in a block group's free space cache (pinning the extents is
|
|
|
|
* precisely the first phase of replaying a log tree).
|
|
|
|
*/
|
|
|
|
if (btrfs_test_opt(fs_info, NOLOGREPLAY))
|
|
|
|
return -EROFS;
|
|
|
|
|
2011-04-20 18:09:16 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(device, &fs_info->fs_devices->devices,
|
|
|
|
dev_list) {
|
2022-04-15 12:52:56 +08:00
|
|
|
if (!device->bdev || !bdev_max_discard_sectors(device->bdev))
|
2011-03-24 18:24:28 +08:00
|
|
|
continue;
|
2022-04-15 12:52:56 +08:00
|
|
|
num_devices++;
|
|
|
|
minlen = min_t(u64, bdev_discard_granularity(device->bdev),
|
|
|
|
minlen);
|
2011-03-24 18:24:28 +08:00
|
|
|
}
|
2011-04-20 18:09:16 +08:00
|
|
|
rcu_read_unlock();
|
2011-09-05 22:34:54 +08:00
|
|
|
|
2011-03-24 18:24:28 +08:00
|
|
|
if (!num_devices)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
if (copy_from_user(&range, arg, sizeof(range)))
|
|
|
|
return -EFAULT;
|
btrfs: Ensure btrfs_trim_fs can trim the whole filesystem
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.
[CAUSE]
Before fstrim_range is passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes). So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).
For btrfs, however, any bytenr aligned to sectorsize is valid: since btrfs
uses its own logical address space, nothing limits the location
where block groups are placed.
On a filesystem that is balanced frequently, it is quite easy for all
block groups to be relocated so that their bytenr starts beyond
super->total_bytes.
In that case, btrfs will not trim existing block groups.
[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
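A caller that wants the whole filesystem trimmed would pass the untruncated range described above. The following is a rough user-space sketch under stated assumptions: `whole_fs_range()` and `trim_mountpoint()` are hypothetical helpers, and the call needs CAP_SYS_ADMIN plus a device that supports discard to succeed.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Build a request covering the whole logical address space. */
static struct fstrim_range whole_fs_range(uint64_t minlen)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,
		.minlen = minlen,
	};

	return range;
}

/*
 * Issue FITRIM on the filesystem mounted at @mnt.
 * Returns 0 on success and stores the number of trimmed bytes in
 * *trimmed (if non-NULL); returns -1 on failure with errno set.
 */
static int trim_mountpoint(const char *mnt, uint64_t *trimmed)
{
	struct fstrim_range range = whole_fs_range(0);
	int fd = open(mnt, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, FITRIM, &range);
	/* On success the kernel writes the trimmed byte count into len. */
	if (ret == 0 && trimmed)
		*trimmed = range.len;
	close(fd);
	return ret;
}
```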
2018-09-07 14:16:24 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE: Don't truncate the range using super->total_bytes. Bytenr of
|
|
|
|
* block group is in the logical address space, which can be any
|
|
|
|
* sectorsize aligned bytenr in the range [0, U64_MAX].
|
|
|
|
*/
|
|
|
|
if (range.len < fs_info->sb->s_blocksize)
|
2011-09-05 22:34:54 +08:00
|
|
|
return -EINVAL;
|
2011-03-24 18:24:28 +08:00
|
|
|
|
|
|
|
range.minlen = max(range.minlen, minlen);
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_trim_fs(fs_info, &range);
|
2011-03-24 18:24:28 +08:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (copy_to_user(arg, &range, sizeof(range)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
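The minimum-length handling in the function above boils down to a min over the devices followed by a max with the caller's value. A small stand-alone sketch of that arithmetic, where `struct model_dev` and `model_trim_minlen()` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device view used only in this sketch. */
struct model_dev {
	int has_bdev;                  /* device is present */
	uint64_t max_discard_sectors;  /* 0 means discard is unsupported */
	uint64_t discard_granularity;  /* in bytes */
};

/*
 * Count the devices that can discard and compute the effective minimum
 * extent length: the smallest discard granularity among those devices,
 * raised to the caller's minlen if that is larger. *minlen_out is written
 * only when at least one usable device exists.
 */
static size_t model_trim_minlen(const struct model_dev *devs, size_t n,
				uint64_t user_minlen, uint64_t *minlen_out)
{
	uint64_t minlen = UINT64_MAX;
	size_t usable = 0;

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].has_bdev || devs[i].max_discard_sectors == 0)
			continue;
		usable++;
		if (devs[i].discard_granularity < minlen)
			minlen = devs[i].discard_granularity;
	}
	if (usable)
		*minlen_out = user_minlen > minlen ? user_minlen : minlen;
	return usable;
}
```

A return value of zero corresponds to the `-EOPNOTSUPP` case above, where no device in the filesystem supports discard.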
|
|
|
|
|
2019-10-02 01:57:39 +08:00
|
|
|
int __pure btrfs_is_empty_uuid(u8 *uuid)
|
2013-08-15 23:11:20 +08:00
|
|
|
{
|
2013-11-15 19:14:55 +08:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < BTRFS_UUID_SIZE; i++) {
|
|
|
|
if (uuid[i])
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
2013-08-15 23:11:20 +08:00
|
|
|
}
|
|
|
|
|
2022-03-15 09:12:34 +08:00
|
|
|
/*
|
|
|
|
* Calculate the number of transaction items to reserve for creating a subvolume
|
|
|
|
* or snapshot, not including the inode, directory entries, or parent directory.
|
|
|
|
*/
|
|
|
|
static unsigned int create_subvol_num_items(struct btrfs_qgroup_inherit *inherit)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* 1 to add root block
|
|
|
|
* 1 to add root item
|
|
|
|
* 1 to add root ref
|
|
|
|
* 1 to add root backref
|
|
|
|
* 1 to add UUID item
|
|
|
|
* 1 to add qgroup info
|
|
|
|
* 1 to add qgroup limit
|
|
|
|
*
|
|
|
|
* Ideally the last two would only be accounted if qgroups are enabled,
|
|
|
|
* but that can change between now and the time we would insert them.
|
|
|
|
*/
|
|
|
|
unsigned int num_items = 7;
|
|
|
|
|
|
|
|
if (inherit) {
|
|
|
|
/* 2 to add qgroup relations for each inherited qgroup */
|
|
|
|
num_items += 2 * inherit->num_qgroups;
|
|
|
|
}
|
|
|
|
return num_items;
|
|
|
|
}
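As a worked example of the reservation arithmetic above: a plain subvolume reserves 7 items, and inheriting 3 qgroups adds 2 * 3 = 6 for a total of 13. The helper below is a hypothetical user-space restatement, with `has_inherit` standing in for a non-NULL `inherit` pointer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Restatement of the item count: 7 fixed items (root block, root item,
 * root ref, root backref, UUID item, qgroup info, qgroup limit) plus
 * 2 qgroup relation items per inherited qgroup.
 */
static unsigned int model_subvol_items(int has_inherit, uint64_t num_qgroups)
{
	unsigned int num_items = 7;

	if (has_inherit)
		num_items += 2 * (unsigned int)num_qgroups;
	return num_items;
}
```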
|
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
static noinline int create_subvol(struct user_namespace *mnt_userns,
|
|
|
|
struct inode *dir, struct dentry *dentry,
|
2013-02-07 14:02:44 +08:00
|
|
|
struct btrfs_qgroup_inherit *inherit)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb);
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_key key;
|
2016-03-25 00:49:22 +08:00
|
|
|
struct btrfs_root_item *root_item;
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_inode_item *inode_item;
|
|
|
|
struct extent_buffer *leaf;
|
2013-02-28 18:04:33 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(dir)->root;
|
2009-09-22 04:00:26 +08:00
|
|
|
struct btrfs_root *new_root;
|
2013-02-28 18:04:33 +08:00
|
|
|
struct btrfs_block_rsv block_rsv;
|
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 10:36:02 +08:00
|
|
|
struct timespec64 cur_time = current_time(dir);
|
2022-03-15 09:12:34 +08:00
|
|
|
struct btrfs_new_inode_args new_inode_args = {
|
|
|
|
.dir = dir,
|
|
|
|
.dentry = dentry,
|
|
|
|
.subvol = true,
|
|
|
|
};
|
|
|
|
unsigned int trans_num_items;
|
2008-06-12 09:53:53 +08:00
|
|
|
int ret;
|
2022-03-10 09:31:33 +08:00
|
|
|
dev_t anon_dev;
|
2008-06-12 09:53:53 +08:00
|
|
|
u64 objectid;
|
|
|
|
|
2016-03-25 00:49:22 +08:00
|
|
|
root_item = kzalloc(sizeof(*root_item), GFP_KERNEL);
|
|
|
|
if (!root_item)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2020-12-07 23:32:33 +08:00
|
|
|
ret = btrfs_get_free_objectid(fs_info->tree_root, &objectid);
|
2011-07-17 09:38:06 +08:00
|
|
|
if (ret)
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out_root_item;
|
2020-06-16 10:17:36 +08:00
|
|
|
|
2015-02-27 16:24:23 +08:00
|
|
|
/*
|
|
|
|
* Don't create a subvolume whose level is not zero, or qgroup will be
|
2016-05-20 09:18:45 +08:00
|
|
|
* screwed up since it assumes subvolume qgroup's level to be 0.
|
2015-02-27 16:24:23 +08:00
|
|
|
*/
|
2016-03-25 00:49:22 +08:00
|
|
|
if (btrfs_qgroup_level(objectid)) {
|
|
|
|
ret = -ENOSPC;
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out_root_item;
|
2016-03-25 00:49:22 +08:00
|
|
|
}
|
2015-02-27 16:24:23 +08:00
|
|
|
|
2022-03-10 09:31:33 +08:00
|
|
|
ret = get_anon_bdev(&anon_dev);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_root_item;
|
|
|
|
|
2022-03-15 09:12:34 +08:00
|
|
|
new_inode_args.inode = btrfs_new_subvol_inode(mnt_userns, dir);
|
|
|
|
if (!new_inode_args.inode) {
|
2022-03-15 09:12:32 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out_anon_dev;
|
|
|
|
}
|
2022-03-15 09:12:34 +08:00
|
|
|
ret = btrfs_new_inode_prepare(&new_inode_args, &trans_num_items);
|
|
|
|
if (ret)
|
|
|
|
goto out_inode;
|
|
|
|
trans_num_items += create_subvol_num_items(inherit);
|
2022-03-15 09:12:32 +08:00
|
|
|
|
2013-02-28 18:04:33 +08:00
|
|
|
btrfs_init_block_rsv(&block_rsv, BTRFS_BLOCK_RSV_TEMP);
|
2022-03-15 09:12:34 +08:00
|
|
|
ret = btrfs_subvolume_reserve_metadata(root, &block_rsv,
|
|
|
|
trans_num_items, false);
|
2013-02-28 18:04:33 +08:00
|
|
|
if (ret)
|
2022-03-15 09:12:34 +08:00
|
|
|
goto out_new_inode_args;
|
2013-02-28 18:04:33 +08:00
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
And with the following metadata leak:
BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents
[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
It's pretty obvious that we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but don't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().
This leads to the leakage.
[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.
And free the qgroup reserved metadata space when releasing the
block_rsv.
To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-24 14:46:10 +08:00
|
|
|
btrfs_subvolume_release_metadata(root, &block_rsv);
|
2022-03-15 09:12:34 +08:00
|
|
|
goto out_new_inode_args;
|
2013-02-28 18:04:33 +08:00
|
|
|
}
|
|
|
|
trans->block_rsv = &block_rsv;
|
|
|
|
trans->bytes_reserved = block_rsv.size;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2018-07-18 14:45:41 +08:00
|
|
|
ret = btrfs_qgroup_inherit(trans, 0, objectid, inherit);
|
2011-09-14 21:58:21 +08:00
|
|
|
if (ret)
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2011-09-14 21:58:21 +08:00
|
|
|
|
2020-08-20 23:46:03 +08:00
|
|
|
leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
|
|
|
|
BTRFS_NESTING_NORMAL);
|
2008-07-25 00:17:14 +08:00
|
|
|
if (IS_ERR(leaf)) {
|
|
|
|
ret = PTR_ERR(leaf);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2008-07-25 00:17:14 +08:00
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
|
|
|
|
2016-03-25 00:49:22 +08:00
|
|
|
inode_item = &root_item->inode;
|
2013-07-16 11:19:18 +08:00
|
|
|
btrfs_set_stack_inode_generation(inode_item, 1);
|
|
|
|
btrfs_set_stack_inode_size(inode_item, 3);
|
|
|
|
btrfs_set_stack_inode_nlink(inode_item, 1);
|
2016-06-15 21:22:56 +08:00
|
|
|
btrfs_set_stack_inode_nbytes(inode_item,
|
2016-06-23 06:54:23 +08:00
|
|
|
fs_info->nodesize);
|
2013-07-16 11:19:18 +08:00
|
|
|
btrfs_set_stack_inode_mode(inode_item, S_IFDIR | 0755);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2016-03-25 00:49:22 +08:00
|
|
|
btrfs_set_root_flags(root_item, 0);
|
|
|
|
btrfs_set_root_limit(root_item, 0);
|
2013-07-16 11:19:18 +08:00
|
|
|
btrfs_set_stack_inode_flags(inode_item, BTRFS_INODE_ROOT_ITEM_INIT);
|
2011-03-28 10:01:25 +08:00
|
|
|
|
2016-03-25 00:49:22 +08:00
|
|
|
btrfs_set_root_bytenr(root_item, leaf->start);
|
|
|
|
btrfs_set_root_generation(root_item, trans->transid);
|
|
|
|
btrfs_set_root_level(root_item, 0);
|
|
|
|
btrfs_set_root_refs(root_item, 1);
|
|
|
|
btrfs_set_root_used(root_item, leaf->len);
|
|
|
|
btrfs_set_root_last_snapshot(root_item, 0);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2016-03-25 00:49:22 +08:00
|
|
|
btrfs_set_root_generation_v2(root_item,
|
|
|
|
btrfs_root_generation(root_item));
|
2020-02-24 23:37:51 +08:00
|
|
|
generate_random_guid(root_item->uuid);
|
2016-03-25 00:49:22 +08:00
|
|
|
btrfs_set_stack_timespec_sec(&root_item->otime, cur_time.tv_sec);
|
|
|
|
btrfs_set_stack_timespec_nsec(&root_item->otime, cur_time.tv_nsec);
|
|
|
|
root_item->ctime = root_item->otime;
|
|
|
|
btrfs_set_root_ctransid(root_item, trans->transid);
|
|
|
|
btrfs_set_root_otransid(root_item, trans->transid);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2008-06-26 04:01:30 +08:00
|
|
|
btrfs_tree_unlock(leaf);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2020-12-07 23:32:37 +08:00
|
|
|
btrfs_set_root_dirid(root_item, BTRFS_FIRST_FREE_OBJECTID);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
key.objectid = objectid;
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 22:45:14 +08:00
|
|
|
key.offset = 0;
|
2014-06-05 00:41:45 +08:00
|
|
|
key.type = BTRFS_ROOT_ITEM_KEY;
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_insert_root(trans, fs_info->tree_root, &key,
|
2016-03-25 00:49:22 +08:00
|
|
|
root_item);
|
2021-04-20 17:55:12 +08:00
|
|
|
if (ret) {
|
|
|
|
/*
|
|
|
|
* Since we don't abort the transaction in this case, free the
|
|
|
|
* tree block so that we don't leak space and leave the
|
|
|
|
* filesystem in an inconsistent state (an extent item in the
|
2021-12-13 16:45:12 +08:00
|
|
|
* extent tree with a backreference for a root that does not
|
2021-12-13 16:45:13 +08:00
|
|
|
* exist).
|
2021-04-20 17:55:12 +08:00
|
|
|
*/
|
2021-12-13 16:45:13 +08:00
|
|
|
btrfs_tree_lock(leaf);
|
|
|
|
btrfs_clean_tree_block(leaf);
|
|
|
|
btrfs_tree_unlock(leaf);
|
2021-12-13 16:45:12 +08:00
|
|
|
btrfs_free_tree_block(trans, objectid, leaf, 0, 1);
|
2021-04-20 17:55:12 +08:00
|
|
|
free_extent_buffer(leaf);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2021-04-20 17:55:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
free_extent_buffer(leaf);
|
|
|
|
leaf = NULL;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2020-06-16 10:17:36 +08:00
|
|
|
new_root = btrfs_get_new_fs_root(fs_info, objectid, anon_dev);
|
2012-03-12 23:03:00 +08:00
|
|
|
if (IS_ERR(new_root)) {
|
|
|
|
ret = PTR_ERR(new_root);
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
2022-03-10 09:31:33 +08:00
|
|
|
/* anon_dev is owned by new_root now. */
|
2020-06-16 10:17:36 +08:00
|
|
|
anon_dev = 0;
|
2022-03-15 09:12:34 +08:00
|
|
|
BTRFS_I(new_inode_args.inode)->root = new_root;
|
|
|
|
/* ... and new_root is owned by new_inode_args.inode now. */
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2021-03-13 04:25:06 +08:00
|
|
|
ret = btrfs_record_root_in_trans(trans, new_root);
|
|
|
|
if (ret) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2012-03-12 23:03:00 +08:00
|
|
|
}
|
2008-11-18 09:37:39 +08:00
|
|
|
|
btrfs: move common inode creation code into btrfs_create_new_inode()
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:
1. It reduces code duplication.
2. It allows us to set up the inode completely before inserting the
inode item, removing calls to btrfs_update_inode().
3. It fixes a leak of an inode on disk in some error cases. For example,
in btrfs_create(), if btrfs_new_inode() succeeds, then we have
inserted an inode item and its inode ref. However, if something after
that fails (e.g., btrfs_init_inode_security()), then we end the
transaction and then decrement the link count on the inode. If the
transaction is committed and the system crashes before the failed
inode is deleted, then we leak that inode on disk. Instead, this
refactoring aborts the transaction when we can't recover more
gracefully.
4. It exposes various ways that subvolume creation diverges from mkdir
in terms of inheriting flags, properties, permissions, and POSIX
ACLs, a lot of which appears to be accidental. This patch explicitly
does _not_ change the existing non-standard behavior, but it makes
those differences more clear in the code and documents them so that
we can discuss whether they should be changed.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-15 09:12:35 +08:00
|
|
|
ret = btrfs_uuid_tree_add(trans, root_item->uuid,
|
|
|
|
BTRFS_UUID_KEY_SUBVOL, objectid);
|
2019-12-06 22:37:15 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2019-12-06 22:37:15 +08:00
|
|
|
}
|
2009-01-06 04:43:43 +08:00
|
|
|
|
2022-03-15 09:12:35 +08:00
|
|
|
ret = btrfs_create_new_inode(trans, &new_inode_args);
|
2019-12-06 22:37:15 +08:00
|
|
|
if (ret) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
2022-03-10 09:31:33 +08:00
|
|
|
goto out;
|
2019-12-06 22:37:15 +08:00
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2022-03-15 09:12:35 +08:00
|
|
|
d_instantiate_new(dentry, new_inode_args.inode);
|
|
|
|
new_inode_args.inode = NULL;
|
2013-08-15 23:11:20 +08:00
|
|
|
|
2022-03-10 09:31:33 +08:00
|
|
|
out:
|
2013-02-28 18:04:33 +08:00
|
|
|
trans->block_rsv = NULL;
|
|
|
|
trans->bytes_reserved = 0;
|
2020-07-24 14:46:10 +08:00
|
|
|
btrfs_subvolume_release_metadata(root, &block_rsv);
|
2014-01-09 14:57:06 +08:00
|
|
|
|
2021-12-13 16:45:14 +08:00
|
|
|
if (ret)
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
else
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
2022-03-15 09:12:34 +08:00
|
|
|
out_new_inode_args:
|
|
|
|
btrfs_new_inode_args_destroy(&new_inode_args);
|
2022-03-15 09:12:32 +08:00
|
|
|
out_inode:
|
2022-03-15 09:12:34 +08:00
|
|
|
iput(new_inode_args.inode);
|
2022-03-10 09:31:33 +08:00
|
|
|
out_anon_dev:
|
2020-06-16 10:17:36 +08:00
|
|
|
if (anon_dev)
|
|
|
|
free_anon_bdev(anon_dev);
|
2022-03-10 09:31:33 +08:00
|
|
|
out_root_item:
|
2016-03-25 00:49:22 +08:00
|
|
|
kfree(root_item);
|
|
|
|
return ret;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
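The error handling in create_subvol() above unwinds through stacked labels (out_new_inode_args, out_inode, out_anon_dev, out_root_item) and clears anon_dev once new_root owns it, so that the shared cleanup path does not free it a second time. A minimal userspace sketch of that pattern follows. Every name in it (create_sketch, enum fail_stage, the value 42, the dev_released flag) is hypothetical and exists only for illustration; it does not call any kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical points at which the sketch can be told to fail. */
enum fail_stage {
	FAIL_NONE,
	FAIL_BEFORE_DEV,
	FAIL_BEFORE_HANDOFF,
	FAIL_AFTER_HANDOFF,
};

/*
 * Returns 0 or a negative errno. *dev_released is set to 1 only when the
 * unwind path itself had to release the device id, i.e. when ownership
 * had not been handed over yet.
 */
static int create_sketch(enum fail_stage stage, int *dev_released)
{
	int *item;
	int dev = 0;
	int ret = 0;

	*dev_released = 0;

	item = calloc(1, sizeof(*item));
	if (!item)
		return -ENOMEM;

	if (stage == FAIL_BEFORE_DEV) {
		ret = -ENOSPC;
		goto out_item;
	}

	dev = 42;	/* stands in for a successfully acquired device id */

	if (stage == FAIL_BEFORE_HANDOFF) {
		ret = -EIO;
		goto out_dev;
	}

	/* Ownership moves elsewhere; clear the local so unwinding skips it. */
	dev = 0;

	if (stage == FAIL_AFTER_HANDOFF)
		ret = -EIO;

out_dev:
	if (dev)
		*dev_released = 1;	/* stands in for releasing the id */
out_item:
	free(item);
	return ret;
}
```

In this sketch, each label releases exactly what was acquired before the corresponding jump. The success path falls through the same labels, which is why the local is zeroed after the handoff rather than guarded by a separate flag.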
|
|
|
|
|
2013-02-28 18:01:15 +08:00
|
|
|
static int create_snapshot(struct btrfs_root *root, struct inode *dir,
|
2020-03-13 23:23:20 +08:00
|
|
|
struct dentry *dentry, bool readonly,
|
2013-02-28 18:01:15 +08:00
|
|
|
struct btrfs_qgroup_inherit *inherit)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb);
|
2009-11-12 17:37:02 +08:00
|
|
|
struct inode *inode;
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_pending_snapshot *pending_snapshot;
|
2022-03-15 09:12:34 +08:00
|
|
|
unsigned int trans_num_items;
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
2009-11-12 17:37:02 +08:00
|
|
|
int ret;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2021-12-16 04:40:03 +08:00
|
|
|
/* Snapshotting is not supported with extent tree v2 yet. */
|
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"extent tree v2 doesn't support snapshotting yet");
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
2020-05-15 14:01:40 +08:00
|
|
|
if (!test_bit(BTRFS_ROOT_SHAREABLE, &root->state))
|
2008-06-12 09:53:53 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (atomic_read(&root->nr_swapfiles)) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"cannot snapshot subvolume with active swapfile");
|
|
|
|
return -ETXTBSY;
|
|
|
|
}
|
|
|
|
|
2017-02-13 18:03:44 +08:00
|
|
|
pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL);
|
2015-11-11 01:53:56 +08:00
|
|
|
if (!pending_snapshot)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2020-06-16 10:17:36 +08:00
|
|
|
ret = get_anon_bdev(&pending_snapshot->anon_dev);
|
|
|
|
if (ret < 0)
|
|
|
|
goto free_pending;
|
2015-11-11 01:54:00 +08:00
|
|
|
pending_snapshot->root_item = kzalloc(sizeof(struct btrfs_root_item),
|
2017-02-13 18:03:44 +08:00
|
|
|
GFP_KERNEL);
|
2015-11-11 01:54:03 +08:00
|
|
|
pending_snapshot->path = btrfs_alloc_path();
|
|
|
|
if (!pending_snapshot->root_item || !pending_snapshot->path) {
|
2015-11-11 01:54:00 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto free_pending;
|
|
|
|
}
|
|
|
|
|
2012-09-06 18:02:28 +08:00
|
|
|
btrfs_init_block_rsv(&pending_snapshot->block_rsv,
|
|
|
|
BTRFS_BLOCK_RSV_TEMP);
|
2013-02-28 18:04:33 +08:00
|
|
|
/*
|
2022-03-15 09:12:34 +08:00
|
|
|
* 1 to add dir item
|
|
|
|
* 1 to add dir index
|
|
|
|
* 1 to update parent inode item
|
2013-02-28 18:04:33 +08:00
|
|
|
*/
|
2022-03-15 09:12:34 +08:00
|
|
|
trans_num_items = create_subvol_num_items(inherit) + 3;
|
2013-02-28 18:04:33 +08:00
|
|
|
ret = btrfs_subvolume_reserve_metadata(BTRFS_I(dir)->root,
|
2022-03-15 09:12:34 +08:00
|
|
|
&pending_snapshot->block_rsv,
|
|
|
|
trans_num_items, false);
|
2013-02-28 18:04:33 +08:00
|
|
|
if (ret)
|
2020-05-14 17:19:18 +08:00
|
|
|
goto free_pending;
|
2013-02-28 18:04:33 +08:00
|
|
|
|
2008-11-18 10:02:50 +08:00
|
|
|
pending_snapshot->dentry = dentry;
|
2008-06-12 09:53:53 +08:00
|
|
|
pending_snapshot->root = root;
|
2010-12-20 16:04:08 +08:00
|
|
|
pending_snapshot->readonly = readonly;
|
2013-02-28 18:01:15 +08:00
|
|
|
pending_snapshot->dir = dir;
|
2013-02-07 14:02:44 +08:00
|
|
|
pending_snapshot->inherit = inherit;
|
2010-05-16 22:48:46 +08:00
|
|
|
|
2013-02-28 18:04:33 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2010-05-16 22:48:46 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
btrfs: fix use-after-free after failure to create a snapshot
At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
then attach it to the transaction's list of pending snapshots. After that
we call btrfs_commit_transaction(), and if that returns an error we jump
to 'fail' label, where we kfree() the pending snapshot structure. This can
result in a later use-after-free of the pending snapshot:
1) We allocated the pending snapshot and added it to the transaction's
list of pending snapshots;
2) We call btrfs_commit_transaction(), and it fails either at the first
call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
In both cases, we don't abort the transaction and we release our
transaction handle. We jump to the 'fail' label and free the pending
snapshot structure. We return with the pending snapshot still in the
transaction's list;
3) Another task commits the transaction. This time there's no error at
all, and then during the transaction commit it accesses a pointer
to the pending snapshot structure that the snapshot creation task
has already freed, resulting in a use-after-free.
This issue could actually be detected by smatch, which produced the
following warning:
fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
So fix this by not having the snapshot creation ioctl directly add the
pending snapshot to the transaction's list. Instead add the pending
snapshot to the transaction handle, and then at btrfs_commit_transaction()
we add the snapshot to the list only when we can guarantee that any error
returned after that point will result in a transaction abort, in which
case the ioctl code can safely free the pending snapshot and no one can
access it anymore.
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
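The ownership rule from this fix can be illustrated with a small userspace model. It is a sketch under stated assumptions: the `demo_*` types and the `early_error`/`late_error` parameters are hypothetical and only stand in for "failure before the point of no return" and "failure after it".

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical models; none of these are the kernel's structures. */
struct demo_pending {
	struct demo_pending *next;
	int error;
};

struct demo_transaction {
	struct demo_pending *pending_list;	/* shared, visible to other committers */
	int aborted;
};

struct demo_handle {
	struct demo_transaction *transaction;
	struct demo_pending *pending_snapshot;	/* private to this handle */
};

/*
 * early_error: failure before the point of no return (no abort).
 * late_error:  failure after it, which always aborts the transaction.
 */
static int demo_commit(struct demo_handle *h, int early_error, int late_error)
{
	struct demo_transaction *t = h->transaction;
	struct demo_pending *p = h->pending_snapshot;

	/* The snapshot was never published, so the caller may free it. */
	if (early_error)
		return early_error;

	/* From here on every error aborts, so publishing is safe. */
	if (p) {
		p->next = t->pending_list;
		t->pending_list = p;
	}

	if (late_error) {
		t->aborted = 1;
		return late_error;
	}

	/* Success: process each pending snapshot and empty the list. */
	while (t->pending_list) {
		struct demo_pending *cur = t->pending_list;

		t->pending_list = cur->next;
		cur->next = NULL;
		cur->error = 0;
	}
	return 0;
}
```

In this model an early failure leaves the shared list untouched, so a later committer can never reach a snapshot that the failing task has already freed.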
2022-01-21 23:44:39 +08:00
|
|
|
trans->pending_snapshot = pending_snapshot;
|
2020-03-13 23:23:20 +08:00
|
|
|
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
2013-03-04 17:44:29 +08:00
|
|
|
if (ret)
|
2012-10-23 03:51:44 +08:00
|
|
|
goto fail;
|
2010-05-16 22:48:46 +08:00
|
|
|
|
|
|
|
ret = pending_snapshot->error;
|
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
|
2014-10-16 04:50:56 +08:00
|
|
|
ret = btrfs_orphan_cleanup(pending_snapshot->snap);
|
|
|
|
if (ret)
|
|
|
|
goto fail;
|
|
|
|
|
2015-03-18 06:25:59 +08:00
|
|
|
inode = btrfs_lookup_dentry(d_inode(dentry->d_parent), dentry);
|
2009-11-12 17:37:02 +08:00
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
ret = PTR_ERR(inode);
|
|
|
|
goto fail;
|
|
|
|
}
|
2013-12-13 08:51:42 +08:00
|
|
|
|
2009-11-12 17:37:02 +08:00
|
|
|
d_instantiate(dentry, inode);
|
|
|
|
ret = 0;
|
2020-06-16 10:17:36 +08:00
|
|
|
pending_snapshot->anon_dev = 0;
|
2009-11-12 17:37:02 +08:00
|
|
|
fail:
|
2020-06-16 10:17:36 +08:00
|
|
|
/* Prevent double freeing of anon_dev */
|
|
|
|
if (ret && pending_snapshot->snap)
|
|
|
|
pending_snapshot->snap->anon_dev = 0;
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(pending_snapshot->snap);
|
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
And with the following metadata leak:
BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents
[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
We reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but there are no corresponding
release/convert calls in btrfs_subvolume_release_metadata().
This leads to the leak.
[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space and
add it to block_rsv->qgroup_rsv_reserved, then free the qgroup reserved
metadata space when releasing the block_rsv.
To do this, we need to change btrfs_subvolume_release_metadata() to
accept a btrfs_root, record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
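The accounting rule behind this fix (whatever the reserve side adds, the release side must hand back) can be sketched in userspace as follows. The `demo_*` names are hypothetical; only `qgroup_rsv_reserved` and `qgroup_to_release` echo names used in the commit message, and the subtraction merely stands in for the kernel's convert call.

```c
#include <stdint.h>

/* Hypothetical model of a block reservation that also tracks qgroup bytes. */
struct demo_block_rsv {
	uint64_t reserved;
	uint64_t qgroup_rsv_reserved;
};

/* Hypothetical model of a qgroup's outstanding preallocated meta reservation. */
struct demo_qgroup {
	uint64_t prealloc;
};

/* Reserve side: charge the qgroup and remember the amount in the block_rsv. */
static void demo_subvolume_reserve(struct demo_qgroup *qg,
				   struct demo_block_rsv *rsv,
				   uint64_t bytes, uint64_t qgroup_bytes)
{
	qg->prealloc += qgroup_bytes;
	rsv->qgroup_rsv_reserved += qgroup_bytes;
	rsv->reserved += bytes;
}

/* Release side: hand back exactly what the reserve side recorded. */
static void demo_subvolume_release(struct demo_qgroup *qg,
				   struct demo_block_rsv *rsv)
{
	uint64_t qgroup_to_release = rsv->qgroup_rsv_reserved;

	rsv->reserved = 0;
	rsv->qgroup_rsv_reserved = 0;
	qg->prealloc -= qgroup_to_release;	/* stands in for the convert call */
}
```

Without the last line of the release helper, the qgroup counter in this model would stay at the reserved amount (49152 bytes in the report above), which corresponds to the "unreleased space" warning at unmount.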
2020-07-24 14:46:10 +08:00
|
|
|
btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
|
2015-11-11 01:54:00 +08:00
|
|
|
free_pending:
|
2020-06-16 10:17:36 +08:00
|
|
|
if (pending_snapshot->anon_dev)
|
|
|
|
free_anon_bdev(pending_snapshot->anon_dev);
|
2015-11-11 01:54:00 +08:00
|
|
|
kfree(pending_snapshot->root_item);
|
2015-11-11 01:54:03 +08:00
|
|
|
btrfs_free_path(pending_snapshot->path);
|
2015-11-11 01:53:56 +08:00
|
|
|
kfree(pending_snapshot);
|
|
|
|
|
2008-06-12 09:53:53 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-10-30 03:46:43 +08:00
|
|
|
/* copy of may_delete() in fs/namei.c
|
|
|
|
* Check whether we can remove a link victim from directory dir, check
|
|
|
|
* whether the type of victim is right.
|
|
|
|
* 1. We can't do it if dir is read-only (done in permission())
|
|
|
|
* 2. We should have write and exec permissions on dir
|
|
|
|
* 3. We can't remove anything from append-only dir
|
|
|
|
* 4. We can't do anything with immutable dir (done in permission())
|
|
|
|
* 5. If the sticky bit on dir is set we should either
|
|
|
|
* a. be owner of dir, or
|
|
|
|
* b. be owner of victim, or
|
|
|
|
* c. have CAP_FOWNER capability
|
2016-05-20 09:18:45 +08:00
|
|
|
* 6. If the victim is append-only or immutable we can't do anything with
|
2010-10-30 03:46:43 +08:00
|
|
|
* links pointing to it.
|
|
|
|
* 7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
|
|
|
|
* 8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
|
|
|
|
* 9. We can't remove a root or mountpoint.
|
|
|
|
* 10. We don't allow removal of NFS sillyrenamed files; it's handled by
|
|
|
|
* nfs_async_unlink().
|
|
|
|
*/
|
|
|
|
|
btrfs: allow idmapped SNAP_DESTROY ioctls
Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.
This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.
This isn't a huge problem: a root user wanting to delete arbitrary
subvolumes can always create another (even detached) mount without
an idmapping attached.
In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is located directly under the inode referenced by
the fd passed for the ioctl() in a follow-up commit.
Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be deleted, but it
can still be deleted by using the subvolume id:

/* Compile the following program as "delete_by_spec". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask btrfs to delete the subvolume with the given id via the fd's filesystem. */
static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
	struct btrfs_ioctl_vol_args_v2 args = {};
	int ret;

	args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
	args.subvolid = subvolid;

	ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
	if (ret < 0)
		return -1;

	return 0;
}

int main(int argc, char *argv[])
{
	int subvolid = 0;

	if (argc < 3)
		exit(1);

	fprintf(stderr, "Opening %s\n", argv[1]);
	int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
	if (fd < 0)
		exit(2);

	subvolid = atoi(argv[2]);

	fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
	int ret = rm_subvolume_by_id(fd, subvolid);
	if (ret < 0)
		exit(3);

	exit(0);
}

truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
sudo mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
# The parent subvolume B has to exist before C can be created under it.
btrfs subvolume create /mnt/B
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}
With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.
The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-27 18:48:53 +08:00
|
|
|
static int btrfs_may_delete(struct user_namespace *mnt_userns,
|
|
|
|
struct inode *dir, struct dentry *victim, int isdir)
|
2010-10-30 03:46:43 +08:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2015-03-18 06:25:59 +08:00
|
|
|
if (d_really_is_negative(victim))
|
2010-10-30 03:46:43 +08:00
|
|
|
return -ENOENT;
|
|
|
|
|
2015-03-18 06:25:59 +08:00
|
|
|
BUG_ON(d_inode(victim->d_parent) != dir);
|
2012-10-11 03:25:25 +08:00
|
|
|
audit_inode_child(dir, victim, AUDIT_TYPE_CHILD_DELETE);
|
2010-10-30 03:46:43 +08:00
|
|
|
|
2021-07-27 18:48:53 +08:00
|
|
|
error = inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC);
|
2010-10-30 03:46:43 +08:00
|
|
|
if (error)
|
|
|
|
return error;
|
|
|
|
if (IS_APPEND(dir))
|
|
|
|
return -EPERM;
|
2021-07-27 18:48:53 +08:00
|
|
|
if (check_sticky(mnt_userns, dir, d_inode(victim)) ||
|
2021-01-21 21:19:31 +08:00
|
|
|
IS_APPEND(d_inode(victim)) || IS_IMMUTABLE(d_inode(victim)) ||
|
|
|
|
IS_SWAPFILE(d_inode(victim)))
|
2010-10-30 03:46:43 +08:00
|
|
|
return -EPERM;
|
|
|
|
if (isdir) {
|
VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:
(1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
(2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
(3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more
complicated than it appears as some calls should be converted to
d_can_lookup() instead. The difference is whether the directory in
question is a real dir with a ->lookup op or whether it's a fake dir with
a ->d_automount op.
In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (i.e. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).
Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer. In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.
However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.
There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was
intended for special directory entry types that don't have attached inodes.
The following perl+coccinelle script was used:

use strict;

my $fd;		# declared so the script compiles under "use strict"
my @callers;

# Collect every file that still uses S_IS*(...->d_inode...).
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

# Semantic patch: one rule per S_IS* macro being converted.
my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

# Apply the semantic patch in place to each matching file.
foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}
[AV: overlayfs parts skipped]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-29 20:02:35 +08:00
|
|
|
if (!d_is_dir(victim))
|
2010-10-30 03:46:43 +08:00
|
|
|
return -ENOTDIR;
|
|
|
|
if (IS_ROOT(victim))
|
|
|
|
return -EBUSY;
|
2015-01-29 20:02:35 +08:00
|
|
|
} else if (d_is_dir(victim))
|
2010-10-30 03:46:43 +08:00
|
|
|
return -EISDIR;
|
|
|
|
if (IS_DEADDIR(dir))
|
|
|
|
return -ENOENT;
|
|
|
|
if (victim->d_flags & DCACHE_NFSFS_RENAMED)
|
|
|
|
return -EBUSY;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-10-10 01:39:39 +08:00
|
|
|
/* copy of may_create() in fs/namei.c */
|
2021-07-27 18:48:52 +08:00
|
|
|
static inline int btrfs_may_create(struct user_namespace *mnt_userns,
|
|
|
|
struct inode *dir, struct dentry *child)
|
2008-10-10 01:39:39 +08:00
|
|
|
{
|
2015-03-18 06:25:59 +08:00
|
|
|
if (d_really_is_positive(child))
|
2008-10-10 01:39:39 +08:00
|
|
|
return -EEXIST;
|
|
|
|
if (IS_DEADDIR(dir))
|
|
|
|
return -ENOENT;
|
2021-07-27 18:48:52 +08:00
|
|
|
if (!fsuidgid_has_mapping(dir->i_sb, mnt_userns))
|
2021-07-27 18:48:51 +08:00
|
|
|
return -EOVERFLOW;
|
2021-07-27 18:48:52 +08:00
|
|
|
return inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC);
|
2008-10-10 01:39:39 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a new subvolume below @parent. This is largely modeled after
|
|
|
|
* sys_mkdirat and vfs_mkdir, but we only do a single component lookup
|
|
|
|
* inside this filesystem so it's quite a bit simpler.
|
|
|
|
*/
|
2016-11-21 08:34:31 +08:00
|
|
|
static noinline int btrfs_mksubvol(const struct path *parent,
|
2021-07-27 18:48:52 +08:00
|
|
|
struct user_namespace *mnt_userns,
|
2017-02-15 01:33:53 +08:00
|
|
|
const char *name, int namelen,
|
2010-10-30 03:41:32 +08:00
|
|
|
struct btrfs_root *snap_src,
|
2020-03-13 23:23:20 +08:00
|
|
|
bool readonly,
|
2013-02-07 14:02:44 +08:00
|
|
|
struct btrfs_qgroup_inherit *inherit)
|
2008-10-10 01:39:39 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *dir = d_inode(parent->dentry);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(dir->i_sb);
|
2008-10-10 01:39:39 +08:00
|
|
|
struct dentry *dentry;
|
|
|
|
int error;
|
|
|
|
|
2016-05-26 12:05:12 +08:00
|
|
|
error = down_write_killable_nested(&dir->i_rwsem, I_MUTEX_PARENT);
|
|
|
|
if (error == -EINTR)
|
|
|
|
return error;
|
2008-10-10 01:39:39 +08:00
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
dentry = lookup_one(mnt_userns, name, parent->dentry, namelen);
|
2008-10-10 01:39:39 +08:00
|
|
|
error = PTR_ERR(dentry);
|
|
|
|
if (IS_ERR(dentry))
|
|
|
|
goto out_unlock;
|
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
error = btrfs_may_create(mnt_userns, dir, dentry);
|
2008-10-10 01:39:39 +08:00
|
|
|
if (error)
|
2012-06-29 17:58:46 +08:00
|
|
|
goto out_dput;
|
2008-10-10 01:39:39 +08:00
|
|
|
|
2012-12-18 03:26:57 +08:00
|
|
|
/*
|
|
|
|
* even if this name doesn't exist, we may get hash collisions.
|
|
|
|
* check for them now when we can safely fail
|
|
|
|
*/
|
|
|
|
error = btrfs_check_dir_item_collision(BTRFS_I(dir)->root,
|
|
|
|
dir->i_ino, name,
|
|
|
|
namelen);
|
|
|
|
if (error)
|
|
|
|
goto out_dput;
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
down_read(&fs_info->subvol_sem);
|
2009-09-22 04:00:26 +08:00
|
|
|
|
|
|
|
if (btrfs_root_refs(&BTRFS_I(dir)->root->root_item) == 0)
|
|
|
|
goto out_up_read;
|
|
|
|
|
2020-03-13 23:23:20 +08:00
|
|
|
if (snap_src)
|
|
|
|
error = create_snapshot(snap_src, dir, dentry, readonly, inherit);
|
|
|
|
else
|
2022-03-10 09:31:39 +08:00
|
|
|
error = create_subvol(mnt_userns, dir, dentry, inherit);
|
2020-03-13 23:23:20 +08:00
|
|
|
|
2009-09-22 04:00:26 +08:00
|
|
|
if (!error)
|
|
|
|
fsnotify_mkdir(dir, dentry);
|
|
|
|
out_up_read:
|
2016-06-23 06:54:23 +08:00
|
|
|
up_read(&fs_info->subvol_sem);
|
2008-10-10 01:39:39 +08:00
|
|
|
out_dput:
|
|
|
|
dput(dentry);
|
|
|
|
out_unlock:
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(dir, 0);
|
2008-10-10 01:39:39 +08:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2020-05-14 17:19:18 +08:00
|
|
|
static noinline int btrfs_mksnapshot(const struct path *parent,
|
2021-07-27 18:48:52 +08:00
|
|
|
struct user_namespace *mnt_userns,
|
2020-05-14 17:19:18 +08:00
|
|
|
const char *name, int namelen,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
bool readonly,
|
|
|
|
struct btrfs_qgroup_inherit *inherit)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
bool snapshot_force_cow = false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Force new buffered writes to reserve space even when NOCOW is
|
|
|
|
* possible. This is to avoid later writeback (running delalloc) having to
|
|
|
|
* fall back to COW mode and unexpectedly fail with ENOSPC.
|
|
|
|
*/
|
|
|
|
btrfs_drew_read_lock(&root->snapshot_lock);
|
|
|
|
|
btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to
copy the inline extent data into a page of the destination inode.
When this happens, we end up starting a transaction while having a dirty
page for the destination inode and while having the range locked in the
destination's inode iotree too. Because when reserving metadata space
for a transaction we may need to flush existing delalloc in case there is
not enough free space, we have a mechanism in place to prevent a deadlock,
which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when
cloning inline extent and low on free metadata space").
However when using qgroups, a transaction also reserves metadata qgroup
space, which can also result in flushing delalloc in case there is not
enough available space at the moment. When this happens we deadlock, since
flushing delalloc requires locking the file range in the inode's iotree
and the range was already locked at the very beginning of the clone
operation, before attempting to start the transaction.
When this issue happens, stack traces like the following are reported:
[72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000
[72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
[72747.556271] Call Trace:
[72747.556273] __schedule+0x296/0x760
[72747.556277] schedule+0x3c/0xa0
[72747.556279] io_schedule+0x12/0x40
[72747.556284] __lock_page+0x13c/0x280
[72747.556287] ? generic_file_readonly_mmap+0x70/0x70
[72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs]
[72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160
[72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
[72747.556362] ? update_group_capacity+0x25/0x210
[72747.556366] ? cpumask_next_and+0x1a/0x20
[72747.556391] extent_writepages+0x44/0xa0 [btrfs]
[72747.556394] do_writepages+0x41/0xd0
[72747.556398] __writeback_single_inode+0x39/0x2a0
[72747.556403] writeback_sb_inodes+0x1ea/0x440
[72747.556407] __writeback_inodes_wb+0x5f/0xc0
[72747.556410] wb_writeback+0x235/0x2b0
[72747.556414] ? get_nr_inodes+0x35/0x50
[72747.556417] wb_workfn+0x354/0x490
[72747.556420] ? newidle_balance+0x2c5/0x3e0
[72747.556424] process_one_work+0x1aa/0x340
[72747.556426] worker_thread+0x30/0x390
[72747.556429] ? create_worker+0x1a0/0x1a0
[72747.556432] kthread+0x116/0x130
[72747.556435] ? kthread_park+0x80/0x80
[72747.556438] ret_from_fork+0x1f/0x30
[72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[72747.566961] Call Trace:
[72747.566964] __schedule+0x296/0x760
[72747.566968] ? finish_wait+0x80/0x80
[72747.566970] schedule+0x3c/0xa0
[72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
[72747.566999] ? finish_wait+0x80/0x80
[72747.567024] lock_extent_bits+0x37/0x90 [btrfs]
[72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs]
[72747.567051] ? find_get_pages_range_tag+0x2cd/0x380
[72747.567076] __extent_writepage+0x203/0x320 [btrfs]
[72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[72747.567106] ? update_load_avg+0x7e/0x5f0
[72747.567109] ? enqueue_entity+0xf4/0x6f0
[72747.567134] extent_writepages+0x44/0xa0 [btrfs]
[72747.567137] ? enqueue_task_fair+0x93/0x6f0
[72747.567140] do_writepages+0x41/0xd0
[72747.567144] __filemap_fdatawrite_range+0xc7/0x100
[72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs]
[72747.567200] process_one_work+0x1aa/0x340
[72747.567202] worker_thread+0x30/0x390
[72747.567205] ? create_worker+0x1a0/0x1a0
[72747.567208] kthread+0x116/0x130
[72747.567211] ? kthread_park+0x80/0x80
[72747.567214] ret_from_fork+0x1f/0x30
[72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000
[72747.569689] Call Trace:
[72747.569691] __schedule+0x296/0x760
[72747.569694] schedule+0x3c/0xa0
[72747.569721] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569725] ? finish_wait+0x80/0x80
[72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
[72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
[72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
[72747.569810] ? path_lookupat.isra.48+0x97/0x140
[72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs]
[72747.569836] ? __kmalloc+0x16a/0x2c0
[72747.569839] do_iter_readv_writev+0x160/0x1c0
[72747.569843] do_iter_write+0x80/0x1b0
[72747.569847] vfs_writev+0x84/0x140
[72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs]
[72747.569873] do_writev+0x65/0x100
[72747.569876] do_syscall_64+0x33/0x40
[72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000
[72747.569903] Call Trace:
[72747.569906] __schedule+0x296/0x760
[72747.569909] schedule+0x3c/0xa0
[72747.569936] try_flush_qgroup+0x95/0x140 [btrfs]
[72747.569940] ? finish_wait+0x80/0x80
[72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
[72747.569989] start_transaction+0x279/0x580 [btrfs]
[72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs]
[72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs]
[72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs]
[72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs]
[72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
[72747.570126] do_clone_file_range+0xed/0x200
[72747.570131] vfs_clone_file_range+0x37/0x110
[72747.570134] ioctl_file_clone+0x7d/0xb0
[72747.570137] do_vfs_ioctl+0x138/0x630
[72747.570140] __x64_sys_ioctl+0x62/0xc0
[72747.570143] do_syscall_64+0x33/0x40
[72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9
So fix this by skipping the flush of delalloc for an inode that is
flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
such a special case of cloning an inline extent, when flushing delalloc
during qgroup metadata reservation.
The special cases for cloning inline extents were added in kernel 5.7 by
commit 05a5a7621ce66c ("Btrfs: implement full reflink support for
inline extents"), while having qgroup metadata space reservation flushing
delalloc when low on space was added in kernel 5.9 by commit
c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get
-EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
kernel backports.
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-22 19:08:05 +08:00
|
|
|
ret = btrfs_start_delalloc_snapshot(root, false);
|
2020-05-14 17:19:18 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All previous writes have started writeback in NOCOW mode, so now
|
|
|
|
* we force future writes to fall back to COW mode during snapshot
|
|
|
|
* creation.
|
|
|
|
*/
|
|
|
|
atomic_inc(&root->snapshot_force_cow);
|
|
|
|
snapshot_force_cow = true;
|
|
|
|
|
|
|
|
btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
|
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
ret = btrfs_mksubvol(parent, mnt_userns, name, namelen,
|
2020-05-14 17:19:18 +08:00
|
|
|
root, readonly, inherit);
|
|
|
|
out:
|
|
|
|
if (snapshot_force_cow)
|
|
|
|
atomic_dec(&root->snapshot_force_cow);
|
|
|
|
btrfs_drew_read_unlock(&root->snapshot_lock);
|
|
|
|
return ret;
|
|
|
|
}
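
The error-path bookkeeping in btrfs_mksnapshot() above can be sketched in isolation. This is a minimal userspace model, not the kernel code: `struct toy_root`, `toy_mksnapshot()` and the `flush_ret`/`create_ret` parameters are hypothetical stand-ins for the root, the delalloc flush result and the snapshot creation result. It only illustrates that the counter is raised after the flush succeeds and lowered on exit only if it was raised.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-root counter (not struct btrfs_root). */
struct toy_root {
	atomic_int snapshot_force_cow;
};

/*
 * flush_ret models the result of the delalloc flush, create_ret the result
 * of the snapshot creation. Both are supplied by the caller so the control
 * flow can be exercised without a filesystem.
 */
static int toy_mksnapshot(struct toy_root *root, int flush_ret, int create_ret)
{
	bool force_cow = false;
	int ret;

	assert(root != NULL);

	ret = flush_ret;
	if (ret)
		goto out;	/* counter was never raised, nothing to undo */

	/* From here on, new writes are expected to use COW. */
	atomic_fetch_add(&root->snapshot_force_cow, 1);
	force_cow = true;

	ret = create_ret;
out:
	/* Undo the increment only if this call performed it. */
	if (force_cow)
		atomic_fetch_sub(&root->snapshot_force_cow, 1);
	return ret;
}
```

Whichever step fails, the counter returns to its previous value, which is why a local flag is tracked instead of decrementing unconditionally.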
|
|
|
|
|
2022-02-11 14:46:12 +08:00
|
|
|
/*
|
|
|
|
* Defrag specific helper to get an extent map.
|
|
|
|
*
|
|
|
|
* Differences between this and btrfs_get_extent() are:
|
|
|
|
*
|
|
|
|
* - No extent_map will be added to inode->extent_tree
|
|
|
|
* To reduce memory usage in the long run.
|
|
|
|
*
|
|
|
|
* - Extra optimization to skip file extents older than @newer_than
|
|
|
|
* By using btrfs_search_forward() we can skip entire file ranges that
|
|
|
|
* have extents created in past transactions, because btrfs_search_forward()
|
|
|
|
* will not visit leaves and nodes with a generation smaller than given
|
|
|
|
* minimal generation threshold (@newer_than).
|
|
|
|
*
|
|
|
|
* Return valid em if we find a file extent matching the requirement.
|
|
|
|
* Return NULL if we can not find a file extent matching the requirement.
|
|
|
|
*
|
|
|
|
* Return ERR_PTR() for error.
|
|
|
|
*/
|
|
|
|
static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
|
|
|
|
u64 start, u64 newer_than)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root = inode->root;
|
|
|
|
struct btrfs_file_extent_item *fi;
|
|
|
|
struct btrfs_path path = { 0 };
|
|
|
|
struct extent_map *em;
|
|
|
|
struct btrfs_key key;
|
|
|
|
u64 ino = btrfs_ino(inode);
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
em = alloc_extent_map();
|
|
|
|
if (!em) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = ino;
|
|
|
|
key.type = BTRFS_EXTENT_DATA_KEY;
|
|
|
|
key.offset = start;
|
|
|
|
|
|
|
|
if (newer_than) {
|
|
|
|
ret = btrfs_search_forward(root, &key, &path, newer_than);
|
|
|
|
if (ret < 0)
|
|
|
|
goto err;
|
|
|
|
/* Can't find anything newer */
|
|
|
|
if (ret > 0)
|
|
|
|
goto not_found;
|
|
|
|
} else {
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
|
|
|
|
/*
|
|
|
|
* If btrfs_search_slot() makes the path point beyond nritems,
|
|
|
|
* we should not have an empty leaf, as this inode must at
|
|
|
|
* least have its INODE_ITEM.
|
|
|
|
*/
|
|
|
|
ASSERT(btrfs_header_nritems(path.nodes[0]));
|
|
|
|
path.slots[0] = btrfs_header_nritems(path.nodes[0]) - 1;
|
|
|
|
}
|
|
|
|
btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
|
|
|
|
/* Perfect match, no need to go one slot back */
|
|
|
|
if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY &&
|
|
|
|
key.offset == start)
|
|
|
|
goto iterate;
|
|
|
|
|
|
|
|
/* We didn't find a perfect match, so we need to go one slot back */
|
|
|
|
if (path.slots[0] > 0) {
|
|
|
|
btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
|
|
|
|
if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
|
|
|
|
path.slots[0]--;
|
|
|
|
}
|
|
|
|
|
|
|
|
iterate:
|
|
|
|
/* Iterate through the path to find a file extent covering @start */
|
|
|
|
while (true) {
|
|
|
|
u64 extent_end;
|
|
|
|
|
|
|
|
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We may go one slot back to INODE_REF/XATTR item, then
|
|
|
|
* need to go forward until we reach an EXTENT_DATA.
|
|
|
|
* But we should still have the correct ino as key.objectid.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(key.objectid < ino) || key.type < BTRFS_EXTENT_DATA_KEY)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
/* It's beyond our target range, so there is definitely no extent to find */
|
|
|
|
if (key.objectid > ino || key.type > BTRFS_EXTENT_DATA_KEY)
|
|
|
|
goto not_found;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* | |<- File extent ->|
|
|
|
|
* \- start
|
|
|
|
*
|
|
|
|
* This means there is a hole between start and key.offset.
|
|
|
|
*/
|
|
|
|
if (key.offset > start) {
|
|
|
|
em->start = start;
|
|
|
|
em->orig_start = start;
|
|
|
|
em->block_start = EXTENT_MAP_HOLE;
|
|
|
|
em->len = key.offset - start;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
|
|
|
|
struct btrfs_file_extent_item);
|
|
|
|
extent_end = btrfs_file_extent_end(&path);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* |<- file extent ->| |
|
|
|
|
* \- start
|
|
|
|
*
|
|
|
|
* We haven't reached start, search next slot.
|
|
|
|
*/
|
|
|
|
if (extent_end <= start)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
/* Now this extent covers @start, convert it to em */
|
|
|
|
btrfs_extent_item_to_extent_map(inode, &path, fi, false, em);
|
|
|
|
break;
|
|
|
|
next:
|
|
|
|
ret = btrfs_next_item(root, &path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto err;
|
|
|
|
if (ret > 0)
|
|
|
|
goto not_found;
|
|
|
|
}
|
|
|
|
btrfs_release_path(&path);
|
|
|
|
return em;
|
|
|
|
|
|
|
|
not_found:
|
|
|
|
btrfs_release_path(&path);
|
|
|
|
free_extent_map(em);
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
err:
|
|
|
|
btrfs_release_path(&path);
|
|
|
|
free_extent_map(em);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
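
The iteration in defrag_get_extent() above distinguishes three outcomes for @start: a hole before the next file extent, an extent covering @start, or nothing found. The following is a simplified sketch of that classification over a plain sorted array. `struct toy_extent`, `struct toy_map` and `toy_lookup()` are hypothetical names introduced for illustration, and the b-tree search, key type checks and @newer_than filtering are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat file extent: covers [offset, offset + len). */
struct toy_extent {
	uint64_t offset;
	uint64_t len;
};

enum toy_result { TOY_NOT_FOUND, TOY_HOLE, TOY_EXTENT };

struct toy_map {
	enum toy_result kind;
	uint64_t start;
	uint64_t len;
};

/* Classify @start against @n extents sorted by ascending offset. */
static struct toy_map toy_lookup(const struct toy_extent *ext, size_t n,
				 uint64_t start)
{
	struct toy_map map = { TOY_NOT_FOUND, 0, 0 };

	assert(ext != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (ext[i].offset > start) {
			/* Gap between @start and this extent: report a hole. */
			map.kind = TOY_HOLE;
			map.start = start;
			map.len = ext[i].offset - start;
			return map;
		}
		/* Extent ends at or before @start: move to the next one. */
		if (ext[i].offset + ext[i].len <= start)
			continue;
		/* This extent covers @start. */
		map.kind = TOY_EXTENT;
		map.start = ext[i].offset;
		map.len = ext[i].len;
		return map;
	}
	return map;
}
```

With extents at [0, 4096) and [8192, 12288), looking up 4096 gives a hole of length 4096, looking up 8192 gives the second extent, and looking up 12288 finds nothing.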
|
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
|
2022-02-11 14:46:12 +08:00
|
|
|
u64 newer_than, bool locked)
|
2012-03-29 21:57:45 +08:00
|
|
|
{
|
|
|
|
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
|
2012-06-11 16:03:35 +08:00
|
|
|
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
|
|
|
|
struct extent_map *em;
|
2021-08-06 16:12:34 +08:00
|
|
|
const u32 sectorsize = BTRFS_I(inode)->root->fs_info->sectorsize;
|
2012-03-29 21:57:45 +08:00
|
|
|
|
2012-06-11 16:03:35 +08:00
|
|
|
/*
|
|
|
|
* hopefully we have this extent in the tree already, try without
|
|
|
|
* the full extent lock
|
|
|
|
*/
|
2012-03-29 21:57:45 +08:00
|
|
|
read_lock(&em_tree->lock);
|
2021-08-06 16:12:34 +08:00
|
|
|
em = lookup_extent_mapping(em_tree, start, sectorsize);
|
2012-03-29 21:57:45 +08:00
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
|
btrfs: defrag: don't use merged extent map for their generation check
For extent maps, if they are not compressed extents and are adjacent by
logical addresses and file offsets, they can be merged into one larger
extent map.
Such merged extent map will have the higher generation of all the
original ones.
But this brings a problem for autodefrag, as it relies on accurate
extent_map::generation to determine if one extent should be defragged.
For merged extent maps, their higher generation can mark some older
extents to be defragged while the original extent map doesn't meet the
minimal generation threshold.
Thus this will cause extra IO.
To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
to indicate if the extent map is merged from one or more ems.
And for autodefrag, if we find a merged extent map, and its generation
meets the generation requirement, we just don't use this one, and go
back to defrag_get_extent() to read extent maps from subvolume trees.
This could cause more read IO, but should result in less defrag data written,
so in the long run it should be a win for autodefrag.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-11 14:46:13 +08:00
|
|
|
/*
|
|
|
|
* We can get a merged extent, in that case, we need to re-search
|
|
|
|
* tree to get the original em for defrag.
|
|
|
|
*
|
|
|
|
* If @newer_than is 0 or em::generation < newer_than, we can trust
|
|
|
|
* this em, as either we don't care about the generation, or the
|
|
|
|
* merged extent map will be rejected anyway.
|
|
|
|
*/
|
|
|
|
if (em && test_bit(EXTENT_FLAG_MERGED, &em->flags) &&
|
|
|
|
newer_than && em->generation >= newer_than) {
|
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
|
|
|
}
|
|
|
|
|
2012-06-11 16:03:35 +08:00
|
|
|
if (!em) {
|
2014-03-11 21:56:15 +08:00
|
|
|
struct extent_state *cached = NULL;
|
2021-08-06 16:12:34 +08:00
|
|
|
u64 end = start + sectorsize - 1;
|
2014-03-11 21:56:15 +08:00
|
|
|
|
2012-06-11 16:03:35 +08:00
|
|
|
/* get the big lock and read metadata off disk */
|
2021-08-06 16:12:38 +08:00
|
|
|
if (!locked)
|
2022-09-10 05:53:43 +08:00
|
|
|
lock_extent(io_tree, start, end, &cached);
|
2022-02-11 14:46:12 +08:00
|
|
|
em = defrag_get_extent(BTRFS_I(inode), start, newer_than);
|
2021-08-06 16:12:38 +08:00
|
|
|
if (!locked)
|
2022-09-10 05:53:43 +08:00
|
|
|
unlock_extent(io_tree, start, end, &cached);
|
2012-06-11 16:03:35 +08:00
|
|
|
|
|
|
|
if (IS_ERR(em))
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return em;
|
|
|
|
}
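
The decision defrag_lookup_extent() makes about a cached extent map can be modelled on its own. This sketch assumes a hypothetical `struct toy_em` with only the two fields that matter here. `toy_must_research()` is an illustrative name, not a kernel function, and it covers only the case where a cached map was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct extent_map. */
struct toy_em {
	bool merged;		/* models EXTENT_FLAG_MERGED */
	uint64_t generation;	/* highest generation of the merged maps */
};

/*
 * Return true when a cached map must be dropped and the extent re-read
 * from the subvolume tree. A merged map is only a problem when a
 * generation filter is active and the map would pass it, because its
 * generation may belong to a neighbouring, newer extent.
 */
static bool toy_must_research(const struct toy_em *em, uint64_t newer_than)
{
	assert(em != NULL);

	return em->merged && newer_than != 0 && em->generation >= newer_than;
}
```

A merged map with a generation below the threshold is kept, since it would be rejected by the generation check anyway, and any map is kept when @newer_than is 0.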
|
2012-03-29 21:57:45 +08:00
|
|
|
|
2022-07-09 07:18:42 +08:00
|
|
|
static u32 get_extent_max_capacity(const struct btrfs_fs_info *fs_info,
|
|
|
|
const struct extent_map *em)
|
btrfs: defrag: don't defrag extents which are already at max capacity
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
Then it shows the 2 128K extents just get COW for no extra benefit, with
extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means, by default btrfs defrag ioctl will mark all compressed
extent which is not adjacent to a hole/preallocated range for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-28 15:21:21 +08:00
|
|
|
{
|
|
|
|
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
|
|
|
|
return BTRFS_MAX_COMPRESSED;
|
2022-07-09 07:18:42 +08:00
|
|
|
return fs_info->max_extent_size;
|
2022-01-28 15:21:21 +08:00
|
|
|
}
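
How a caller might use a helper like get_extent_max_capacity() to skip extents that cannot grow any further can be sketched as follows. The limits are taken from the commit message (128K for compressed extents, 128M otherwise). The `TOY_*` constants, `toy_max_capacity()` and `toy_skip_defrag()` are hypothetical, and the `>=` comparison is an assumption about how "already at max capacity" is tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Limits as described in the commit message; names are illustrative. */
#define TOY_MAX_COMPRESSED	(128u * 1024u)		/* 128K */
#define TOY_MAX_UNCOMPRESSED	(128u * 1024u * 1024u)	/* 128M */

/* Largest size a single extent of the given kind can reach. */
static uint32_t toy_max_capacity(bool compressed)
{
	return compressed ? TOY_MAX_COMPRESSED : TOY_MAX_UNCOMPRESSED;
}

/*
 * Return true when defragging the extent cannot make it any larger,
 * so rewriting it would only cost IO (and CPU, if compressed).
 */
static bool toy_skip_defrag(uint64_t len, bool compressed)
{
	assert(len > 0);

	return len >= toy_max_capacity(compressed);
}
```

Under this model a 128K compressed extent is skipped, while a 128K uncompressed extent is still a candidate because it is far below the 128M limit.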
|
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
|
btrfs: avoid defragging extents whose next extents are not targets
[BUG]
There is a report that autodefrag is defragging single sector, which
is completely waste of IO, and no help for defragging:
btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be merged
with next one (B).
If mergeable, we will add range A into target for defrag.
However there is a catch for autodefrag, when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get a
higher chance to merge with the next extent.
But in the next iteration, range B will be looked up by defrag_lookup_extent(),
with non-zero @newer_than.
And if range B is not really newer, it will be rejected directly, causing
only range A to be defragged, while we expect to defrag both range A and
B.
[FIX]
Since the root cause is the difference in check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()
This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.
While there is still some minor difference, the remaining checks are
focus on runtime flags like writeback/delalloc, which are mostly
transient and safe to be checked only in defrag_collect_targets().
Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-15 19:28:05 +08:00
|
|
|
u32 extent_thresh, u64 newer_than, bool locked)
|
2012-06-11 16:03:35 +08:00
|
|
|
{
|
2022-07-09 07:18:42 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2012-06-11 16:03:35 +08:00
|
|
|
struct extent_map *next;
|
btrfs: defrag: don't try to merge regular extents with preallocated extents
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
While with newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even if it's just a single sector.
This can be exposed by the following small script:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file
The output looks like this on older kernels:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..39]: 26664..26703 40 0x1
Which defrags the single sector along with the preallocated extent, and
replaces them with a regular extent at a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
On the other hand, v5.16 is slightly better:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26664..26671 8 0x0
1: [8..39]: 26632..26663 32 0x801
The preallocated range is not defragged, but the sector before it still
gets defragged, which has no need for it.
[CAUSE]
One of the functions reused by the old and new behavior is
defrag_check_next_extent(), it will determine if we should defrag
current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, out of the function, both old and new kernel will
reject preallocated extents.
Such inconsistent behavior causes above behavior.
[FIX]
- Also check if next extent is preallocated
If so, don't defrag current extent.
- Add comments for each branch why we reject the extent
This will reduce the IO caused by defrag ioctl and autodefrag.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-28 15:21:20 +08:00
|
|
|
bool ret = false;
|
2012-06-11 16:03:35 +08:00
|
|
|
|
|
|
|
/* this is the last extent */
|
|
|
|
if (em->start + em->len >= i_size_read(inode))
|
|
|
|
return false;
|
|
|
|
|
2022-02-11 14:46:12 +08:00
|
|
|
/*
|
2022-03-15 19:28:05 +08:00
|
|
|
* Here we need to pass @newer_than when checking the next extent, or
|
|
|
|
* we will hit a case where we mark the current extent for defrag, but the
|
|
|
|
* next one will not be a target.
|
|
|
|
* This will just cause extra IO without really reducing the fragments.
|
2022-02-11 14:46:12 +08:00
|
|
|
*/
|
2022-03-15 19:28:05 +08:00
|
|
|
next = defrag_lookup_extent(inode, em->start + em->len, newer_than, locked);
|
2022-01-28 15:21:20 +08:00
|
|
|
/* No more em or hole */
|
2014-08-27 04:55:54 +08:00
|
|
|
if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
|
btrfs: defrag: don't try to merge regular extents with preallocated extents
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
While with newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even it's just a single sector.
This can be exposed by the following small script:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file
The output looks like this on older kernels:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..39]: 26664..26703 40 0x1
This defrags the single sector along with the preallocated extent, and
replaces them with a regular extent at a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
On the other hand, v5.16 is slightly better:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26664..26671 8 0x0
1: [8..39]: 26632..26663 32 0x801
The preallocated range is not defragged, but the sector before it still
gets defragged, which is unnecessary.
[CAUSE]
One of the functions shared by the old and new code is
defrag_check_next_extent(), which determines if we should defrag the
current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, outside of that function, both old and new kernels
reject preallocated extents.
This inconsistency causes the behavior above.
[FIX]
- Also check if next extent is preallocated
If so, don't defrag current extent.
- Add comments for each branch why we reject the extent
This will reduce the IO caused by defrag ioctl and autodefrag.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
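The decision described in the commit message above can be modelled in userspace. This is only a sketch: `struct model_extent`, `enum model_extent_kind` and `model_next_is_mergeable()` are hypothetical names invented for illustration, not kernel APIs, and the real defrag_check_next_extent() performs further checks on the extent map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the states of the next extent map. */
enum model_extent_kind {
	MODEL_EXTENT_REGULAR,
	MODEL_EXTENT_HOLE,
	MODEL_EXTENT_INLINE,
	MODEL_EXTENT_PREALLOC,
};

struct model_extent {
	enum model_extent_kind kind;
	uint64_t len;	/* length in bytes */
};

/*
 * Return true if defragging the current extent together with @next could
 * reduce the number of extents. @next == NULL models "no more extent map".
 */
static bool model_next_is_mergeable(const struct model_extent *next)
{
	/* No more extent maps after the current one. */
	if (!next)
		return false;
	assert(next->len > 0);

	switch (next->kind) {
	case MODEL_EXTENT_HOLE:
	case MODEL_EXTENT_INLINE:
	case MODEL_EXTENT_PREALLOC:
		/* Nothing to merge with, so leave the current extent alone. */
		return false;
	case MODEL_EXTENT_REGULAR:
		return true;
	}
	return false;
}
```

With this model, a 4K regular extent followed by a preallocated range is no longer picked, which matches the intent of the [FIX] section.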
2022-01-28 15:21:20 +08:00
|
|
|
goto out;
|
|
|
|
if (test_bit(EXTENT_FLAG_PREALLOC, &next->flags))
|
|
|
|
goto out;
|
btrfs: defrag: don't defrag extents which are already at max capacity
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
The output shows that the two 128K extents just get COWed for no
benefit, with extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means that, by default, the btrfs defrag ioctl will mark every
compressed extent that is not adjacent to a hole/preallocated range for
defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
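A minimal userspace sketch of the max capacity check described above, using the two limits quoted in the commit message (128K for compressed extents, 128M otherwise). The names `model_extent_max_capacity()` and `model_extent_at_max_capacity()` are assumptions made for illustration; the kernel helper get_extent_max_capacity() takes different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Limits as stated in the commit message. */
#define MODEL_MAX_COMPRESSED_EXTENT	(128ULL * 1024)		/* 128K */
#define MODEL_MAX_REGULAR_EXTENT	(128ULL * 1024 * 1024)	/* 128M */

/* Largest size a single extent can reach, depending on compression. */
static uint64_t model_extent_max_capacity(bool compressed)
{
	return compressed ? MODEL_MAX_COMPRESSED_EXTENT :
			    MODEL_MAX_REGULAR_EXTENT;
}

/*
 * Return true if an extent of @len bytes cannot grow any further, so
 * rewriting it would not reduce the number of extents.
 */
static bool model_extent_at_max_capacity(uint64_t len, bool compressed)
{
	assert(len > 0);
	return len >= model_extent_max_capacity(compressed);
}
```

A 128K compressed extent is below the default 256K threshold, so a size threshold alone never rejects it; the capacity check does.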
2022-01-28 15:21:21 +08:00
|
|
|
/*
|
|
|
|
* If the next extent is at its max capacity, defragging current extent
|
|
|
|
* makes no sense, as the total number of extents won't change.
|
|
|
|
*/
|
2022-07-09 07:18:42 +08:00
|
|
|
if (next->len >= get_extent_max_capacity(fs_info, em))
|
2022-01-28 15:21:21 +08:00
|
|
|
goto out;
|
btrfs: avoid defragging extents whose next extents are not targets
[BUG]
There is a report that autodefrag is defragging a single sector, which
is a complete waste of IO and does not help defragmentation:
btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be merged
with next one (B).
If mergeable, we will add range A to the targets for defrag.
However, there is a catch for autodefrag: when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get a
higher chance to merge with the next extent.
But in the next iteration, range B will be looked up by
defrag_lookup_extent() with a non-zero @newer_than.
If range B is not really newer, it will be rejected directly, so
only range A gets defragged, while we expected to defrag both range A
and B.
[FIX]
Since the root cause is the difference in check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()
This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.
While there are still some minor differences, the remaining checks
focus on runtime flags like writeback/delalloc, which are mostly
transient and safe to be checked only in defrag_collect_targets().
Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
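The mismatch described above can be reproduced with a small userspace model. This is a sketch under stated assumptions: `struct model_extent`, `model_is_candidate()` and `model_count_targets()` are hypothetical names, and the model reduces "mergeable" to the generation and size checks only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of an extent map used here. */
struct model_extent {
	uint64_t generation;
	uint64_t len;	/* length in bytes */
};

/* The per-extent filter: skip older extents and too large extents. */
static bool model_is_candidate(const struct model_extent *em,
			       uint64_t newer_than, uint64_t extent_thresh)
{
	assert(em != NULL);
	if (em->generation < newer_than)
		return false;
	if (em->len >= extent_thresh)
		return false;
	return true;
}

/*
 * Count how many extents would be picked for defrag. An extent is picked
 * if it is a candidate and either its next extent is mergeable or the
 * previous extent was picked. With @consistent_next_check == false the
 * next extent is checked with newer_than == 0, as before the fix.
 */
static size_t model_count_targets(const struct model_extent *ems, size_t nr,
				  uint64_t newer_than, uint64_t extent_thresh,
				  bool consistent_next_check)
{
	size_t targets = 0;
	bool last_is_target = false;

	for (size_t i = 0; i < nr; i++) {
		bool next_ok = false;
		bool is_target = false;

		if (model_is_candidate(&ems[i], newer_than, extent_thresh)) {
			if (i + 1 < nr)
				next_ok = model_is_candidate(&ems[i + 1],
					consistent_next_check ? newer_than : 0,
					extent_thresh);
			is_target = next_ok || last_is_target;
		}
		if (is_target)
			targets++;
		last_is_target = is_target;
	}
	return targets;
}
```

For a new 4K range A followed by an old 4K range B, the inconsistent check picks only A (a single sector), while the consistent check picks nothing.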
2022-03-15 19:28:05 +08:00
|
|
|
/* Skip older extent */
|
|
|
|
if (next->generation < newer_than)
|
|
|
|
goto out;
|
|
|
|
/* Also check extent size */
|
|
|
|
if (next->len >= extent_thresh)
|
|
|
|
goto out;
|
|
|
|
|
2022-01-28 15:21:20 +08:00
|
|
|
ret = true;
|
|
|
|
out:
|
2012-06-11 16:03:35 +08:00
|
|
|
free_extent_map(next);
|
2012-03-29 21:57:45 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-08-06 16:12:35 +08:00
|
|
|
/*
|
|
|
|
* Prepare one page to be defragged.
|
|
|
|
*
|
|
|
|
* This will ensure:
|
|
|
|
*
|
|
|
|
* - Returned page is locked and has been set up properly.
|
|
|
|
* - No ordered extent exists in the page.
|
|
|
|
* - The page is uptodate.
|
|
|
|
*
|
|
|
|
* NOTE: Caller should also wait for page writeback after the cluster is
|
|
|
|
* prepared, here we don't do writeback wait for each page.
|
|
|
|
*/
|
|
|
|
static struct page *defrag_prepare_one_page(struct btrfs_inode *inode,
|
|
|
|
pgoff_t index)
|
2010-03-10 23:52:59 +08:00
|
|
|
{
|
2021-08-06 16:12:35 +08:00
|
|
|
struct address_space *mapping = inode->vfs_inode.i_mapping;
|
|
|
|
gfp_t mask = btrfs_alloc_write_mask(mapping);
|
|
|
|
u64 page_start = (u64)index << PAGE_SHIFT;
|
|
|
|
u64 page_end = page_start + PAGE_SIZE - 1;
|
|
|
|
struct extent_state *cached_state = NULL;
|
|
|
|
struct page *page;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
again:
|
|
|
|
page = find_or_create_page(mapping, index, mask);
|
|
|
|
if (!page)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2010-03-10 23:52:59 +08:00
|
|
|
|
|
|
|
/*
|
btrfs: fix deadlock when defragging transparent huge pages
Attempting to defragment a Btrfs file containing a transparent huge page
immediately deadlocks with the following stack trace:
#0 context_switch (kernel/sched/core.c:4940:2)
#1 __schedule (kernel/sched/core.c:6287:8)
#2 schedule (kernel/sched/core.c:6366:3)
#3 io_schedule (kernel/sched/core.c:8389:2)
#4 wait_on_page_bit_common (mm/filemap.c:1356:4)
#5 __lock_page (mm/filemap.c:1648:2)
#6 lock_page (./include/linux/pagemap.h:625:3)
#7 pagecache_get_page (mm/filemap.c:1910:4)
#8 find_or_create_page (./include/linux/pagemap.h:420:9)
#9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
#10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
#11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
#12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
#13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
#14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
#15 vfs_ioctl (fs/ioctl.c:51:10)
#16 __do_sys_ioctl (fs/ioctl.c:874:11)
#17 __se_sys_ioctl (fs/ioctl.c:860:1)
#18 __x64_sys_ioctl (fs/ioctl.c:860:1)
#19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
#20 do_syscall_64 (arch/x86/entry/common.c:80:7)
#21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
A huge page is represented by a compound page, which consists of a
struct page for each PAGE_SIZE page within the huge page. The first
struct page is the "head page", and the remaining are "tail pages".
Defragmentation attempts to lock each page in the range. However,
lock_page() on a tail page actually locks the corresponding head page.
So, if defragmentation tries to lock more than one struct page in a
compound page, it tries to lock the same head page twice and deadlocks
with itself.
Ideally, we should be able to defragment transparent huge pages.
However, THP for filesystems is currently read-only, so a lot of code is
not ready to use huge pages for I/O. For now, let's just return
ETXTBSY.
This can be reproduced with the following on a kernel with
CONFIG_READ_ONLY_THP_FOR_FS=y:
$ cat create_thp_file.c
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

static const char zeroes[1024 * 1024];
static const size_t FILE_SIZE = 2 * 1024 * 1024;

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s PATH\n", argv[0]);
		return EXIT_FAILURE;
	}
	int fd = creat(argv[1], 0777);
	if (fd == -1) {
		perror("creat");
		return EXIT_FAILURE;
	}
	size_t written = 0;
	while (written < FILE_SIZE) {
		ssize_t ret = write(fd, zeroes,
				    sizeof(zeroes) < FILE_SIZE - written ?
				    sizeof(zeroes) : FILE_SIZE - written);
		if (ret < 0) {
			perror("write");
			return EXIT_FAILURE;
		}
		written += ret;
	}
	close(fd);
	fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open");
		return EXIT_FAILURE;
	}

	/*
	 * Reserve some address space so that we can align the file mapping to
	 * the huge page size.
	 */
	void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (placeholder_map == MAP_FAILED) {
		perror("mmap (placeholder)");
		return EXIT_FAILURE;
	}

	void *aligned_address =
		(void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));

	void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
			 MAP_SHARED | MAP_FIXED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}
	if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
		perror("madvise");
		return EXIT_FAILURE;
	}

	char *line = NULL;
	size_t line_capacity = 0;
	FILE *smaps_file = fopen("/proc/self/smaps", "r");
	if (!smaps_file) {
		perror("fopen");
		return EXIT_FAILURE;
	}
	for (;;) {
		for (size_t off = 0; off < FILE_SIZE; off += 4096)
			((volatile char *)map)[off];

		ssize_t ret;
		bool this_mapping = false;
		while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
			unsigned long start, end, huge;
			if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
				this_mapping = (start <= (uintptr_t)map &&
						(uintptr_t)map < end);
			} else if (this_mapping &&
				   sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
				   huge > 0) {
				return EXIT_SUCCESS;
			}
		}

		sleep(6);
		rewind(smaps_file);
		fflush(smaps_file);
	}
}
$ ./create_thp_file huge
$ btrfs fi defrag -czstd ./huge
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
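The self-deadlock described above can be illustrated with a toy userspace model. This is a sketch only: `struct model_page`, `model_trylock_page()` and `model_prepare_page()` are hypothetical names, and a failed trylock stands in for what would be an endless wait in lock_page().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: tail pages point at their head, other pages at themselves. */
struct model_page {
	struct model_page *head;
	bool compound;	/* part of a huge (compound) page */
	bool locked;	/* only meaningful on the head page */
};

/*
 * Locking any page of a compound page really locks its head page.
 * Returns false where the kernel's lock_page() would block.
 */
static bool model_trylock_page(struct model_page *page)
{
	struct model_page *head = page->head;

	assert(head != NULL);
	if (head->locked)
		return false;
	head->locked = true;
	return true;
}

/*
 * Model of the fix: lock the page, and if it belongs to a compound page,
 * unlock it again and report -ETXTBSY instead of continuing.
 */
static int model_prepare_page(struct model_page *page)
{
	if (!model_trylock_page(page))
		return -EDEADLK;
	if (page->compound) {
		page->head->locked = false;
		return -ETXTBSY;
	}
	return 0;
}
```

Locking the head page and then a tail page of the same compound page fails in the model, which is the point where defrag would wait on itself; bailing out on the first compound page avoids ever reaching the second lock.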
2021-10-20 11:35:01 +08:00
|
|
|
* Since we can defragment files opened read-only, we can encounter
|
|
|
|
* transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS). We
|
|
|
|
* can't do I/O using huge pages yet, so return an error for now.
|
|
|
|
* Filesystem transparent huge pages are typically only used for
|
|
|
|
* executables that explicitly enable them, so this isn't very
|
|
|
|
* restrictive.
|
2010-03-10 23:52:59 +08:00
|
|
|
*/
|
2021-10-20 11:35:01 +08:00
|
|
|
if (PageCompound(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
return ERR_PTR(-ETXTBSY);
|
|
|
|
}
|
2010-03-10 23:52:59 +08:00
|
|
|
|
2021-08-06 16:12:35 +08:00
|
|
|
ret = set_page_extent_mapped(page);
|
|
|
|
if (ret < 0) {
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
2010-03-10 23:52:59 +08:00
|
|
|
|
2021-08-06 16:12:35 +08:00
|
|
|
/* Wait for any existing ordered extent in the range */
|
|
|
|
while (1) {
|
|
|
|
struct btrfs_ordered_extent *ordered;
|
2010-03-10 23:52:59 +08:00
|
|
|
|
2022-09-10 05:53:43 +08:00
|
|
|
lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
|
2021-08-06 16:12:35 +08:00
|
|
|
ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
|
2022-09-10 05:53:43 +08:00
|
|
|
unlock_extent(&inode->io_tree, page_start, page_end,
|
|
|
|
&cached_state);
|
2021-08-06 16:12:35 +08:00
|
|
|
if (!ordered)
|
|
|
|
break;
|
2012-03-29 21:57:45 +08:00
|
|
|
|
2021-08-06 16:12:35 +08:00
|
|
|
unlock_page(page);
|
|
|
|
btrfs_start_ordered_extent(ordered, 1);
|
|
|
|
btrfs_put_ordered_extent(ordered);
|
|
|
|
lock_page(page);
|
|
|
|
/*
|
|
|
|
* We unlocked the page above, so we need to check if it was
|
|
|
|
* released or not.
|
|
|
|
*/
|
|
|
|
if (page->mapping != mapping || !PagePrivate(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
}
|
Btrfs: fix defrag to merge tail file extent
The file layout is
[extent 1]...[extent n][4k extent][HOLE][extent x]
Extents 1~n and the 4k extent can be merged during defrag, and the total
number of bytes to defrag is larger than our defrag threshold (256k).
The 4k extent at the tail is left unmerged, since we check if its next
extent can be merged (the next one is a hole, so the check fails). The
layout can thus become
[new extent][4k extent][HOLE][extent x]
(1~n)
To fix it, besides looking at the next one, this also looks at the
previous one by checking @defrag_end, which is set to 0 when we
decide to stop merging contiguous extents; otherwise, we can merge
the previous one with our extent.
Also, this makes btrfs behave consistently with xfs and ext4.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-07 16:48:41 +08:00
|
|
|
|
2010-03-10 23:52:59 +08:00
|
|
|
/*
|
2021-08-06 16:12:35 +08:00
|
|
|
* Now the page range has no ordered extent any more. Read the page to
|
|
|
|
* make it uptodate.
|
2010-03-10 23:52:59 +08:00
|
|
|
*/
|
2021-08-06 16:12:35 +08:00
|
|
|
if (!PageUptodate(page)) {
|
2022-04-29 23:12:16 +08:00
|
|
|
btrfs_read_folio(NULL, page_folio(page));
|
2021-08-06 16:12:35 +08:00
|
|
|
lock_page(page);
|
|
|
|
if (page->mapping != mapping || !PagePrivate(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
if (!PageUptodate(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
}
|
2010-03-10 23:52:59 +08:00
|
|
|
}
|
2021-08-06 16:12:35 +08:00
|
|
|
return page;
|
2010-03-10 23:52:59 +08:00
|
|
|
}
|
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
struct defrag_target_range {
|
|
|
|
struct list_head list;
|
|
|
|
u64 start;
|
|
|
|
u64 len;
|
|
|
|
};
|
|
|
|
|
2011-05-25 03:35:30 +08:00
|
|
|
/*
|
2021-08-06 16:12:36 +08:00
|
|
|
* Collect all valid target extents.
|
2011-05-25 03:35:30 +08:00
|
|
|
*
|
2021-08-06 16:12:36 +08:00
|
|
|
* @start: file offset to lookup
|
|
|
|
* @len: length to lookup
|
|
|
|
* @extent_thresh: file extent size threshold, any extent size >= this value
|
|
|
|
* will be ignored
|
|
|
|
* @newer_than: only defrag extents newer than this value
|
|
|
|
* @do_compress: whether the defrag is doing compression
|
|
|
|
* if true, @extent_thresh will be ignored and all regular
|
|
|
|
* file extents meeting @newer_than will be targets.
|
2021-08-06 16:12:38 +08:00
|
|
|
* @locked: if the range has already held extent lock
|
2021-08-06 16:12:36 +08:00
|
|
|
* @target_list: list of target file extents
|
2011-05-25 03:35:30 +08:00
|
|
|
*/
|
2021-08-06 16:12:36 +08:00
|
|
|
static int defrag_collect_targets(struct btrfs_inode *inode,
|
|
|
|
u64 start, u64 len, u32 extent_thresh,
|
|
|
|
u64 newer_than, bool do_compress,
|
2022-02-11 14:41:39 +08:00
|
|
|
bool locked, struct list_head *target_list,
|
|
|
|
u64 *last_scanned_ret)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
2022-07-09 07:18:42 +08:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
2022-02-11 14:41:39 +08:00
|
|
|
bool last_is_target = false;
|
2021-08-06 16:12:36 +08:00
|
|
|
u64 cur = start;
|
|
|
|
int ret = 0;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
while (cur < start + len) {
|
|
|
|
struct extent_map *em;
|
|
|
|
struct defrag_target_range *new;
|
|
|
|
bool next_mergeable = true;
|
|
|
|
u64 range_len;
|
2012-03-29 21:57:44 +08:00
|
|
|
|
2022-02-11 14:41:39 +08:00
|
|
|
last_is_target = false;
|
2022-02-11 14:46:12 +08:00
|
|
|
em = defrag_lookup_extent(&inode->vfs_inode, cur,
|
|
|
|
newer_than, locked);
|
2021-08-06 16:12:36 +08:00
|
|
|
if (!em)
|
|
|
|
break;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
btrfs: allow defrag to convert inline extents to regular extents
Btrfs defaults to max_inline=2K to make small writes inlined into
metadata.
The default value used to always be a win: even though DUP/RAID1/RAID10
doubles the metadata usage, an inline extent should still use less
physical space than a 4K regular extent.
But since the introduction of RAID1C3 and RAID1C4 that's no longer the
case. Users may find inlined extents wasting too much space, and want
to convert those inlined extents back to regular extents.
Unfortunately defrag will unconditionally skip all inline extents, even
if the user is trying to convert them back to regular extents.
So this patch will add a small exception for defrag_collect_targets() to
allow defragging inline extents, if and only if the inlined extents are
larger than max_inline, allowing users to convert them to regular ones.
This also allows us to defrag extents like the following:
item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
generation 7 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 16384 ram 16384
extent compression 1 (zlib)
Previously we were unable to do any defrag, since the first extent is
inlined, and the second one has no extent to merge with.
Now we can defrag it to just one single extent, saving 48 bytes metadata
space.
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13635584 nr 4096
extent data offset 0 nr 20480 ram 20480
extent compression 1 (zlib)
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
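The exception described above reduces to a single predicate, sketched here in userspace. `model_skip_inline_extent()` is a hypothetical name; @max_inline stands for the value of the max_inline= mount option in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if defrag should skip the extent. Inline extents that
 * still fit within @max_inline stay inline; larger ones fall through
 * so that defrag can rewrite them as regular extents.
 */
static bool model_skip_inline_extent(bool is_inline, uint64_t len,
				     uint64_t max_inline)
{
	assert(len > 0);
	return is_inline && len <= max_inline;
}
```

With the default of 2K, the 4096-byte inline extent from the dump above is not skipped. Mounting with max_inline=0 makes every inline extent eligible in this model.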
2022-05-09 20:00:53 +08:00
|
|
|
/*
|
|
|
|
* If the file extent is an inlined one, we may still want to
|
|
|
|
* defrag it (fallthrough) if it will cause a regular extent.
|
|
|
|
* This is for users who want to convert inline extents to
|
|
|
|
* regular ones through max_inline= mount option.
|
|
|
|
*/
|
|
|
|
if (em->block_start == EXTENT_MAP_INLINE &&
|
|
|
|
em->len <= inode->root->fs_info->max_inline)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
/* Skip hole/delalloc/preallocated extents */
|
|
|
|
if (em->block_start == EXTENT_MAP_HOLE ||
|
|
|
|
em->block_start == EXTENT_MAP_DELALLOC ||
|
2021-08-06 16:12:36 +08:00
|
|
|
test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
|
|
|
|
goto next;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
/* Skip older extent */
|
|
|
|
if (em->generation < newer_than)
|
|
|
|
goto next;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2022-02-08 14:54:05 +08:00
|
|
|
/* This em is under writeback, no need to defrag */
|
|
|
|
if (em->generation == (u64)-1)
|
|
|
|
goto next;
|
|
|
|
|
btrfs: fix deadlock when reserving space during defrag
When defragging we can end up collecting a range for defrag that
already has pages under delalloc (dirty), as long as the respective extent
map for their range is not mapped to a hole or a prealloc extent, and
the extent map is not from an old generation.
Most of the time that is harmless from a functional perspective at
least, however it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations can not be satisfied right away, the
flusher might be kicked in and start flushing delalloc and wait for
the respective ordered extents to complete. If this happens we will
deadlock, because both flushing delalloc and finishing an ordered
extent, requires locking the range in the inode's io tree, which was
already locked at defrag_collect_targets().
So fix this by skipping extent maps for which there's already delalloc.
Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
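The skip condition introduced by this fix can be sketched in userspace with one delalloc flag per sector. All names here (`struct model_extent_map`, `model_range_len()`, `model_range_has_delalloc()`) are hypothetical; the kernel tracks these bits in the inode's io tree rather than in a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few extent map fields the sketch needs. */
struct model_extent_map {
	uint64_t start;	/* file offset of the extent map, in bytes */
	uint64_t len;	/* length of the extent map, in bytes */
};

/* Bytes of @em at or after @cur, which may be in the middle of @em. */
static uint64_t model_range_len(const struct model_extent_map *em,
				uint64_t cur)
{
	assert(cur >= em->start && cur - em->start < em->len);
	return em->len - (cur - em->start);
}

/*
 * Return true if any sector in [start, start + len) is marked delalloc.
 * @delalloc holds one flag per sector, @nr_sectors entries in total.
 */
static bool model_range_has_delalloc(const bool *delalloc, size_t nr_sectors,
				     uint64_t sectorsize, uint64_t start,
				     uint64_t len)
{
	assert(sectorsize > 0 && len > 0);
	uint64_t first = start / sectorsize;
	uint64_t last = (start + len - 1) / sectorsize;

	for (uint64_t i = first; i <= last && i < nr_sectors; i++) {
		if (delalloc[i])
			return true;
	}
	return false;
}
```

A caller would compute the remaining length of the extent map from the current offset and skip the range as soon as a single sector in it is already flagged.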
2022-01-20 22:27:56 +08:00
|
|
|
/*
|
|
|
|
* Our start offset might be in the middle of an existing extent
|
|
|
|
* map, so take that into account.
|
|
|
|
*/
|
|
|
|
range_len = em->len - (cur - em->start);
|
|
|
|
/*
|
|
|
|
* If this range of the extent map is already flagged for delalloc,
|
|
|
|
* skip it, because:
|
|
|
|
*
|
|
|
|
* 1) We could deadlock later, when trying to reserve space for
|
|
|
|
* delalloc, because in case we can't immediately reserve space
|
|
|
|
* the flusher can start delalloc and wait for the respective
|
|
|
|
* ordered extents to complete. The deadlock would happen
|
|
|
|
* because we do the space reservation while holding the range
|
|
|
|
* locked, and starting writeback, or finishing an ordered
|
|
|
|
* extent, requires locking the range;
|
|
|
|
*
|
|
|
|
* 2) If there's delalloc there, it means there's dirty pages for
|
|
|
|
* which writeback has not started yet (we clean the delalloc
|
|
|
|
* flag when starting writeback and after creating an ordered
|
|
|
|
* extent). If we mark pages in an adjacent range for defrag,
|
|
|
|
* then we will have a larger contiguous range for delalloc,
|
|
|
|
* very likely resulting in a larger extent after writeback is
|
|
|
|
* triggered (except in a case of free space fragmentation).
|
|
|
|
*/
|
|
|
|
if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
|
|
|
|
EXTENT_DELALLOC, 0, NULL))
|
|
|
|
goto next;
|
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
/*
|
|
|
|
* For do_compress case, we want to compress all valid file
|
|
|
|
* extents, thus no @extent_thresh or mergeable check.
|
|
|
|
*/
|
|
|
|
if (do_compress)
|
|
|
|
goto add;
|
|
|
|
|
|
|
|
/* Skip too large extent */
|
btrfs: fix deadlock when reserving space during defrag
When defragging we can end up collecting a range for defrag that has
already pages under delalloc (dirty), as long as the respective extent
map for their range is not mapped to a hole, a prealloc extent or
the extent map is from an old generation.
Most of the time that is harmless from a functional perspective at
least, however it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to the list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations can not be satisfied right away, the
flusher might be kicked in and start flushing delalloc and wait for
the respective ordered extents to complete. If this happens we will
deadlock, because both flushing delalloc and finishing an ordered
extent require locking the range in the inode's io tree, which was
already locked at defrag_one_range().
So fix this by skipping extent maps for which there's already delalloc.
Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-20 22:27:56 +08:00
|
|
|
if (range_len >= extent_thresh)
|
2021-08-06 16:12:36 +08:00
|
|
|
goto next;
|
|
|
|
|
btrfs: defrag: don't defrag extents which are already at max capacity
[BUG]
The defrag ioctl will always try to defrag any compressed extents,
wasting not only IO but also CPU time to compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
The output shows that the two 128K extents just get COWed for no benefit,
at the cost of extra IO/CPU:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag has never checked whether an extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means that, by default, the btrfs defrag ioctl will mark every compressed
extent which is not adjacent to a hole/preallocated range for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-28 15:21:21 +08:00
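The max-capacity check described in the commit message above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct sketch_extent`, the `SKETCH_*` limits and both helper names are hypothetical stand-ins, not the kernel's `get_extent_max_capacity()` or `struct extent_map`. Only the 128K and 128M limits are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; only the two limits come from the commit message. */
#define SKETCH_MAX_COMPRESSED   (128ULL * 1024)         /* 128K */
#define SKETCH_MAX_UNCOMPRESSED (128ULL * 1024 * 1024)  /* 128M */

struct sketch_extent {
	uint64_t len;      /* extent length in bytes */
	bool compressed;   /* true if the extent is stored compressed */
};

/* Largest size an extent of this kind can reach. */
static uint64_t sketch_extent_max_capacity(const struct sketch_extent *em)
{
	assert(em != NULL);
	return em->compressed ? SKETCH_MAX_COMPRESSED : SKETCH_MAX_UNCOMPRESSED;
}

/* An extent already at its max capacity gains nothing from defrag. */
static bool sketch_skip_for_defrag(const struct sketch_extent *em)
{
	return em->len >= sketch_extent_max_capacity(em);
}
```

With these assumptions, a 128K compressed extent is skipped, while a 128K uncompressed extent is still a candidate and is left to the later threshold and mergeability checks.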
|
|
|
/*
|
|
|
|
* Skip extents already at their max capacity; this is mostly for
|
|
|
|
* compressed extents, whose max capacity is only 128K.
|
|
|
|
*/
|
2022-07-09 07:18:42 +08:00
|
|
|
if (em->len >= get_extent_max_capacity(fs_info, em))
|
2022-01-28 15:21:21 +08:00
|
|
|
goto next;
|
|
|
|
|
btrfs: allow defrag to convert inline extents to regular extents
Btrfs defaults to max_inline=2K to make small writes inlined into
metadata.
The default value is always a win: even though DUP/RAID1/RAID10 doubles the
metadata usage, it should still use less physical space than a 4K
regular extent.
But since the introduction of RAID1C3 and RAID1C4 it's no longer the case,
users may find inlined extents causing too much space wasted, and want
to convert those inlined extents back to regular extents.
Unfortunately defrag will unconditionally skip all inline extents, no
matter if the user is trying to convert them back to regular extents.
So this patch will add a small exception for defrag_collect_targets() to
allow defragging inline extents, if and only if the inlined extents are
larger than max_inline, allowing users to convert them to regular ones.
This also allows us to defrag extents like the following:
item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
generation 7 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 16384 ram 16384
extent compression 1 (zlib)
Previously we were unable to do any defrag, since the first extent is
inlined, and the second one has no extent to merge with.
Now we can defrag it to just one single extent, saving 48 bytes metadata
space.
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13635584 nr 4096
extent data offset 0 nr 20480 ram 20480
extent compression 1 (zlib)
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-09 20:00:53 +08:00
|
|
|
/*
|
|
|
|
* Normally there are no more extents after an inline one, thus
|
|
|
|
* @next_mergeable will normally be false and not defragged.
|
|
|
|
* So if an inline extent passed all the above checks, just add it
|
|
|
|
* for defrag, so that it gets converted to a regular extent.
|
|
|
|
*/
|
|
|
|
if (em->block_start == EXTENT_MAP_INLINE)
|
|
|
|
goto add;
|
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
|
btrfs: avoid defragging extents whose next extents are not targets
[BUG]
There is a report that autodefrag is defragging a single sector, which
is a complete waste of IO and does not help defragmentation:
btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be merged
with next one (B).
If mergeable, we will add range A to the targets for defrag.
However there is a catch for autodefrag, when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get a
higher chance to merge with the next extent.
But in the next iteration, range B will be looked up by defrag_lookup_extent(),
with non-zero @newer_than.
And if range B is not really newer, it will be rejected directly, causing
only range A to be defragged, while we expect to defrag both range A and
B.
[FIX]
Since the root cause is the difference in check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()
This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.
While there are still some minor differences, the remaining checks are
focused on runtime flags like writeback/delalloc, which are mostly
transient and safe to be checked only in defrag_collect_targets().
Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-15 19:28:05 +08:00
|
|
|
extent_thresh, newer_than, locked);
|
2021-08-06 16:12:36 +08:00
|
|
|
if (!next_mergeable) {
|
|
|
|
struct defrag_target_range *last;
|
|
|
|
|
|
|
|
/* Empty target list, no way to merge with last entry */
|
|
|
|
if (list_empty(target_list))
|
|
|
|
goto next;
|
|
|
|
last = list_entry(target_list->prev,
|
|
|
|
struct defrag_target_range, list);
|
|
|
|
/* Not mergeable with last entry */
|
|
|
|
if (last->start + last->len != cur)
|
|
|
|
goto next;
|
|
|
|
|
|
|
|
/* Mergeable, fall through to add it to @target_list. */
|
2021-01-26 16:34:00 +08:00
|
|
|
}
|
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
add:
|
2022-02-11 14:41:39 +08:00
|
|
|
last_is_target = true;
|
2021-08-06 16:12:36 +08:00
|
|
|
range_len = min(extent_map_end(em), start + len) - cur;
|
|
|
|
/*
|
|
|
|
* This one is a good target, check if it can be merged into
|
|
|
|
* last range of the target list.
|
|
|
|
*/
|
|
|
|
if (!list_empty(target_list)) {
|
|
|
|
struct defrag_target_range *last;
|
|
|
|
|
|
|
|
last = list_entry(target_list->prev,
|
|
|
|
struct defrag_target_range, list);
|
|
|
|
ASSERT(last->start + last->len <= cur);
|
|
|
|
if (last->start + last->len == cur) {
|
|
|
|
/* Mergeable, enlarge the last entry */
|
|
|
|
last->len += range_len;
|
|
|
|
goto next;
|
2012-03-29 21:57:44 +08:00
|
|
|
}
|
2021-08-06 16:12:36 +08:00
|
|
|
/* Fall through to allocate a new entry */
|
2012-02-16 15:01:24 +08:00
|
|
|
}
|
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
/* Allocate new defrag_target_range */
|
|
|
|
new = kmalloc(sizeof(*new), GFP_NOFS);
|
|
|
|
if (!new) {
|
|
|
|
free_extent_map(em);
|
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
2011-05-25 03:35:30 +08:00
|
|
|
}
|
2021-08-06 16:12:36 +08:00
|
|
|
new->start = cur;
|
|
|
|
new->len = range_len;
|
|
|
|
list_add_tail(&new->list, target_list);
|
2012-02-16 15:01:24 +08:00
|
|
|
|
2021-08-06 16:12:36 +08:00
|
|
|
next:
|
|
|
|
cur = extent_map_end(em);
|
|
|
|
free_extent_map(em);
|
|
|
|
}
|
|
|
|
if (ret < 0) {
|
|
|
|
struct defrag_target_range *entry;
|
|
|
|
struct defrag_target_range *tmp;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(entry, tmp, target_list, list) {
|
|
|
|
list_del_init(&entry->list);
|
|
|
|
kfree(entry);
|
2012-02-16 15:01:24 +08:00
|
|
|
}
|
2021-08-06 16:12:36 +08:00
|
|
|
}
|
2022-02-11 14:41:39 +08:00
|
|
|
if (!ret && last_scanned_ret) {
|
|
|
|
/*
|
|
|
|
* If the last extent is not a target, the caller can skip to
|
|
|
|
* the end of that extent.
|
|
|
|
* Otherwise, we can only go to the end of the specified range.
|
|
|
|
*/
|
|
|
|
if (!last_is_target)
|
|
|
|
*last_scanned_ret = max(cur, *last_scanned_ret);
|
|
|
|
else
|
|
|
|
*last_scanned_ret = max(start + len, *last_scanned_ret);
|
|
|
|
}
|
2021-08-06 16:12:36 +08:00
|
|
|
return ret;
|
|
|
|
}
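The way the function above grows the last target when a new range starts exactly where the previous one ends, and otherwise allocates a new entry, can be sketched in user space as below. This is a minimal sketch, assuming a fixed-size array in place of the kernel's linked list and `kmalloc()`. `struct sketch_range` and `sketch_add_target()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Append [start, start + len) to ranges[0..*nr). If it is contiguous with
 * the last entry, enlarge that entry instead of using a new slot.
 * Returns 0 on success, -1 if the array is full.
 */
static int sketch_add_target(struct sketch_range *ranges, size_t *nr,
			     size_t cap, uint64_t start, uint64_t len)
{
	if (*nr > 0) {
		struct sketch_range *last = &ranges[*nr - 1];

		/* Targets are collected in ascending, non-overlapping order. */
		assert(last->start + last->len <= start);
		if (last->start + last->len == start) {
			/* Mergeable, enlarge the last entry. */
			last->len += len;
			return 0;
		}
	}
	if (*nr == cap)
		return -1;
	ranges[*nr].start = start;
	ranges[*nr].len = len;
	(*nr)++;
	return 0;
}
```

Keeping contiguous targets as one entry means each entry can later be handled as a single contiguous range, which is what defrag_one_locked_target() expects.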
|
|
|
|
|
2021-08-06 16:12:37 +08:00
|
|
|
#define CLUSTER_SIZE (SZ_256K)
|
2022-02-01 22:42:07 +08:00
|
|
|
static_assert(IS_ALIGNED(CLUSTER_SIZE, PAGE_SIZE));
|
2021-08-06 16:12:37 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Defrag one contiguous target range.
|
|
|
|
*
|
|
|
|
* @inode: target inode
|
|
|
|
* @target: target range to defrag
|
|
|
|
* @pages: locked pages covering the defrag range
|
|
|
|
* @nr_pages: number of locked pages
|
|
|
|
*
|
|
|
|
* Caller should ensure:
|
|
|
|
*
|
|
|
|
* - Pages are prepared
|
|
|
|
* Pages should be locked, no ordered extent in the pages range,
|
|
|
|
* no writeback.
|
|
|
|
*
|
|
|
|
* - Extent bits are locked
|
|
|
|
*/
|
|
|
|
static int defrag_one_locked_target(struct btrfs_inode *inode,
|
|
|
|
struct defrag_target_range *target,
|
|
|
|
struct page **pages, int nr_pages,
|
|
|
|
struct extent_state **cached_state)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
|
|
|
struct extent_changeset *data_reserved = NULL;
|
|
|
|
const u64 start = target->start;
|
|
|
|
const u64 len = target->len;
|
|
|
|
unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
|
|
|
|
unsigned long start_index = start >> PAGE_SHIFT;
|
|
|
|
unsigned long first_index = page_index(pages[0]);
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(last_index - first_index + 1 <= nr_pages);
|
|
|
|
|
|
|
|
ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
clear_extent_bit(&inode->io_tree, start, start + len - 1,
|
|
|
|
EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
|
2022-09-10 05:53:47 +08:00
|
|
|
EXTENT_DEFRAG, cached_state);
|
2021-08-06 16:12:37 +08:00
|
|
|
set_extent_defrag(&inode->io_tree, start, start + len - 1, cached_state);
|
2012-02-16 15:01:24 +08:00
|
|
|
|
2021-08-06 16:12:37 +08:00
|
|
|
/* Update the page status */
|
|
|
|
for (i = start_index - first_index; i <= last_index - first_index; i++) {
|
|
|
|
ClearPageChecked(pages[i]);
|
|
|
|
btrfs_page_clamp_set_dirty(fs_info, pages[i], start, len);
|
2011-05-25 03:35:30 +08:00
|
|
|
}
|
2021-08-06 16:12:37 +08:00
|
|
|
btrfs_delalloc_release_extents(inode, len);
|
|
|
|
extent_changeset_free(data_reserved);
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:37 +08:00
|
|
|
return ret;
|
|
|
|
}
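The page-index arithmetic in defrag_one_locked_target() maps a byte range to slots of the caller's page array. A small user-space sketch of that mapping follows, assuming 4K pages. `SKETCH_PAGE_SHIFT`, `struct sketch_slots` and `sketch_page_slots()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K pages */

struct sketch_slots {
	uint64_t first;	/* first slot in the page array covered by the range */
	uint64_t last;	/* last slot covered by the range (inclusive) */
};

/*
 * Map the byte range [start, start + len) to slots of a page array whose
 * slot 0 holds the page with index @first_index.
 */
static struct sketch_slots sketch_page_slots(uint64_t start, uint64_t len,
					     uint64_t first_index)
{
	struct sketch_slots s;
	uint64_t start_index = start >> SKETCH_PAGE_SHIFT;
	uint64_t last_index = (start + len - 1) >> SKETCH_PAGE_SHIFT;

	/* The range must be non-empty and lie at or after the first page. */
	assert(len > 0 && start_index >= first_index);
	s.first = start_index - first_index;
	s.last = last_index - first_index;
	return s;
}
```

For example, with the array starting at page index 16, a target at byte offset 73728 with length 8192 covers slots 2 and 3.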
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
|
2022-02-11 14:41:39 +08:00
|
|
|
u32 extent_thresh, u64 newer_than, bool do_compress,
|
|
|
|
u64 *last_scanned_ret)
|
2021-08-06 16:12:38 +08:00
|
|
|
{
|
|
|
|
struct extent_state *cached_state = NULL;
|
|
|
|
struct defrag_target_range *entry;
|
|
|
|
struct defrag_target_range *tmp;
|
|
|
|
LIST_HEAD(target_list);
|
|
|
|
struct page **pages;
|
|
|
|
const u32 sectorsize = inode->root->fs_info->sectorsize;
|
|
|
|
u64 last_index = (start + len - 1) >> PAGE_SHIFT;
|
|
|
|
u64 start_index = start >> PAGE_SHIFT;
|
|
|
|
unsigned int nr_pages = last_index - start_index + 1;
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
ASSERT(nr_pages <= CLUSTER_SIZE / PAGE_SIZE);
|
|
|
|
ASSERT(IS_ALIGNED(start, sectorsize) && IS_ALIGNED(len, sectorsize));
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
|
|
|
|
if (!pages)
|
|
|
|
return -ENOMEM;
|
btrfs: fix race when defragmenting leads to unnecessary IO
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).
So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.
I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due to a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 19:07:33 +08:00
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
/* Prepare all pages */
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
pages[i] = defrag_prepare_one_page(inode, start_index + i);
|
|
|
|
if (IS_ERR(pages[i])) {
|
|
|
|
ret = PTR_ERR(pages[i]);
|
|
|
|
pages[i] = NULL;
|
|
|
|
goto free_pages;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (i = 0; i < nr_pages; i++)
|
|
|
|
wait_on_page_writeback(pages[i]);
|
|
|
|
|
|
|
|
/* Lock the pages range */
|
2022-09-10 05:53:43 +08:00
|
|
|
lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
|
|
|
|
(last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
|
|
|
|
&cached_state);
|
2020-11-04 19:07:33 +08:00
|
|
|
/*
|
2021-08-06 16:12:38 +08:00
|
|
|
* Now we have a consistent view about the extent map, re-check
|
|
|
|
* which range really needs to be defragged.
|
|
|
|
*
|
|
|
|
* And this time we have the extent locked already, so pass @locked = true
|
|
|
|
* so that we won't relock the extent range and cause deadlock.
|
2020-11-04 19:07:33 +08:00
|
|
|
*/
|
2021-08-06 16:12:38 +08:00
|
|
|
ret = defrag_collect_targets(inode, start, len, extent_thresh,
|
|
|
|
newer_than, do_compress, true,
|
2022-02-11 14:41:39 +08:00
|
|
|
&target_list, last_scanned_ret);
|
2021-08-06 16:12:38 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto unlock_extent;
|
2020-11-04 19:07:33 +08:00
|
|
|
|
2021-08-06 16:12:38 +08:00
|
|
|
list_for_each_entry(entry, &target_list, list) {
|
|
|
|
ret = defrag_one_locked_target(inode, entry, pages, nr_pages,
|
|
|
|
&cached_state);
|
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry_safe(entry, tmp, &target_list, list) {
|
|
|
|
list_del_init(&entry->list);
|
|
|
|
kfree(entry);
|
|
|
|
}
|
|
|
|
unlock_extent:
|
2022-09-10 05:53:43 +08:00
|
|
|
unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
|
|
|
|
(last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
|
|
|
|
&cached_state);
|
2021-08-06 16:12:38 +08:00
|
|
|
free_pages:
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
if (pages[i]) {
|
|
|
|
unlock_page(pages[i]);
|
|
|
|
put_page(pages[i]);
|
2020-11-04 19:07:33 +08:00
|
|
|
}
|
|
|
|
}
|
2021-08-06 16:12:38 +08:00
|
|
|
kfree(pages);
|
|
|
|
return ret;
|
|
|
|
}
|
2020-11-04 19:07:33 +08:00
|
|
|
|
2021-08-06 16:12:39 +08:00
|
|
|
static int defrag_one_cluster(struct btrfs_inode *inode,
|
|
|
|
struct file_ra_state *ra,
|
|
|
|
u64 start, u32 len, u32 extent_thresh,
|
|
|
|
u64 newer_than, bool do_compress,
|
|
|
|
unsigned long *sectors_defragged,
|
2022-02-11 14:41:39 +08:00
|
|
|
unsigned long max_sectors,
|
|
|
|
u64 *last_scanned_ret)
|
2021-08-06 16:12:39 +08:00
|
|
|
{
|
|
|
|
const u32 sectorsize = inode->root->fs_info->sectorsize;
|
|
|
|
struct defrag_target_range *entry;
|
|
|
|
struct defrag_target_range *tmp;
|
|
|
|
LIST_HEAD(target_list);
|
|
|
|
int ret;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:39 +08:00
|
|
|
ret = defrag_collect_targets(inode, start, len, extent_thresh,
|
|
|
|
newer_than, do_compress, false,
|
2022-02-11 14:41:39 +08:00
|
|
|
&target_list, NULL);
|
2021-08-06 16:12:39 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:39 +08:00
|
|
|
list_for_each_entry(entry, &target_list, list) {
|
|
|
|
u32 range_len = entry->len;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2022-01-18 15:19:04 +08:00
|
|
|
/* Reached or beyond the limit */
|
btrfs: defrag: properly update range->start for autodefrag
[BUG]
After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to
implement btrfs_defrag_file()") autodefrag no longer properly re-defrags
the file from the previously finished location.
[CAUSE]
The recent refactoring of defrag only focused on defrag ioctl subpage
support and didn't take autodefrag into consideration.
There are two problems involved which prevent autodefrag from restarting its
scan:
- No range.start update
Previously when one defrag target is found, range->start will be
updated to indicate where next search should start from.
But now btrfs_defrag_file() doesn't update it anymore, making every
autodefrag run rescan from file offset 0.
This would also make autodefrag mark the same range dirty again and
again, causing extra IO.
- No proper quick exit for defrag_one_cluster()
Currently if we reach or exceed the @max_sectors limit, we just exit
defrag_one_cluster(), and let the next defrag_one_cluster() call do a
quick exit.
This makes @cur increase, thus no way to properly know which range is
defragged and which range is skipped.
[FIX]
The fix involves the following modifications:
- Update range->start to next cluster start
This is a little different from the old behavior.
Previously range->start is updated to the next defrag target.
But in the end, the behavior should still be pretty much the same,
as now we skip to next defrag target inside btrfs_defrag_file().
Thus if auto-defrag determines to re-scan, then we still do the skip,
just at a different timing.
- Make defrag_one_cluster() return >0 to indicate a quick exit
So that btrfs_defrag_file() can also do a quick exit, without
increasing @cur to the range end, and re-use @cur to update
@range->start.
- Add comment for btrfs_defrag_file() to mention the range->start update
Currently only autodefrag utilizes this behavior, as the defrag ioctl won't
set the @max_to_defrag parameter, thus unless interrupted it will always
try to defrag the whole range.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-18 19:53:52 +08:00
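The quick-exit protocol described in the commit message above (a positive return value tells the caller to stop without advancing its cursor) can be sketched in user space as follows. This is a simplified illustration, not the kernel code: `sketch_one_cluster()` and `sketch_defrag_file()` are hypothetical names, and each cluster is assumed to defrag a fixed number of sectors. In the real function the limit is checked per target range.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CLUSTER 262144ULL	/* 256K, same value as CLUSTER_SIZE */

/*
 * Pretend to defrag one cluster. Returns 1 (quick exit) if the sector
 * limit was already reached, 0 otherwise. @max == 0 means "no limit".
 */
static int sketch_one_cluster(unsigned long *defragged, unsigned long max,
			      unsigned long per_cluster)
{
	if (max && *defragged >= max)
		return 1;
	*defragged += per_cluster;
	return 0;
}

/* Returns the offset from which the next scan should resume. */
static uint64_t sketch_defrag_file(uint64_t start, uint64_t end,
				   unsigned long max_sectors,
				   unsigned long per_cluster,
				   unsigned long *defragged)
{
	uint64_t cur = start;

	assert(start <= end);
	while (cur < end) {
		if (sketch_one_cluster(defragged, max_sectors, per_cluster) > 0)
			break;	/* limit hit: leave cur where it is */
		cur += SKETCH_CLUSTER;
	}
	return cur;
}
```

Because the cursor is not advanced on a quick exit, the returned offset marks the first cluster that was not processed, which is the value a caller would store as the next starting point.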
|
|
|
if (max_sectors && *sectors_defragged >= max_sectors) {
|
|
|
|
ret = 1;
|
2021-08-06 16:12:39 +08:00
|
|
|
break;
|
btrfs: defrag: properly update range->start for autodefrag
[BUG]
After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to
implement btrfs_defrag_file()") autodefrag no longer properly re-defrag
the file from previously finished location.
[CAUSE]
The recent refactoring of defrag only focuses on defrag ioctl subpage
support, doesn't take autodefrag into consideration.
There are two problems involved which prevents autodefrag to restart its
scan:
- No range.start update
Previously when one defrag target is found, range->start will be
updated to indicate where next search should start from.
But now btrfs_defrag_file() doesn't update it anymore, making all
autodefrag to rescan from file offset 0.
This would also make autodefrag to mark the same range dirty again and
again, causing extra IO.
- No proper quick exit for defrag_one_cluster()
Currently if we reached or exceed @max_sectors limit, we just exit
defrag_one_cluster(), and let next defrag_one_cluster() call to do a
quick exit.
This makes @cur increase, thus no way to properly know which range is
defragged and which range is skipped.
[FIX]
The fix involves two modifications:
- Update range->start to next cluster start
This is a little different from the old behavior.
Previously range->start is updated to the next defrag target.
But in the end, the behavior should still be pretty much the same,
as now we skip to next defrag target inside btrfs_defrag_file().
Thus if auto-defrag determines to re-scan, then we still do the skip,
just at a different timing.
- Make defrag_one_cluster() to return >0 to indicate a quick exit
So that btrfs_defrag_file() can also do a quick exit, without
increasing @cur to the range end, and re-use @cur to update
@range->start.
- Add comment for btrfs_defrag_file() to mention the range->start update
Currently only autodefrag utilize this behavior, as defrag ioctl won't
set @max_to_defrag parameter, thus unless interrupted it will always
try to defrag the whole range.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
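The quick-exit and resume behavior described in this commit message can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: sketch_one_cluster() and sketch_defrag_file() are hypothetical stand-ins for defrag_one_cluster() and btrfs_defrag_file(), every sector is treated as a defrag target, and the 4 KiB sector size is assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CLUSTER_BYTES 262144ULL /* 256 KiB cluster, as SZ_256K in the code */
#define SECTOR_BYTES  4096ULL   /* assumed sector size for this sketch */

/*
 * Hypothetical stand-in for defrag_one_cluster(): treats every sector in
 * the cluster as a target. Returns 1 (quick exit) as soon as the sector
 * budget is exhausted, 0 if the whole cluster was processed.
 */
static int sketch_one_cluster(uint64_t len, uint64_t *sectors_done,
                              uint64_t max_sectors)
{
    uint64_t sectors = len / SECTOR_BYTES;
    uint64_t i;

    for (i = 0; i < sectors; i++) {
        if (max_sectors && *sectors_done >= max_sectors)
            return 1;
        (*sectors_done)++;
    }
    return 0;
}

/*
 * Hypothetical stand-in for btrfs_defrag_file(): walks [*start, end)
 * cluster by cluster, stops on a quick exit, and writes the next cluster
 * start back to *start so a later call can resume from there.
 * Returns the number of sectors processed.
 */
static uint64_t sketch_defrag_file(uint64_t *start, uint64_t end,
                                   uint64_t max_sectors)
{
    uint64_t cur = *start;
    uint64_t sectors_done = 0;

    assert(cur % SECTOR_BYTES == 0 && end % SECTOR_BYTES == 0);
    while (cur < end) {
        uint64_t len = end - cur < CLUSTER_BYTES ? end - cur : CLUSTER_BYTES;
        int ret = sketch_one_cluster(len, &sectors_done, max_sectors);

        cur += len;  /* move to the next cluster start */
        if (ret > 0) /* budget exhausted: stop without scanning further */
            break;
    }
    *start = cur;    /* resume point for the next run */
    return sectors_done;
}
```

In this model, calling sketch_defrag_file() twice on a 1 MiB range with a budget of 100 sectors processes 100 sectors per call, and the second call starts from the offset stored by the first instead of rescanning from offset 0.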
2022-01-18 19:53:52 +08:00
|
|
|
}
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2021-08-06 16:12:39 +08:00
|
|
|
if (max_sectors)
|
|
|
|
range_len = min_t(u32, range_len,
|
|
|
|
(max_sectors - *sectors_defragged) * sectorsize);
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2022-02-11 14:41:39 +08:00
|
|
|
/*
|
|
|
|
* If defrag_one_range() has updated last_scanned_ret,
|
|
|
|
* our range may already be invalid (e.g. hole punched).
|
|
|
|
* Skip if our range is before last_scanned_ret, as there is
|
|
|
|
* no need to defrag the range anymore.
|
|
|
|
*/
|
|
|
|
if (entry->start + range_len <= *last_scanned_ret)
|
|
|
|
continue;
|
|
|
|
|
2021-08-06 16:12:39 +08:00
|
|
|
if (ra)
|
|
|
|
page_cache_sync_readahead(inode->vfs_inode.i_mapping,
|
|
|
|
ra, NULL, entry->start >> PAGE_SHIFT,
|
|
|
|
((entry->start + range_len - 1) >> PAGE_SHIFT) -
|
|
|
|
(entry->start >> PAGE_SHIFT) + 1);
|
|
|
|
/*
|
|
|
|
* Here we may not defrag any range if holes are punched before
|
|
|
|
* we locked the pages.
|
|
|
|
* But that's fine, it only affects the @sectors_defragged
|
|
|
|
* accounting.
|
|
|
|
*/
|
|
|
|
ret = defrag_one_range(inode, entry->start, range_len,
|
2022-02-11 14:41:39 +08:00
|
|
|
extent_thresh, newer_than, do_compress,
|
|
|
|
last_scanned_ret);
|
2021-08-06 16:12:39 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
2022-01-18 15:19:04 +08:00
|
|
|
*sectors_defragged += range_len >>
|
|
|
|
inode->root->fs_info->sectorsize_bits;
|
2011-05-25 03:35:30 +08:00
|
|
|
}
|
|
|
|
out:
|
2021-08-06 16:12:39 +08:00
|
|
|
list_for_each_entry_safe(entry, tmp, &target_list, list) {
|
|
|
|
list_del_init(&entry->list);
|
|
|
|
kfree(entry);
|
2011-05-25 03:35:30 +08:00
|
|
|
}
|
2022-02-11 14:41:39 +08:00
|
|
|
if (ret >= 0)
|
|
|
|
*last_scanned_ret = max(*last_scanned_ret, start + len);
|
2011-05-25 03:35:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-08-06 16:12:32 +08:00
|
|
|
/*
|
|
|
|
* Entry point to file defragmentation.
|
|
|
|
*
|
|
|
|
* @inode: inode to be defragged
|
|
|
|
* @ra: readahead state (can be NULL)
|
|
|
|
* @range: defrag options including range and flags
|
|
|
|
* @newer_than: minimum transid to defrag
|
|
|
|
* @max_to_defrag: max number of sectors to be defragged, if 0, the whole inode
|
|
|
|
* will be defragged.
|
2022-01-18 15:19:04 +08:00
|
|
|
*
|
|
|
|
* Return <0 for error.
|
2022-01-18 19:53:52 +08:00
|
|
|
* Return >=0 for the number of sectors defragged, and range->start will be updated
|
|
|
|
* to indicate the file offset where next defrag should be started at.
|
|
|
|
* (Mostly for autodefrag, which sets @max_to_defrag thus we may exit early without
|
|
|
|
* defragging all the range).
|
2021-08-06 16:12:32 +08:00
|
|
|
*/
|
|
|
|
int btrfs_defrag_file(struct inode *inode, struct file_ra_state *ra,
|
2011-05-25 03:35:30 +08:00
|
|
|
struct btrfs_ioctl_defrag_range_args *range,
|
|
|
|
u64 newer_than, unsigned long max_to_defrag)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2021-05-27 20:33:22 +08:00
|
|
|
unsigned long sectors_defragged = 0;
|
2011-09-02 15:56:39 +08:00
|
|
|
u64 isize = i_size_read(inode);
|
2021-05-27 20:33:22 +08:00
|
|
|
u64 cur;
|
|
|
|
u64 last_byte;
|
2017-07-18 02:01:59 +08:00
|
|
|
bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS;
|
2021-08-06 16:12:32 +08:00
|
|
|
bool ra_allocated = false;
|
2010-10-25 15:12:50 +08:00
|
|
|
int compress_type = BTRFS_COMPRESS_ZLIB;
|
2021-05-27 20:33:22 +08:00
|
|
|
int ret = 0;
|
2014-07-29 23:32:10 +08:00
|
|
|
u32 extent_thresh = range->extent_thresh;
|
btrfs: update writeback index when starting defrag
When starting a defrag, we should update the writeback index of the
inode's mapping in case it currently has a value beyond the start of the
range we are defragging. This can help performance and often result in
getting less extents after writeback - for e.g., if the current value
of the writeback index sits somewhere in the middle of a range that
gets dirty by the defrag, then after writeback we can get two smaller
extents instead of a single, larger extent.
We used to have this before the refactoring in 5.16, but it was removed
without any reason to do so. Originally it was added in kernel 3.1, by
commit 2a0f7f5769992b ("Btrfs: fix recursive auto-defrag"), in order to
fix a loop with autodefrag resulting in dirtying and writing pages over
and over, but some testing on current code did not show that happening,
at least with the test described in that commit.
So add back the behaviour, as at the very least it is a nice to have
optimization.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
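The writeback index adjustment described above amounts to a one-sided clamp. The following is a minimal sketch, assuming 4 KiB pages; struct sketch_mapping and sketch_rewind_writeback() are hypothetical names that only model the writeback_index field, not the kernel's address_space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumes 4 KiB pages */

/* Hypothetical, reduced stand-in for the mapping's writeback cursor. */
struct sketch_mapping {
    uint64_t writeback_index; /* page index where writeback resumes */
};

/*
 * Pull the writeback cursor back to the first page of the defrag range,
 * but only if it currently sits beyond that page. A cursor that is
 * already at or before the range start is left alone.
 */
static void sketch_rewind_writeback(struct sketch_mapping *mapping,
                                    uint64_t defrag_start_byte)
{
    uint64_t start_index = defrag_start_byte >> SKETCH_PAGE_SHIFT;

    assert(mapping != NULL);
    if (start_index < mapping->writeback_index)
        mapping->writeback_index = start_index;
}
```

With the cursor rewound, writeback covers the dirtied range from its first page onward, which is what gives the single larger extent mentioned in the message a chance to form.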
2022-01-21 01:41:17 +08:00
|
|
|
pgoff_t start_index;
|
2011-05-25 03:35:30 +08:00
|
|
|
|
2013-04-16 17:20:28 +08:00
|
|
|
if (isize == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (range->start >= isize)
|
|
|
|
return -EINVAL;
|
2010-10-25 15:12:50 +08:00
|
|
|
|
2017-07-18 02:01:59 +08:00
|
|
|
if (do_compress) {
|
2019-10-10 15:59:57 +08:00
|
|
|
if (range->compress_type >= BTRFS_NR_COMPRESS_TYPES)
|
2010-10-25 15:12:50 +08:00
|
|
|
return -EINVAL;
|
|
|
|
if (range->compress_type)
|
|
|
|
compress_type = range->compress_type;
|
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2013-04-16 17:20:28 +08:00
|
|
|
if (extent_thresh == 0)
|
2015-12-15 00:42:10 +08:00
|
|
|
extent_thresh = SZ_256K;
|
2010-03-10 23:52:59 +08:00
|
|
|
|
2021-05-27 20:33:22 +08:00
|
|
|
if (range->start + range->len > range->start) {
|
|
|
|
/* Got a specific range */
|
2022-01-18 00:28:29 +08:00
|
|
|
last_byte = min(isize, range->start + range->len);
|
2021-05-27 20:33:22 +08:00
|
|
|
} else {
|
|
|
|
/* Defrag until file end */
|
2022-01-18 00:28:29 +08:00
|
|
|
last_byte = isize;
|
2021-05-27 20:33:22 +08:00
|
|
|
}
|
|
|
|
|
2022-01-18 00:28:29 +08:00
|
|
|
/* Align the range */
|
|
|
|
cur = round_down(range->start, fs_info->sectorsize);
|
|
|
|
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
|
|
|
|
|
2011-05-25 03:35:30 +08:00
|
|
|
/*
|
2021-08-06 16:12:32 +08:00
|
|
|
* If we were not given a ra, allocate a readahead context. As
|
2017-06-22 09:22:58 +08:00
|
|
|
* readahead is just an optimization, defrag will work without it so
|
|
|
|
* we don't error out.
|
2011-05-25 03:35:30 +08:00
|
|
|
*/
|
2021-08-06 16:12:32 +08:00
|
|
|
if (!ra) {
|
|
|
|
ra_allocated = true;
|
2017-06-22 09:13:02 +08:00
|
|
|
ra = kzalloc(sizeof(*ra), GFP_KERNEL);
|
2017-06-22 09:22:58 +08:00
|
|
|
if (ra)
|
|
|
|
file_ra_state_init(ra, inode->i_mapping);
|
2011-05-25 03:35:30 +08:00
|
|
|
}
|
|
|
|
|
2022-01-21 01:41:17 +08:00
|
|
|
/*
|
|
|
|
* Make writeback start from the beginning of the range, so that the
|
|
|
|
* defrag range can be written sequentially.
|
|
|
|
*/
|
|
|
|
start_index = cur >> PAGE_SHIFT;
|
|
|
|
if (start_index < inode->i_mapping->writeback_index)
|
|
|
|
inode->i_mapping->writeback_index = start_index;
|
|
|
|
|
2021-05-27 20:33:22 +08:00
|
|
|
while (cur < last_byte) {
|
2022-01-21 01:11:52 +08:00
|
|
|
const unsigned long prev_sectors_defragged = sectors_defragged;
|
2022-02-11 14:41:39 +08:00
|
|
|
u64 last_scanned = cur;
|
2021-05-27 20:33:22 +08:00
|
|
|
u64 cluster_end;
|
2011-09-02 15:57:07 +08:00
|
|
|
|
btrfs: allow defrag to be interruptible
During defrag, at btrfs_defrag_file(), we have this loop that iterates
over a file range in steps no larger than 256K subranges. If the range
is too long, there's no way to interrupt it. So make the loop check in
each iteration if there's signal pending, and if there is, break and
return -EAGAIN to userspace.
Before kernel 5.16, we used to allow defrag to be cancelled through a
signal, but that was lost with commit 7b508037d4cac3 ("btrfs: defrag:
use defrag_one_cluster() to implement btrfs_defrag_file()").
This change adds back the possibility to cancel a defrag with a signal
and keeps the same semantics, returning -EAGAIN to user space (and not
the usually more expected -EINTR).
This is also motivated by a recent bug on 5.16 where defragging a 1 byte
file resulted in iterating from file range 0 to (u64)-1, as hitting the
bug triggered a too long loop, basically requiring one to reboot the
machine, as it was not possible to cancel defrag.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
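The per-iteration cancellation check can be illustrated with a small userspace model. This is a sketch, not the kernel implementation: struct sketch_cancel and sketch_cancelled() are hypothetical and replace the pending-signal test with a simple step counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define STEP_BYTES 262144ULL /* 256 KiB per iteration */

/* Hypothetical cancellation source standing in for a pending signal. */
struct sketch_cancel {
    uint64_t cancel_after_steps; /* 0 means never cancel */
    uint64_t steps_seen;
};

static bool sketch_cancelled(const struct sketch_cancel *c)
{
    return c->cancel_after_steps && c->steps_seen >= c->cancel_after_steps;
}

/*
 * Walk [start, end) in 256 KiB steps, checking for cancellation at the top
 * of every iteration. Stores the offset reached in *reached and returns 0
 * on completion or -EAGAIN when interrupted.
 */
static int sketch_walk(uint64_t start, uint64_t end,
                       struct sketch_cancel *c, uint64_t *reached)
{
    uint64_t cur = start;
    int ret = 0;

    assert(start <= end);
    while (cur < end) {
        if (sketch_cancelled(c)) {
            ret = -EAGAIN;
            break;
        }
        c->steps_seen++;
        /* Clamp the last step so cur never overflows past end. */
        cur = end - cur < STEP_BYTES ? end : cur + STEP_BYTES;
    }
    *reached = cur;
    return ret;
}
```

Because the check runs once per step, even a walk over a range ending at UINT64_MAX stops within one iteration of the cancel request, which is the situation the 1 byte file bug mentioned above made painful.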
2022-01-18 21:43:31 +08:00
|
|
|
if (btrfs_defrag_cancelled(fs_info)) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2021-05-27 20:33:22 +08:00
|
|
|
/* We want the cluster end at page boundary when possible */
|
|
|
|
cluster_end = (((cur >> PAGE_SHIFT) +
|
|
|
|
(SZ_256K >> PAGE_SHIFT)) << PAGE_SHIFT) - 1;
|
|
|
|
cluster_end = min(cluster_end, last_byte);
|
2010-03-10 23:52:59 +08:00
|
|
|
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_lock(inode, 0);
|
Btrfs: prevent ioctls from interfering with a swap file
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-04 01:28:12 +08:00
|
|
|
if (IS_SWAPFILE(inode)) {
|
|
|
|
ret = -ETXTBSY;
|
2021-05-27 20:33:22 +08:00
|
|
|
btrfs_inode_unlock(inode, 0);
|
|
|
|
break;
|
2016-11-04 01:28:12 +08:00
|
|
|
}
|
2021-05-27 20:33:22 +08:00
|
|
|
if (!(inode->i_sb->s_flags & SB_ACTIVE)) {
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(inode, 0);
|
2021-05-27 20:33:22 +08:00
|
|
|
break;
|
2012-03-29 21:57:44 +08:00
|
|
|
}
|
2021-05-27 20:33:22 +08:00
|
|
|
if (do_compress)
|
|
|
|
BTRFS_I(inode)->defrag_compress = compress_type;
|
|
|
|
ret = defrag_one_cluster(BTRFS_I(inode), ra, cur,
|
|
|
|
cluster_end + 1 - cur, extent_thresh,
|
2022-02-11 14:41:39 +08:00
|
|
|
newer_than, do_compress, &sectors_defragged,
|
|
|
|
max_to_defrag, &last_scanned);
|
2022-01-21 01:11:52 +08:00
|
|
|
|
|
|
|
if (sectors_defragged > prev_sectors_defragged)
|
|
|
|
balance_dirty_pages_ratelimited(inode->i_mapping);
|
|
|
|
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(inode, 0);
|
2021-05-27 20:33:22 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
2022-02-11 14:41:39 +08:00
|
|
|
cur = max(cluster_end + 1, last_scanned);
|
2022-01-18 19:53:52 +08:00
|
|
|
if (ret > 0) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
}
|
2022-01-30 20:53:15 +08:00
|
|
|
cond_resched();
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
2021-05-27 20:33:22 +08:00
|
|
|
if (ra_allocated)
|
|
|
|
kfree(ra);
|
2022-01-18 19:53:52 +08:00
|
|
|
/*
|
|
|
|
* Update range.start for autodefrag, this will indicate where to start
|
|
|
|
* in next run.
|
|
|
|
*/
|
|
|
|
range->start = cur;
|
2021-05-27 20:33:22 +08:00
|
|
|
if (sectors_defragged) {
|
|
|
|
/*
|
|
|
|
* We have defragged some sectors, for compression case they
|
|
|
|
* need to be written back immediately.
|
|
|
|
*/
|
|
|
|
if (range->flags & BTRFS_DEFRAG_RANGE_START_IO) {
|
2014-03-01 18:55:54 +08:00
|
|
|
filemap_flush(inode->i_mapping);
|
2021-05-27 20:33:22 +08:00
|
|
|
if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
|
|
|
|
&BTRFS_I(inode)->runtime_flags))
|
|
|
|
filemap_flush(inode->i_mapping);
|
|
|
|
}
|
|
|
|
if (range->compress_type == BTRFS_COMPRESS_LZO)
|
|
|
|
btrfs_set_fs_incompat(fs_info, COMPRESS_LZO);
|
|
|
|
else if (range->compress_type == BTRFS_COMPRESS_ZSTD)
|
|
|
|
btrfs_set_fs_incompat(fs_info, COMPRESS_ZSTD);
|
|
|
|
ret = sectors_defragged;
|
2014-03-01 18:55:54 +08:00
|
|
|
}
|
2017-07-18 02:01:59 +08:00
|
|
|
if (do_compress) {
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_lock(inode, 0);
|
2017-07-18 01:41:31 +08:00
|
|
|
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(inode, 0);
|
2013-08-16 22:23:33 +08:00
|
|
|
}
|
2010-03-10 23:52:59 +08:00
|
|
|
return ret;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
2021-05-15 03:32:44 +08:00
|
|
|
/*
|
|
|
|
* Try to start exclusive operation @type or cancel it if it's running.
|
|
|
|
*
|
|
|
|
* Return:
|
|
|
|
* 0 - normal mode, newly claimed op started
|
|
|
|
* >0 - normal mode, something else is running,
|
|
|
|
* return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS to user space
|
|
|
|
* -ECANCELED - cancel mode, successful cancel
|
|
|
|
* -ENOTCONN - cancel mode, operation not running anymore
|
|
|
|
*/
|
|
|
|
static int exclop_start_or_cancel_reloc(struct btrfs_fs_info *fs_info,
|
|
|
|
enum btrfs_exclusive_operation type, bool cancel)
|
|
|
|
{
|
|
|
|
if (!cancel) {
|
|
|
|
/* Start normal op */
|
|
|
|
if (!btrfs_exclop_start(fs_info, type))
|
|
|
|
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
|
|
|
|
/* Exclusive operation is now claimed */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Cancel running op */
|
|
|
|
if (btrfs_exclop_start_try_lock(fs_info, type)) {
|
|
|
|
/*
|
|
|
|
* This blocks any exclop finish from setting it to NONE, so we
|
|
|
|
* request cancellation. Either it runs and we will wait for it,
|
|
|
|
* or it has finished and no waiting will happen.
|
|
|
|
*/
|
|
|
|
atomic_inc(&fs_info->reloc_cancel_req);
|
|
|
|
btrfs_exclop_start_unlock(fs_info);
|
|
|
|
|
|
|
|
if (test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags))
|
|
|
|
wait_on_bit(&fs_info->flags, BTRFS_FS_RELOC_RUNNING,
|
|
|
|
TASK_INTERRUPTIBLE);
|
|
|
|
|
|
|
|
return -ECANCELED;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Something else is running or none */
|
|
|
|
return -ENOTCONN;
|
|
|
|
}
|
|
|
|
|
2012-11-26 16:43:45 +08:00
|
|
|
static noinline int btrfs_ioctl_resize(struct file *file,
|
2009-09-22 04:00:26 +08:00
|
|
|
void __user *arg)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2008-06-12 09:53:53 +08:00
|
|
|
u64 new_size;
|
|
|
|
u64 old_size;
|
|
|
|
u64 devid = 1;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_ioctl_vol_args *vol_args;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
struct btrfs_device *device = NULL;
|
|
|
|
char *sizestr;
|
2014-03-31 18:03:25 +08:00
|
|
|
char *retptr;
|
2008-06-12 09:53:53 +08:00
|
|
|
char *devstr = NULL;
|
|
|
|
int ret = 0;
|
|
|
|
int mod = 0;
|
2021-05-19 03:12:33 +08:00
|
|
|
bool cancel;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2009-01-06 05:57:23 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2012-11-26 16:43:45 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2021-05-19 03:12:33 +08:00
|
|
|
/*
|
|
|
|
* Read the arguments before checking exclusivity to be able to
|
|
|
|
* distinguish regular resize and cancel
|
|
|
|
*/
|
2009-04-08 15:06:54 +08:00
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
2012-01-17 04:04:47 +08:00
|
|
|
if (IS_ERR(vol_args)) {
|
|
|
|
ret = PTR_ERR(vol_args);
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_drop;
|
2012-01-17 04:04:47 +08:00
|
|
|
}
|
2008-07-25 00:20:14 +08:00
|
|
|
vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
|
2008-06-12 09:53:53 +08:00
|
|
|
sizestr = vol_args->name;
|
2021-05-19 03:12:33 +08:00
|
|
|
cancel = (strcmp("cancel", sizestr) == 0);
|
|
|
|
ret = exclop_start_or_cancel_reloc(fs_info, BTRFS_EXCLOP_RESIZE, cancel);
|
|
|
|
if (ret)
|
|
|
|
goto out_free;
|
|
|
|
/* Exclusive operation is now claimed */
|
|
|
|
|
2008-06-12 09:53:53 +08:00
|
|
|
devstr = strchr(sizestr, ':');
|
|
|
|
if (devstr) {
|
|
|
|
sizestr = devstr + 1;
|
|
|
|
*devstr = '\0';
|
|
|
|
devstr = vol_args->name;
|
2014-05-13 16:36:08 +08:00
|
|
|
ret = kstrtoull(devstr, 10, &devid);
|
|
|
|
if (ret)
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2012-12-21 17:21:30 +08:00
|
|
|
if (!devid) {
|
|
|
|
ret = -EINVAL;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2012-12-21 17:21:30 +08:00
|
|
|
}
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_info(fs_info, "resizing devid %llu", devid);
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
2012-12-21 17:19:51 +08:00
|
|
|
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = devid;
|
|
|
|
device = btrfs_find_device(fs_info->fs_devices, &args);
|
2008-06-12 09:53:53 +08:00
|
|
|
if (!device) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_info(fs_info, "resizer unable to find device %llu",
|
|
|
|
devid);
|
2012-12-21 17:21:30 +08:00
|
|
|
ret = -ENODEV;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
2012-12-21 17:19:51 +08:00
|
|
|
|
2017-12-04 12:54:52 +08:00
|
|
|
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_info(fs_info,
|
2013-12-21 00:37:06 +08:00
|
|
|
"resizer unable to apply on readonly device %llu",
|
2013-08-20 19:20:07 +08:00
|
|
|
devid);
|
2012-12-21 17:21:30 +08:00
|
|
|
ret = -EPERM;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2012-06-14 16:23:19 +08:00
|
|
|
}
|
|
|
|
|
2008-06-12 09:53:53 +08:00
|
|
|
if (!strcmp(sizestr, "max"))
|
2021-10-18 18:11:12 +08:00
|
|
|
new_size = bdev_nr_bytes(device->bdev);
|
2008-06-12 09:53:53 +08:00
|
|
|
else {
|
|
|
|
if (sizestr[0] == '-') {
|
|
|
|
mod = -1;
|
|
|
|
sizestr++;
|
|
|
|
} else if (sizestr[0] == '+') {
|
|
|
|
mod = 1;
|
|
|
|
sizestr++;
|
|
|
|
}
|
2014-03-31 18:03:25 +08:00
|
|
|
new_size = memparse(sizestr, &retptr);
|
|
|
|
if (*retptr != '\0' || new_size == 0) {
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = -EINVAL;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-12-04 12:54:55 +08:00
|
|
|
if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
|
2012-12-21 17:21:30 +08:00
|
|
|
ret = -EPERM;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2012-11-06 01:29:28 +08:00
|
|
|
}
|
|
|
|
|
2014-09-03 21:35:38 +08:00
|
|
|
old_size = btrfs_device_get_total_bytes(device);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
if (mod < 0) {
|
|
|
|
if (new_size > old_size) {
|
|
|
|
ret = -EINVAL;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
new_size = old_size - new_size;
|
|
|
|
} else if (mod > 0) {
|
2013-12-20 15:28:56 +08:00
|
|
|
if (new_size > ULLONG_MAX - old_size) {
|
2014-05-29 09:19:58 +08:00
|
|
|
ret = -ERANGE;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2013-12-20 15:28:56 +08:00
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
new_size = old_size + new_size;
|
|
|
|
}
|
|
|
|
|
2015-12-15 00:42:10 +08:00
|
|
|
if (new_size < SZ_256M) {
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = -EINVAL;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
2021-10-18 18:11:12 +08:00
|
|
|
if (new_size > bdev_nr_bytes(device->bdev)) {
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = -EFBIG;
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
2017-07-18 20:39:08 +08:00
|
|
|
new_size = round_down(new_size, fs_info->sectorsize);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
if (new_size > old_size) {
|
2010-05-16 22:48:46 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
2021-05-19 03:12:33 +08:00
|
|
|
goto out_finish;
|
2011-01-20 14:19:37 +08:00
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = btrfs_grow_device(trans, device, new_size);
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_commit_transaction(trans);
|
2011-11-19 02:55:01 +08:00
|
|
|
} else if (new_size < old_size) {
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = btrfs_shrink_device(device, new_size);
|
2012-10-27 20:06:39 +08:00
|
|
|
} /* equal, nothing to do */
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2020-02-11 21:55:26 +08:00
|
|
|
if (ret == 0 && new_size != old_size)
|
|
|
|
btrfs_info_in_rcu(fs_info,
|
|
|
|
"resize device %s (devid %llu) from %llu to %llu",
|
|
|
|
rcu_str_deref(device->name), device->devid,
|
|
|
|
old_size, new_size);
|
2021-05-19 03:12:33 +08:00
|
|
|
out_finish:
|
|
|
|
btrfs_exclop_finish(fs_info);
|
2012-01-17 04:04:47 +08:00
|
|
|
out_free:
|
2008-06-12 09:53:53 +08:00
|
|
|
kfree(vol_args);
|
2021-05-19 03:12:33 +08:00
|
|
|
out_drop:
|
2013-01-20 21:57:57 +08:00
|
|
|
mnt_drop_write_file(file);
|
2008-06-12 09:53:53 +08:00
|
|
|
return ret;
|
|
|
|
}
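The size handling in the resize path above can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: `parse_size()` is a hypothetical stand-in for the kernel's `memparse()`, and `resolve_new_size()`, `MIN_DEVICE_SIZE` and the parameter names are made up for the example. It assumes the caller already knows the old size, the block device size and the sector size.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MIN_DEVICE_SIZE (256ULL * 1024 * 1024)	/* mirrors the SZ_256M check */

/* Hypothetical stand-in for memparse(): a number plus an optional K/M/G/T suffix. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *p;
	uint64_t v = strtoull(s, &p, 0);
	unsigned int shift = 0;

	switch (*p) {
	case 'K': case 'k': shift = 10; break;
	case 'M': case 'm': shift = 20; break;
	case 'G': case 'g': shift = 30; break;
	case 'T': case 't': shift = 40; break;
	default: break;
	}
	if (shift) {
		v <<= shift;
		p++;
	}
	*end = p;
	return v;
}

/*
 * Turn "max", "<size>", "+<size>" or "-<size>" into an absolute size,
 * applying the same checks, in the same order, as the code above.
 */
static int resolve_new_size(const char *sizestr, uint64_t old_size,
			    uint64_t dev_size, uint64_t sectorsize,
			    uint64_t *out)
{
	uint64_t new_size;
	int mod = 0;

	if (strcmp(sizestr, "max") == 0) {
		new_size = dev_size;
	} else {
		const char *end;

		if (sizestr[0] == '-') {
			mod = -1;
			sizestr++;
		} else if (sizestr[0] == '+') {
			mod = 1;
			sizestr++;
		}
		new_size = parse_size(sizestr, &end);
		if (*end != '\0' || new_size == 0)
			return -EINVAL;
	}

	if (mod < 0) {
		if (new_size > old_size)
			return -EINVAL;
		new_size = old_size - new_size;
	} else if (mod > 0) {
		if (new_size > UINT64_MAX - old_size)
			return -ERANGE;		/* addition would overflow */
		new_size = old_size + new_size;
	}

	if (new_size < MIN_DEVICE_SIZE)
		return -EINVAL;
	if (new_size > dev_size)
		return -EFBIG;

	/* Equivalent of round_down(new_size, sectorsize). */
	*out = new_size - (new_size % sectorsize);
	return 0;
}
```

Note that the overflow check for the `+` case compares against `UINT64_MAX - old_size` before adding, so the sum is never computed when it could wrap.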
|
|
|
|
|
2020-03-13 23:23:19 +08:00
|
|
|
static noinline int __btrfs_ioctl_snap_create(struct file *file,
|
2021-07-27 18:48:52 +08:00
|
|
|
struct user_namespace *mnt_userns,
|
2017-02-15 01:33:53 +08:00
|
|
|
const char *name, unsigned long fd, int subvol,
|
2020-03-13 23:23:19 +08:00
|
|
|
bool readonly,
|
2013-02-07 14:02:44 +08:00
|
|
|
struct btrfs_qgroup_inherit *inherit)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
|
|
|
int namelen;
|
2008-11-18 10:02:50 +08:00
|
|
|
int ret = 0;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2016-09-21 20:31:29 +08:00
|
|
|
if (!S_ISDIR(file_inode(file)->i_mode))
|
|
|
|
return -ENOTDIR;
|
|
|
|
|
2012-06-29 17:58:46 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2010-10-30 03:41:32 +08:00
|
|
|
namelen = strlen(name);
|
|
|
|
if (strchr(name, '/')) {
|
2008-06-12 09:53:53 +08:00
|
|
|
ret = -EINVAL;
|
2012-06-29 17:58:46 +08:00
|
|
|
goto out_drop_write;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
2012-02-21 11:14:55 +08:00
|
|
|
if (name[0] == '.' &&
|
|
|
|
(namelen == 1 || (name[1] == '.' && namelen == 2))) {
|
|
|
|
ret = -EEXIST;
|
2012-06-29 17:58:46 +08:00
|
|
|
goto out_drop_write;
|
2012-02-21 11:14:55 +08:00
|
|
|
}
|
|
|
|
|
2008-11-18 10:02:50 +08:00
|
|
|
if (subvol) {
|
2021-07-27 18:48:52 +08:00
|
|
|
ret = btrfs_mksubvol(&file->f_path, mnt_userns, name,
|
|
|
|
namelen, NULL, readonly, inherit);
|
2008-10-10 01:39:39 +08:00
|
|
|
} else {
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd src = fdget(fd);
|
2008-11-18 10:02:50 +08:00
|
|
|
struct inode *src_inode;
|
2012-08-29 00:52:22 +08:00
|
|
|
if (!src.file) {
|
2008-11-18 10:02:50 +08:00
|
|
|
ret = -EINVAL;
|
2012-06-29 17:58:46 +08:00
|
|
|
goto out_drop_write;
|
2008-11-18 10:02:50 +08:00
|
|
|
}
|
|
|
|
|
2013-01-24 06:07:38 +08:00
|
|
|
src_inode = file_inode(src.file);
|
|
|
|
if (src_inode->i_sb != file_inode(file)->i_sb) {
|
2016-03-25 22:02:41 +08:00
|
|
|
btrfs_info(BTRFS_I(file_inode(file))->root->fs_info,
|
2013-12-21 00:37:06 +08:00
|
|
|
"Snapshot src from another FS");
|
2014-01-30 15:32:02 +08:00
|
|
|
ret = -EXDEV;
|
2021-07-27 18:48:52 +08:00
|
|
|
} else if (!inode_owner_or_capable(mnt_userns, src_inode)) {
|
2014-01-16 01:15:52 +08:00
|
|
|
/*
|
|
|
|
* Subvolume creation is not restricted, but snapshots
|
|
|
|
* are limited to own subvolumes only
|
|
|
|
*/
|
|
|
|
ret = -EPERM;
|
2012-08-27 09:20:24 +08:00
|
|
|
} else {
|
2021-07-27 18:48:52 +08:00
|
|
|
ret = btrfs_mksnapshot(&file->f_path, mnt_userns,
|
|
|
|
name, namelen,
|
|
|
|
BTRFS_I(src_inode)->root,
|
|
|
|
readonly, inherit);
|
2008-11-18 10:02:50 +08:00
|
|
|
}
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(src);
|
2008-10-10 01:39:39 +08:00
|
|
|
}
|
2012-06-29 17:58:46 +08:00
|
|
|
out_drop_write:
|
|
|
|
mnt_drop_write_file(file);
|
2008-06-12 09:53:53 +08:00
|
|
|
out:
|
2010-10-30 03:41:32 +08:00
|
|
|
return ret;
|
|
|
|
}
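The name checks performed before creating a subvolume or snapshot can be illustrated on their own. The helper name `check_subvol_name()` is hypothetical; the sketch only reproduces the two string tests from the function above (no path separators, and neither "." nor ".."), not the permission or mount checks.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Reject names containing '/' with -EINVAL and the special entries
 * "." and ".." with -EEXIST, as the function above does.
 */
static int check_subvol_name(const char *name)
{
	size_t namelen = strlen(name);

	if (strchr(name, '/'))
		return -EINVAL;

	if (name[0] == '.' &&
	    (namelen == 1 || (name[1] == '.' && namelen == 2)))
		return -EEXIST;

	return 0;
}
```

A name such as "..." or ".snap" passes, because only the exact one- and two-dot names are treated as already existing.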
|
|
|
|
|
|
|
|
static noinline int btrfs_ioctl_snap_create(struct file *file,
|
2010-12-20 15:53:28 +08:00
|
|
|
void __user *arg, int subvol)
|
2010-10-30 03:41:32 +08:00
|
|
|
{
|
2010-12-20 15:53:28 +08:00
|
|
|
struct btrfs_ioctl_vol_args *vol_args;
|
2010-10-30 03:41:32 +08:00
|
|
|
int ret;
|
|
|
|
|
2016-09-21 20:31:29 +08:00
|
|
|
if (!S_ISDIR(file_inode(file)->i_mode))
|
|
|
|
return -ENOTDIR;
|
|
|
|
|
2010-12-20 15:53:28 +08:00
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
|
|
|
if (IS_ERR(vol_args))
|
|
|
|
return PTR_ERR(vol_args);
|
|
|
|
vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
|
2010-10-30 03:41:32 +08:00
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
ret = __btrfs_ioctl_snap_create(file, file_mnt_user_ns(file),
|
|
|
|
vol_args->name, vol_args->fd, subvol,
|
|
|
|
false, NULL);
|
2010-12-10 14:41:56 +08:00
|
|
|
|
2010-12-20 15:53:28 +08:00
|
|
|
kfree(vol_args);
|
|
|
|
return ret;
|
|
|
|
}
|
2010-12-10 14:41:56 +08:00
|
|
|
|
2010-12-20 15:53:28 +08:00
|
|
|
static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
|
|
|
|
void __user *arg, int subvol)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_vol_args_v2 *vol_args;
|
|
|
|
int ret;
|
2010-12-20 16:04:08 +08:00
|
|
|
bool readonly = false;
|
2011-09-14 21:58:21 +08:00
|
|
|
struct btrfs_qgroup_inherit *inherit = NULL;
|
2010-12-10 08:36:28 +08:00
|
|
|
|
2016-09-21 20:31:29 +08:00
|
|
|
if (!S_ISDIR(file_inode(file)->i_mode))
|
|
|
|
return -ENOTDIR;
|
|
|
|
|
2010-12-20 15:53:28 +08:00
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
|
|
|
if (IS_ERR(vol_args))
|
|
|
|
return PTR_ERR(vol_args);
|
|
|
|
vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
|
2010-12-10 08:36:28 +08:00
|
|
|
|
2020-02-21 20:24:37 +08:00
|
|
|
if (vol_args->flags & ~BTRFS_SUBVOL_CREATE_ARGS_MASK) {
|
2010-12-20 16:04:08 +08:00
|
|
|
ret = -EOPNOTSUPP;
|
2014-09-04 19:09:15 +08:00
|
|
|
goto free_args;
|
2010-10-30 03:41:32 +08:00
|
|
|
}
|
2010-12-20 15:53:28 +08:00
|
|
|
|
2010-12-20 16:04:08 +08:00
|
|
|
if (vol_args->flags & BTRFS_SUBVOL_RDONLY)
|
|
|
|
readonly = true;
|
2011-09-14 21:58:21 +08:00
|
|
|
if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) {
|
2021-02-17 14:04:34 +08:00
|
|
|
u64 nums;
|
|
|
|
|
|
|
|
if (vol_args->size < sizeof(*inherit) ||
|
|
|
|
vol_args->size > PAGE_SIZE) {
|
2011-09-14 21:58:21 +08:00
|
|
|
ret = -EINVAL;
|
2014-09-04 19:09:15 +08:00
|
|
|
goto free_args;
|
2011-09-14 21:58:21 +08:00
|
|
|
}
|
|
|
|
inherit = memdup_user(vol_args->qgroup_inherit, vol_args->size);
|
|
|
|
if (IS_ERR(inherit)) {
|
|
|
|
ret = PTR_ERR(inherit);
|
2014-09-04 19:09:15 +08:00
|
|
|
goto free_args;
|
2011-09-14 21:58:21 +08:00
|
|
|
}
|
2021-02-17 14:04:34 +08:00
|
|
|
|
|
|
|
if (inherit->num_qgroups > PAGE_SIZE ||
|
|
|
|
inherit->num_ref_copies > PAGE_SIZE ||
|
|
|
|
inherit->num_excl_copies > PAGE_SIZE) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto free_inherit;
|
|
|
|
}
|
|
|
|
|
|
|
|
nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
|
|
|
|
2 * inherit->num_excl_copies;
|
|
|
|
if (vol_args->size != struct_size(inherit, qgroups, nums)) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto free_inherit;
|
|
|
|
}
|
2011-09-14 21:58:21 +08:00
|
|
|
}
|
2010-12-20 15:53:28 +08:00
|
|
|
|
2021-07-27 18:48:52 +08:00
|
|
|
ret = __btrfs_ioctl_snap_create(file, file_mnt_user_ns(file),
|
|
|
|
vol_args->name, vol_args->fd, subvol,
|
|
|
|
readonly, inherit);
|
2014-09-04 19:09:15 +08:00
|
|
|
if (ret)
|
|
|
|
goto free_inherit;
|
|
|
|
free_inherit:
|
2011-09-14 21:58:21 +08:00
|
|
|
kfree(inherit);
|
2014-09-04 19:09:15 +08:00
|
|
|
free_args:
|
|
|
|
kfree(vol_args);
|
2008-06-12 09:53:53 +08:00
|
|
|
return ret;
|
|
|
|
}
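The qgroup inherit validation above follows a common pattern for user-supplied variable-length structures: bound the total size, bound each count, then require the size to match the counts exactly. The sketch below is an assumption-laden illustration: `struct inherit_hdr` is a simplified, hypothetical stand-in (not the real `struct btrfs_qgroup_inherit` layout), and `check_inherit_size()` is a made-up name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in: fixed header followed by a flexible array of ids. */
struct inherit_hdr {
	uint64_t flags;
	uint64_t num_qgroups;
	uint64_t num_ref_copies;
	uint64_t num_excl_copies;
	uint64_t qgroups[];
};

static int check_inherit_size(const struct inherit_hdr *h,
			      uint64_t user_size, uint64_t page_size)
{
	uint64_t nums;

	/* The buffer must hold at least the header and at most one page. */
	if (user_size < sizeof(*h) || user_size > page_size)
		return -EINVAL;

	/* Bounding each count keeps the sum below from overflowing. */
	if (h->num_qgroups > page_size ||
	    h->num_ref_copies > page_size ||
	    h->num_excl_copies > page_size)
		return -EINVAL;

	/* Copy entries come in (source, destination) pairs, hence the factor 2. */
	nums = h->num_qgroups + 2 * h->num_ref_copies +
	       2 * h->num_excl_copies;

	/* Equivalent of struct_size(h, qgroups, nums). */
	if (user_size != sizeof(*h) + nums * sizeof(h->qgroups[0]))
		return -EINVAL;

	return 0;
}
```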
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
static noinline int btrfs_ioctl_subvol_getflags(struct inode *inode,
|
2010-12-20 16:30:25 +08:00
|
|
|
void __user *arg)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2010-12-20 16:30:25 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
int ret = 0;
|
|
|
|
u64 flags = 0;
|
|
|
|
|
2017-01-11 02:35:31 +08:00
|
|
|
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID)
|
2010-12-20 16:30:25 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
down_read(&fs_info->subvol_sem);
|
2010-12-20 16:30:25 +08:00
|
|
|
if (btrfs_root_readonly(root))
|
|
|
|
flags |= BTRFS_SUBVOL_RDONLY;
|
2016-06-23 06:54:23 +08:00
|
|
|
up_read(&fs_info->subvol_sem);
|
2010-12-20 16:30:25 +08:00
|
|
|
|
|
|
|
if (copy_to_user(arg, &flags, sizeof(flags)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static noinline int btrfs_ioctl_subvol_setflags(struct file *file,
|
|
|
|
void __user *arg)
|
|
|
|
{
|
2013-01-24 06:07:38 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2010-12-20 16:30:25 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 root_flags;
|
|
|
|
u64 flags;
|
|
|
|
int ret = 0;
|
|
|
|
|
2021-07-27 18:48:56 +08:00
|
|
|
if (!inode_owner_or_capable(file_mnt_user_ns(file), inode))
|
2014-01-16 22:50:22 +08:00
|
|
|
return -EPERM;
|
|
|
|
|
2012-06-29 17:58:49 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2010-12-20 16:30:25 +08:00
|
|
|
|
2017-01-11 02:35:31 +08:00
|
|
|
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
|
2012-06-29 17:58:49 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
2010-12-20 16:30:25 +08:00
|
|
|
|
2012-06-29 17:58:49 +08:00
|
|
|
if (copy_from_user(&flags, arg, sizeof(flags))) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
2010-12-20 16:30:25 +08:00
|
|
|
|
2012-06-29 17:58:49 +08:00
|
|
|
if (flags & ~BTRFS_SUBVOL_RDONLY) {
|
|
|
|
ret = -EOPNOTSUPP;
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
2010-12-20 16:30:25 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
down_write(&fs_info->subvol_sem);
|
2010-12-20 16:30:25 +08:00
|
|
|
|
|
|
|
/* nothing to do */
|
|
|
|
if (!!(flags & BTRFS_SUBVOL_RDONLY) == btrfs_root_readonly(root))
|
2012-06-29 17:58:49 +08:00
|
|
|
goto out_drop_sem;
|
2010-12-20 16:30:25 +08:00
|
|
|
|
|
|
|
root_flags = btrfs_root_flags(&root->root_item);
|
2013-12-17 00:34:17 +08:00
|
|
|
if (flags & BTRFS_SUBVOL_RDONLY) {
|
2010-12-20 16:30:25 +08:00
|
|
|
btrfs_set_root_flags(&root->root_item,
|
|
|
|
root_flags | BTRFS_ROOT_SUBVOL_RDONLY);
|
2013-12-17 00:34:17 +08:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Block RO -> RW transition if this subvolume is involved in
|
|
|
|
* send
|
|
|
|
*/
|
|
|
|
spin_lock(&root->root_item_lock);
|
|
|
|
if (root->send_in_progress == 0) {
|
|
|
|
btrfs_set_root_flags(&root->root_item,
|
2010-12-20 16:30:25 +08:00
|
|
|
root_flags & ~BTRFS_ROOT_SUBVOL_RDONLY);
|
2013-12-17 00:34:17 +08:00
|
|
|
spin_unlock(&root->root_item_lock);
|
|
|
|
} else {
|
|
|
|
spin_unlock(&root->root_item_lock);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"Attempt to set subvolume %llu read-write during send",
|
|
|
|
root->root_key.objectid);
|
2013-12-17 00:34:17 +08:00
|
|
|
ret = -EPERM;
|
|
|
|
goto out_drop_sem;
|
|
|
|
}
|
|
|
|
}
|
2010-12-20 16:30:25 +08:00
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 1);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto out_reset;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_update_root(trans, fs_info->tree_root,
|
2010-12-20 16:30:25 +08:00
|
|
|
&root->root_key, &root->root_item);
|
2017-09-28 15:53:17 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
goto out_reset;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = btrfs_commit_transaction(trans);
|
2010-12-20 16:30:25 +08:00
|
|
|
|
|
|
|
out_reset:
|
|
|
|
if (ret)
|
|
|
|
btrfs_set_root_flags(&root->root_item, root_flags);
|
2012-06-29 17:58:49 +08:00
|
|
|
out_drop_sem:
|
2016-06-23 06:54:23 +08:00
|
|
|
up_write(&fs_info->subvol_sem);
|
2012-06-29 17:58:49 +08:00
|
|
|
out_drop_write:
|
|
|
|
mnt_drop_write_file(file);
|
|
|
|
out:
|
2010-12-20 16:30:25 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-03-01 04:39:26 +08:00
|
|
|
static noinline int key_in_sk(struct btrfs_key *key,
|
|
|
|
struct btrfs_ioctl_search_key *sk)
|
|
|
|
{
|
2010-03-19 00:10:08 +08:00
|
|
|
struct btrfs_key test;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
test.objectid = sk->min_objectid;
|
|
|
|
test.type = sk->min_type;
|
|
|
|
test.offset = sk->min_offset;
|
|
|
|
|
|
|
|
ret = btrfs_comp_cpu_keys(key, &test);
|
|
|
|
if (ret < 0)
|
2010-03-01 04:39:26 +08:00
|
|
|
return 0;
|
2010-03-19 00:10:08 +08:00
|
|
|
|
|
|
|
test.objectid = sk->max_objectid;
|
|
|
|
test.type = sk->max_type;
|
|
|
|
test.offset = sk->max_offset;
|
|
|
|
|
|
|
|
ret = btrfs_comp_cpu_keys(key, &test);
|
|
|
|
if (ret > 0)
|
2010-03-01 04:39:26 +08:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
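The range test in `key_in_sk()` treats a key as an ordered (objectid, type, offset) tuple and checks it against a minimum and a maximum tuple. A minimal userspace sketch follows, with a hypothetical `struct search_key` and helper names that are not part of the original source; `comp_keys()` is assumed to order keys the way the kernel's key comparison does, by objectid, then type, then offset.

```c
#include <assert.h>
#include <stdint.h>

struct search_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Compare two keys field by field; returns -1, 0 or 1. */
static int comp_keys(const struct search_key *a, const struct search_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Return 1 if min <= key <= max in tuple order, 0 otherwise. */
static int key_in_range(const struct search_key *key,
			const struct search_key *min,
			const struct search_key *max)
{
	if (comp_keys(key, min) < 0)
		return 0;
	if (comp_keys(key, max) > 0)
		return 0;
	return 1;
}
```

Because the comparison is on the whole tuple, a key with objectid 10 and type 1 falls outside a range whose maximum is (10, 0, 0), even though its objectid alone is within bounds.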
|
|
|
|
|
2016-06-22 08:18:21 +08:00
|
|
|
static noinline int copy_to_sk(struct btrfs_path *path,
|
2010-03-01 04:39:26 +08:00
|
|
|
struct btrfs_key *key,
|
|
|
|
struct btrfs_ioctl_search_key *sk,
|
2014-01-30 23:24:00 +08:00
|
|
|
size_t *buf_size,
|
2014-01-30 23:24:02 +08:00
|
|
|
char __user *ubuf,
|
2010-03-01 04:39:26 +08:00
|
|
|
unsigned long *sk_offset,
|
|
|
|
int *num_found)
|
|
|
|
{
|
|
|
|
u64 found_transid;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct btrfs_ioctl_search_header sh;
|
2015-06-30 10:25:43 +08:00
|
|
|
struct btrfs_key test;
|
2010-03-01 04:39:26 +08:00
|
|
|
unsigned long item_off;
|
|
|
|
unsigned long item_len;
|
|
|
|
int nritems;
|
|
|
|
int i;
|
|
|
|
int slot;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
nritems = btrfs_header_nritems(leaf);
|
|
|
|
|
|
|
|
if (btrfs_header_generation(leaf) > sk->max_transid) {
|
|
|
|
i = nritems;
|
|
|
|
goto advance_key;
|
|
|
|
}
|
|
|
|
found_transid = btrfs_header_generation(leaf);
|
|
|
|
|
|
|
|
for (i = slot; i < nritems; i++) {
|
|
|
|
item_off = btrfs_item_ptr_offset(leaf, i);
|
2021-10-22 02:58:35 +08:00
|
|
|
item_len = btrfs_item_size(leaf, i);
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2013-05-07 01:40:18 +08:00
|
|
|
btrfs_item_key_to_cpu(leaf, key, i);
|
|
|
|
if (!key_in_sk(key, sk))
|
|
|
|
continue;
|
|
|
|
|
2014-01-30 23:24:00 +08:00
|
|
|
if (sizeof(sh) + item_len > *buf_size) {
|
2014-01-30 23:23:59 +08:00
|
|
|
if (*num_found) {
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* return one empty item back for v1, which does not
|
|
|
|
* handle -EOVERFLOW
|
|
|
|
*/
|
|
|
|
|
2014-01-30 23:24:00 +08:00
|
|
|
*buf_size = sizeof(sh) + item_len;
|
2010-03-01 04:39:26 +08:00
|
|
|
item_len = 0;
|
2014-01-30 23:23:59 +08:00
|
|
|
ret = -EOVERFLOW;
|
|
|
|
}
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2014-01-30 23:24:00 +08:00
|
|
|
if (sizeof(sh) + item_len + *sk_offset > *buf_size) {
|
2010-03-01 04:39:26 +08:00
|
|
|
ret = 1;
|
2014-01-30 23:23:57 +08:00
|
|
|
goto out;
|
2010-03-01 04:39:26 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
sh.objectid = key->objectid;
|
|
|
|
sh.offset = key->offset;
|
|
|
|
sh.type = key->type;
|
|
|
|
sh.len = item_len;
|
|
|
|
sh.transid = found_transid;
|
|
|
|
|
2020-08-10 23:42:27 +08:00
|
|
|
/*
|
|
|
|
* Copy search result header. If we fault then loop again so we
|
|
|
|
* can fault in the pages and -EFAULT there if there's a
|
|
|
|
* problem. Otherwise we'll fault and then copy the buffer in
|
|
|
|
* properly this next time through
|
|
|
|
*/
|
|
|
|
if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
|
|
|
|
ret = 0;
|
2014-01-30 23:24:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-03-01 04:39:26 +08:00
|
|
|
*sk_offset += sizeof(sh);
|
|
|
|
|
|
|
|
if (item_len) {
|
2014-01-30 23:24:02 +08:00
|
|
|
char __user *up = ubuf + *sk_offset;
|
2020-08-10 23:42:27 +08:00
|
|
|
/*
|
|
|
|
* Copy the item, same behavior as above, but reset the
|
|
|
|
* *sk_offset so we copy the full thing again.
|
|
|
|
*/
|
|
|
|
if (read_extent_buffer_to_user_nofault(leaf, up,
|
|
|
|
item_off, item_len)) {
|
|
|
|
ret = 0;
|
|
|
|
*sk_offset -= sizeof(sh);
|
2014-01-30 23:24:02 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-03-01 04:39:26 +08:00
|
|
|
*sk_offset += item_len;
|
|
|
|
}
|
2011-05-15 01:43:41 +08:00
|
|
|
(*num_found)++;
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2014-01-30 23:23:59 +08:00
|
|
|
if (ret) /* -EOVERFLOW from above */
|
|
|
|
goto out;
|
|
|
|
|
2014-01-30 23:23:57 +08:00
|
|
|
if (*num_found >= sk->nr_items) {
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-03-01 04:39:26 +08:00
|
|
|
}
|
|
|
|
advance_key:
|
2010-03-19 00:10:08 +08:00
|
|
|
ret = 0;
|
2015-06-30 10:25:43 +08:00
|
|
|
test.objectid = sk->max_objectid;
|
|
|
|
test.type = sk->max_type;
|
|
|
|
test.offset = sk->max_offset;
|
|
|
|
if (btrfs_comp_cpu_keys(key, &test) >= 0)
|
|
|
|
ret = 1;
|
|
|
|
else if (key->offset < (u64)-1)
|
2010-03-01 04:39:26 +08:00
|
|
|
key->offset++;
|
2015-06-30 10:25:43 +08:00
|
|
|
else if (key->type < (u8)-1) {
|
2010-03-19 00:10:08 +08:00
|
|
|
key->offset = 0;
|
2010-03-01 04:39:26 +08:00
|
|
|
key->type++;
|
2015-06-30 10:25:43 +08:00
|
|
|
} else if (key->objectid < (u64)-1) {
|
2010-03-19 00:10:08 +08:00
|
|
|
key->offset = 0;
|
|
|
|
key->type = 0;
|
2010-03-01 04:39:26 +08:00
|
|
|
key->objectid++;
|
2010-03-19 00:10:08 +08:00
|
|
|
} else
|
|
|
|
ret = 1;
|
2014-01-30 23:23:57 +08:00
|
|
|
out:
|
2014-01-30 23:24:02 +08:00
|
|
|
/*
|
|
|
|
* 0: all items from this leaf copied, continue with next
|
|
|
|
* 1: * more items can be copied, but unused buffer is too small
|
|
|
|
* * all items were found
|
|
|
|
* Either way, it stops the loop which iterates to the next
|
|
|
|
* leaf
|
|
|
|
* -EOVERFLOW: item was too large for buffer
|
|
|
|
* -EFAULT: could not copy extent buffer back to userspace
|
|
|
|
*/
|
2010-03-01 04:39:26 +08:00
|
|
|
return ret;
|
|
|
|
}
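The `advance_key` step at the end of `copy_to_sk()` moves the search key to the next possible key in tuple order, carrying from offset into type and from type into objectid. The sketch below isolates just that carry logic; `struct search_key` and `advance_key()` are hypothetical names, and the comparison against the search maximum that the original performs first is left out.

```c
#include <assert.h>
#include <stdint.h>

struct search_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Step to the smallest key greater than *key.
 * Returns 0 on success, 1 if the key space is exhausted.
 */
static int advance_key(struct search_key *key)
{
	if (key->offset < UINT64_MAX) {
		key->offset++;
	} else if (key->type < UINT8_MAX) {
		key->offset = 0;
		key->type++;
	} else if (key->objectid < UINT64_MAX) {
		key->offset = 0;
		key->type = 0;
		key->objectid++;
	} else {
		return 1;	/* already at the largest possible key */
	}
	return 0;
}
```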
|
|
|
|
|
|
|
|
static noinline int search_ioctl(struct inode *inode,
|
2014-01-30 23:23:58 +08:00
|
|
|
struct btrfs_ioctl_search_key *sk,
|
2014-01-30 23:24:00 +08:00
|
|
|
size_t *buf_size,
|
2014-01-30 23:24:02 +08:00
|
|
|
char __user *ubuf)
|
2010-03-01 04:39:26 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *info = btrfs_sb(inode->i_sb);
|
2010-03-01 04:39:26 +08:00
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
int ret;
|
|
|
|
int num_found = 0;
|
|
|
|
unsigned long sk_offset = 0;
|
|
|
|
|
2014-01-30 23:24:00 +08:00
|
|
|
if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
|
|
|
|
*buf_size = sizeof(struct btrfs_ioctl_search_header);
|
2014-01-30 23:23:58 +08:00
|
|
|
return -EOVERFLOW;
|
2014-01-30 23:24:00 +08:00
|
|
|
}
|
2014-01-30 23:23:58 +08:00
|
|
|
|
2010-03-01 04:39:26 +08:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
if (sk->tree_id == 0) {
|
|
|
|
/* search the root of the inode that was passed */
|
2020-01-24 22:33:01 +08:00
|
|
|
root = btrfs_grab_root(BTRFS_I(inode)->root);
|
2010-03-01 04:39:26 +08:00
|
|
|
} else {
|
2020-05-16 01:35:55 +08:00
|
|
|
root = btrfs_get_fs_root(info, sk->tree_id, true);
|
2010-03-01 04:39:26 +08:00
|
|
|
if (IS_ERR(root)) {
|
|
|
|
btrfs_free_path(path);
|
2018-05-21 12:57:27 +08:00
|
|
|
return PTR_ERR(root);
|
2010-03-01 04:39:26 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = sk->min_objectid;
|
|
|
|
key.type = sk->min_type;
|
|
|
|
key.offset = sk->min_offset;
|
|
|
|
|
2013-10-31 13:03:04 +08:00
|
|
|
while (1) {
|
2021-08-02 19:44:20 +08:00
|
|
|
ret = -EFAULT;
|
2022-04-23 18:07:51 +08:00
|
|
|
/*
|
|
|
|
* Ensure that the whole user buffer is faulted in at sub-page
|
|
|
|
* granularity, otherwise the loop may live-lock.
|
|
|
|
*/
|
|
|
|
if (fault_in_subpage_writeable(ubuf + sk_offset,
|
|
|
|
*buf_size - sk_offset))
|
2020-08-10 23:42:27 +08:00
|
|
|
break;
|
|
|
|
|
2013-10-01 23:13:42 +08:00
|
|
|
ret = btrfs_search_forward(root, &key, path, sk->min_transid);
|
2010-03-01 04:39:26 +08:00
|
|
|
if (ret != 0) {
|
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
|
|
|
goto err;
|
|
|
|
}
|
2016-06-22 08:18:21 +08:00
|
|
|
ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
|
2010-03-01 04:39:26 +08:00
|
|
|
&sk_offset, &num_found);
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2014-01-30 23:23:57 +08:00
|
|
|
if (ret)
|
2010-03-01 04:39:26 +08:00
|
|
|
break;
|
|
|
|
|
|
|
|
}
|
2014-01-30 23:23:59 +08:00
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
2010-03-01 04:39:26 +08:00
|
|
|
err:
|
|
|
|
sk->nr_items = num_found;
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(root);
|
2010-03-01 04:39:26 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
static noinline int btrfs_ioctl_tree_search(struct inode *inode,
|
|
|
|
void __user *argp)
|
2010-03-01 04:39:26 +08:00
|
|
|
{
|
2022-03-31 18:34:08 +08:00
|
|
|
struct btrfs_ioctl_search_args __user *uargs = argp;
|
2014-01-30 23:24:02 +08:00
|
|
|
struct btrfs_ioctl_search_key sk;
|
2014-01-30 23:24:00 +08:00
|
|
|
int ret;
|
|
|
|
size_t buf_size;
|
2010-03-01 04:39:26 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2014-01-30 23:24:02 +08:00
|
|
|
if (copy_from_user(&sk, &uargs->key, sizeof(sk)))
|
|
|
|
return -EFAULT;
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2014-01-30 23:24:02 +08:00
|
|
|
buf_size = sizeof(uargs->buf);
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2014-01-30 23:24:02 +08:00
|
|
|
ret = search_ioctl(inode, &sk, &buf_size, uargs->buf);
|
2014-01-30 23:23:59 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In the original implementation an overflow is handled by returning a
|
|
|
|
* search header with a len of zero, so reset ret.
|
|
|
|
*/
|
|
|
|
if (ret == -EOVERFLOW)
|
|
|
|
ret = 0;
|
|
|
|
|
2014-01-30 23:24:02 +08:00
|
|
|
if (ret == 0 && copy_to_user(&uargs->key, &sk, sizeof(sk)))
|
2010-03-01 04:39:26 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
static noinline int btrfs_ioctl_tree_search_v2(struct inode *inode,
|
2014-01-30 23:24:03 +08:00
|
|
|
void __user *argp)
|
|
|
|
{
|
2022-03-31 18:34:08 +08:00
|
|
|
struct btrfs_ioctl_search_args_v2 __user *uarg = argp;
|
2014-01-30 23:24:03 +08:00
|
|
|
struct btrfs_ioctl_search_args_v2 args;
|
|
|
|
int ret;
|
|
|
|
size_t buf_size;
|
2015-12-15 00:42:10 +08:00
|
|
|
const size_t buf_limit = SZ_16M;
|
2014-01-30 23:24:03 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
/* copy search header and buffer size */
|
|
|
|
if (copy_from_user(&args, uarg, sizeof(args)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf_size = args.buf_size;
|
|
|
|
|
|
|
|
/* limit result size to 16MB */
|
|
|
|
if (buf_size > buf_limit)
|
|
|
|
buf_size = buf_limit;
|
|
|
|
|
|
|
|
ret = search_ioctl(inode, &args.key, &buf_size,
|
2017-08-23 14:46:05 +08:00
|
|
|
(char __user *)(&uarg->buf[0]));
|
2014-01-30 23:24:03 +08:00
|
|
|
if (ret == 0 && copy_to_user(&uarg->key, &args.key, sizeof(args.key)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
else if (ret == -EOVERFLOW &&
|
|
|
|
copy_to_user(&uarg->buf_size, &buf_size, sizeof(buf_size)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
2010-03-01 04:39:26 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-11-18 13:42:14 +08:00
|
|
|
/*
|
2010-03-01 04:39:26 +08:00
|
|
|
* Search INODE_REFs to identify path name of 'dirid' directory
|
|
|
|
* in a 'tree_id' tree and store the path name in 'name'.
|
|
|
|
*/
|
2009-11-18 13:42:14 +08:00
|
|
|
static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
|
|
|
|
u64 tree_id, u64 dirid, char *name)
|
|
|
|
{
|
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_key key;
|
2010-03-01 04:39:26 +08:00
|
|
|
char *ptr;
|
2009-11-18 13:42:14 +08:00
|
|
|
int ret = -1;
|
|
|
|
int slot;
|
|
|
|
int len;
|
|
|
|
int total_len = 0;
|
|
|
|
struct btrfs_inode_ref *iref;
|
|
|
|
struct extent_buffer *l;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
|
|
|
|
if (dirid == BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
name[0] = '\0';
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-12-01 17:19:42 +08:00
|
|
|
ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX - 1];
|
2009-11-18 13:42:14 +08:00
|
|
|
|
2020-05-16 01:35:55 +08:00
|
|
|
root = btrfs_get_fs_root(info, tree_id, true);
|
2009-11-18 13:42:14 +08:00
|
|
|
if (IS_ERR(root)) {
|
2018-05-21 12:57:27 +08:00
|
|
|
ret = PTR_ERR(root);
|
2020-01-24 22:32:34 +08:00
|
|
|
root = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
2009-11-18 13:42:14 +08:00
|
|
|
|
|
|
|
key.objectid = dirid;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
2010-03-19 00:23:10 +08:00
|
|
|
key.offset = (u64)-1;
|
2009-11-18 13:42:14 +08:00
|
|
|
|
2013-10-31 13:03:04 +08:00
|
|
|
while (1) {
|
2021-07-29 16:22:16 +08:00
|
|
|
ret = btrfs_search_backwards(root, &key, path);
|
2009-11-18 13:42:14 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
2013-08-14 10:00:21 +08:00
|
|
|
else if (ret > 0) {
|
2021-07-29 16:22:16 +08:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
2013-08-14 10:00:21 +08:00
|
|
|
}
|
2009-11-18 13:42:14 +08:00
|
|
|
|
|
|
|
l = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
|
|
|
|
iref = btrfs_item_ptr(l, slot, struct btrfs_inode_ref);
|
|
|
|
len = btrfs_inode_ref_name_len(l, iref);
|
|
|
|
ptr -= len + 1;
|
|
|
|
total_len += len + 1;
|
2013-08-14 10:00:20 +08:00
|
|
|
if (ptr < name) {
|
|
|
|
ret = -ENAMETOOLONG;
|
2009-11-18 13:42:14 +08:00
|
|
|
goto out;
|
2013-08-14 10:00:20 +08:00
|
|
|
}
|
2009-11-18 13:42:14 +08:00
|
|
|
|
|
|
|
*(ptr + len) = '/';
|
2013-10-31 13:03:04 +08:00
|
|
|
read_extent_buffer(l, ptr, (unsigned long)(iref + 1), len);
|
2009-11-18 13:42:14 +08:00
|
|
|
|
|
|
|
if (key.offset == BTRFS_FIRST_FREE_OBJECTID)
|
|
|
|
break;
|
|
|
|
|
2011-04-21 07:20:15 +08:00
|
|
|
btrfs_release_path(path);
|
2009-11-18 13:42:14 +08:00
|
|
|
key.objectid = key.offset;
|
2010-03-19 00:23:10 +08:00
|
|
|
key.offset = (u64)-1;
|
2009-11-18 13:42:14 +08:00
|
|
|
dirid = key.objectid;
|
|
|
|
}
|
2011-07-14 11:16:00 +08:00
|
|
|
memmove(name, ptr, total_len);
|
2013-10-31 13:03:04 +08:00
|
|
|
name[total_len] = '\0';
|
2009-11-18 13:42:14 +08:00
|
|
|
ret = 0;
|
|
|
|
out:
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(root);
|
2009-11-18 13:42:14 +08:00
|
|
|
btrfs_free_path(path);
|
2010-03-01 04:39:26 +08:00
|
|
|
return ret;
|
|
|
|
}
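`btrfs_search_path_in_tree()` builds the path from the leaf directory up to the subvolume root, so it fills the buffer from the end toward the front and moves the result to the start once the walk is done. The following userspace sketch shows that technique with plain strings standing in for INODE_REF names; `build_path()` and its parameters are hypothetical, and the components are assumed to be supplied leaf first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Build "root/.../leaf/" in buf from components given leaf first.
 * Each component is written right to left, followed by a '/'.
 */
static int build_path(char *buf, size_t bufsize,
		      const char *const *leaf_to_root, size_t count)
{
	char *ptr = buf + bufsize - 1;	/* last byte is kept for the NUL */
	size_t total_len = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		size_t len = strlen(leaf_to_root[i]);

		/* Check for room before moving the pointer backwards. */
		if ((size_t)(ptr - buf) < len + 1)
			return -ENAMETOOLONG;

		ptr -= len + 1;
		total_len += len + 1;
		ptr[len] = '/';
		memcpy(ptr, leaf_to_root[i], len);
	}

	/* Shift the assembled path to the start of the buffer. */
	memmove(buf, ptr, total_len);
	buf[total_len] = '\0';
	return 0;
}
```

As in the original, every component is followed by a separator, so the resulting path carries a trailing '/'. Unlike the original, this sketch tests the remaining space before decrementing the pointer, which avoids forming a pointer before the start of the buffer.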
|
|
|
|
|
2021-07-27 18:48:57 +08:00
|
|
|
static int btrfs_search_path_in_tree_user(struct user_namespace *mnt_userns,
|
|
|
|
struct inode *inode,
|
2018-05-21 09:09:44 +08:00
|
|
|
struct btrfs_ioctl_ino_lookup_user_args *args)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
|
|
|
|
struct super_block *sb = inode->i_sb;
|
|
|
|
struct btrfs_key upper_limit = BTRFS_I(inode)->location;
|
|
|
|
u64 treeid = BTRFS_I(inode)->root->root_key.objectid;
|
|
|
|
u64 dirid = args->dirid;
|
|
|
|
unsigned long item_off;
|
|
|
|
unsigned long item_len;
|
|
|
|
struct btrfs_inode_ref *iref;
|
|
|
|
struct btrfs_root_ref *rref;
|
2020-01-24 22:32:35 +08:00
|
|
|
struct btrfs_root *root = NULL;
|
2018-05-21 09:09:44 +08:00
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key, key2;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct inode *temp_inode;
|
|
|
|
char *ptr;
|
|
|
|
int slot;
|
|
|
|
int len;
|
|
|
|
int total_len = 0;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the bottom subvolume does not exist directly under upper_limit,
|
|
|
|
* construct the path from the bottom up.
|
|
|
|
*/
|
|
|
|
if (dirid != upper_limit.objectid) {
|
|
|
|
ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1];
|
|
|
|
|
2020-05-16 01:35:55 +08:00
|
|
|
root = btrfs_get_fs_root(fs_info, treeid, true);
|
2018-05-21 09:09:44 +08:00
|
|
|
if (IS_ERR(root)) {
|
|
|
|
ret = PTR_ERR(root);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
key.objectid = dirid;
|
|
|
|
key.type = BTRFS_INODE_REF_KEY;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
while (1) {
|
2021-07-29 16:22:16 +08:00
|
|
|
ret = btrfs_search_backwards(root, &key, path);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_put;
|
|
|
|
else if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
|
|
|
|
iref = btrfs_item_ptr(leaf, slot, struct btrfs_inode_ref);
|
|
|
|
len = btrfs_inode_ref_name_len(leaf, iref);
|
|
|
|
ptr -= len + 1;
|
|
|
|
total_len += len + 1;
|
|
|
|
if (ptr < args->path) {
|
|
|
|
ret = -ENAMETOOLONG;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
*(ptr + len) = '/';
|
|
|
|
read_extent_buffer(leaf, ptr,
|
|
|
|
(unsigned long)(iref + 1), len);
|
|
|
|
|
|
|
|
/* Check the read+exec permission of this directory */
|
|
|
|
ret = btrfs_previous_item(root, path, dirid,
|
|
|
|
BTRFS_INODE_ITEM_KEY);
|
|
|
|
if (ret < 0) {
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key2, slot);
|
|
|
|
if (key2.objectid != dirid) {
|
|
|
|
ret = -ENOENT;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
2020-05-16 01:35:59 +08:00
|
|
|
temp_inode = btrfs_iget(sb, key2.objectid, root);
|
2018-06-04 15:41:07 +08:00
|
|
|
if (IS_ERR(temp_inode)) {
|
|
|
|
ret = PTR_ERR(temp_inode);
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-06-04 15:41:07 +08:00
|
|
|
}
|
2021-07-27 18:48:57 +08:00
|
|
|
ret = inode_permission(mnt_userns, temp_inode,
|
2021-01-21 21:19:24 +08:00
|
|
|
MAY_READ | MAY_EXEC);
|
2018-05-21 09:09:44 +08:00
|
|
|
iput(temp_inode);
|
|
|
|
if (ret) {
|
|
|
|
ret = -EACCES;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (key.offset == upper_limit.objectid)
|
|
|
|
break;
|
|
|
|
if (key.objectid == BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
ret = -EACCES;
|
2020-01-24 22:32:35 +08:00
|
|
|
goto out_put;
|
2018-05-21 09:09:44 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_release_path(path);
|
|
|
|
key.objectid = key.offset;
|
|
|
|
key.offset = (u64)-1;
|
|
|
|
dirid = key.objectid;
|
|
|
|
}
|
|
|
|
|
|
|
|
memmove(args->path, ptr, total_len);
|
|
|
|
args->path[total_len] = '\0';
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(root);
|
2020-01-24 22:32:35 +08:00
|
|
|
root = NULL;
|
2018-05-21 09:09:44 +08:00
|
|
|
btrfs_release_path(path);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Get the bottom subvolume's name from ROOT_REF */
|
|
|
|
key.objectid = treeid;
|
|
|
|
key.type = BTRFS_ROOT_REF_KEY;
|
|
|
|
key.offset = args->treeid;
|
2020-01-24 22:32:35 +08:00
|
|
|
ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);
|
2018-05-21 09:09:44 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
|
|
|
|
item_off = btrfs_item_ptr_offset(leaf, slot);
|
2021-10-22 02:58:35 +08:00
|
|
|
item_len = btrfs_item_size(leaf, slot);
|
2018-05-21 09:09:44 +08:00
|
|
|
/* Check if dirid in ROOT_REF corresponds to passed dirid */
|
|
|
|
rref = btrfs_item_ptr(leaf, slot, struct btrfs_root_ref);
|
|
|
|
if (args->dirid != btrfs_root_ref_dirid(leaf, rref)) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Copy subvolume's name */
|
|
|
|
item_off += sizeof(struct btrfs_root_ref);
|
|
|
|
item_len -= sizeof(struct btrfs_root_ref);
|
|
|
|
read_extent_buffer(leaf, args->name, item_off, item_len);
|
|
|
|
args->name[item_len] = 0;
|
|
|
|
|
2020-01-24 22:32:35 +08:00
|
|
|
out_put:
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(root);
|
2018-05-21 09:09:44 +08:00
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-01-05 16:30:06 +08:00
|
|
|
static noinline int btrfs_ioctl_ino_lookup(struct btrfs_root *root,
|
2010-03-01 04:39:26 +08:00
|
|
|
void __user *argp)
|
|
|
|
{
|
2018-06-21 01:03:31 +08:00
|
|
|
struct btrfs_ioctl_ino_lookup_args *args;
|
2015-05-13 01:14:49 +08:00
|
|
|
int ret = 0;
|
2010-03-01 04:39:26 +08:00
|
|
|
|
2010-10-30 03:14:18 +08:00
|
|
|
args = memdup_user(argp, sizeof(*args));
|
|
|
|
if (IS_ERR(args))
|
|
|
|
return PTR_ERR(args);
|
2010-03-20 19:24:15 +08:00
|
|
|
|
2015-05-13 01:14:49 +08:00
|
|
|
/*
|
|
|
|
* Unprivileged query to obtain the containing subvolume root id. The
|
|
|
|
* path is reset so it's consistent with btrfs_search_path_in_tree.
|
|
|
|
*/
|
2010-03-19 00:17:05 +08:00
|
|
|
if (args->treeid == 0)
|
2022-01-05 16:30:06 +08:00
|
|
|
args->treeid = root->root_key.objectid;
|
2010-03-19 00:17:05 +08:00
|
|
|
|
2015-05-13 01:14:49 +08:00
|
|
|
if (args->objectid == BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
args->name[0] = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2022-01-05 16:30:06 +08:00
|
|
|
ret = btrfs_search_path_in_tree(root->fs_info,
|
2010-03-01 04:39:26 +08:00
|
|
|
args->treeid, args->objectid,
|
|
|
|
args->name);
|
|
|
|
|
2015-05-13 01:14:49 +08:00
|
|
|
out:
|
2010-03-01 04:39:26 +08:00
|
|
|
if (ret == 0 && copy_to_user(argp, args, sizeof(*args)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(args);
|
2009-11-18 13:42:14 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-21 09:09:44 +08:00
|
|
|
/*
|
|
|
|
* Version of ino_lookup ioctl (unprivileged)
|
|
|
|
*
|
|
|
|
* The main differences from ino_lookup ioctl are:
|
|
|
|
*
|
|
|
|
* 1. Read + Exec permission will be checked using inode_permission() during
|
|
|
|
* path construction. -EACCES will be returned in case of failure.
|
|
|
|
* 2. Path construction will be stopped at the inode number which corresponds
|
|
|
|
* to the fd with which this ioctl is called. If constructed path does not
|
|
|
|
* exist under fd's inode, -EACCES will be returned.
|
|
|
|
* 3. The name of bottom subvolume is also searched and filled.
|
|
|
|
*/
|
|
|
|
static int btrfs_ioctl_ino_lookup_user(struct file *file, void __user *argp)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_ino_lookup_user_args *args;
|
|
|
|
struct inode *inode;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
args = memdup_user(argp, sizeof(*args));
|
|
|
|
if (IS_ERR(args))
|
|
|
|
return PTR_ERR(args);
|
|
|
|
|
|
|
|
inode = file_inode(file);
|
|
|
|
|
|
|
|
if (args->dirid == BTRFS_FIRST_FREE_OBJECTID &&
|
|
|
|
BTRFS_I(inode)->location.objectid != BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
/*
|
|
|
|
* The subvolume does not exist under fd with which this is
|
|
|
|
* called
|
|
|
|
*/
|
|
|
|
kfree(args);
|
|
|
|
return -EACCES;
|
|
|
|
}
|
|
|
|
|
2021-07-27 18:48:57 +08:00
|
|
|
ret = btrfs_search_path_in_tree_user(file_mnt_user_ns(file), inode, args);
|
2018-05-21 09:09:44 +08:00
|
|
|
|
|
|
|
if (ret == 0 && copy_to_user(argp, args, sizeof(*args)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(args);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-21 09:09:42 +08:00
|
|
|
/* Get the subvolume information in BTRFS_ROOT_ITEM and BTRFS_ROOT_BACKREF */
|
2022-01-16 10:48:47 +08:00
|
|
|
static int btrfs_ioctl_get_subvol_info(struct inode *inode, void __user *argp)
|
2018-05-21 09:09:42 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_get_subvol_info_args *subvol_info;
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_root *root;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_root_item *root_item;
|
|
|
|
struct btrfs_root_ref *rref;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
unsigned long item_off;
|
|
|
|
unsigned long item_len;
|
|
|
|
int slot;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
subvol_info = kzalloc(sizeof(*subvol_info), GFP_KERNEL);
|
|
|
|
if (!subvol_info) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_info = BTRFS_I(inode)->root->fs_info;
|
|
|
|
|
|
|
|
/* Get root_item of inode's subvolume */
|
|
|
|
key.objectid = BTRFS_I(inode)->root->root_key.objectid;
|
2020-05-16 01:35:55 +08:00
|
|
|
root = btrfs_get_fs_root(fs_info, key.objectid, true);
|
2018-05-21 09:09:42 +08:00
|
|
|
if (IS_ERR(root)) {
|
|
|
|
ret = PTR_ERR(root);
|
2020-01-24 22:32:36 +08:00
|
|
|
goto out_free;
|
|
|
|
}
|
2018-05-21 09:09:42 +08:00
|
|
|
root_item = &root->root_item;
|
|
|
|
|
|
|
|
subvol_info->treeid = key.objectid;
|
|
|
|
|
|
|
|
subvol_info->generation = btrfs_root_generation(root_item);
|
|
|
|
subvol_info->flags = btrfs_root_flags(root_item);
|
|
|
|
|
|
|
|
memcpy(subvol_info->uuid, root_item->uuid, BTRFS_UUID_SIZE);
|
|
|
|
memcpy(subvol_info->parent_uuid, root_item->parent_uuid,
|
|
|
|
BTRFS_UUID_SIZE);
|
|
|
|
memcpy(subvol_info->received_uuid, root_item->received_uuid,
|
|
|
|
BTRFS_UUID_SIZE);
|
|
|
|
|
|
|
|
subvol_info->ctransid = btrfs_root_ctransid(root_item);
|
|
|
|
subvol_info->ctime.sec = btrfs_stack_timespec_sec(&root_item->ctime);
|
|
|
|
subvol_info->ctime.nsec = btrfs_stack_timespec_nsec(&root_item->ctime);
|
|
|
|
|
|
|
|
subvol_info->otransid = btrfs_root_otransid(root_item);
|
|
|
|
subvol_info->otime.sec = btrfs_stack_timespec_sec(&root_item->otime);
|
|
|
|
subvol_info->otime.nsec = btrfs_stack_timespec_nsec(&root_item->otime);
|
|
|
|
|
|
|
|
subvol_info->stransid = btrfs_root_stransid(root_item);
|
|
|
|
subvol_info->stime.sec = btrfs_stack_timespec_sec(&root_item->stime);
|
|
|
|
subvol_info->stime.nsec = btrfs_stack_timespec_nsec(&root_item->stime);
|
|
|
|
|
|
|
|
subvol_info->rtransid = btrfs_root_rtransid(root_item);
|
|
|
|
subvol_info->rtime.sec = btrfs_stack_timespec_sec(&root_item->rtime);
|
|
|
|
subvol_info->rtime.nsec = btrfs_stack_timespec_nsec(&root_item->rtime);
|
|
|
|
|
|
|
|
if (key.objectid != BTRFS_FS_TREE_OBJECTID) {
|
|
|
|
/* Search root tree for ROOT_BACKREF of this subvolume */
|
|
|
|
key.type = BTRFS_ROOT_BACKREF_KEY;
|
|
|
|
key.offset = 0;
|
2020-01-24 22:32:36 +08:00
|
|
|
ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);
|
2018-05-21 09:09:42 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (path->slots[0] >=
|
|
|
|
btrfs_header_nritems(path->nodes[0])) {
|
2020-01-24 22:32:36 +08:00
|
|
|
ret = btrfs_next_leaf(fs_info->tree_root, path);
|
2018-05-21 09:09:42 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
if (key.objectid == subvol_info->treeid &&
|
|
|
|
key.type == BTRFS_ROOT_BACKREF_KEY) {
|
|
|
|
subvol_info->parent_id = key.offset;
|
|
|
|
|
|
|
|
rref = btrfs_item_ptr(leaf, slot, struct btrfs_root_ref);
|
|
|
|
subvol_info->dirid = btrfs_root_ref_dirid(leaf, rref);
|
|
|
|
|
|
|
|
item_off = btrfs_item_ptr_offset(leaf, slot)
|
|
|
|
+ sizeof(struct btrfs_root_ref);
|
2021-10-22 02:58:35 +08:00
|
|
|
item_len = btrfs_item_size(leaf, slot)
|
2018-05-21 09:09:42 +08:00
|
|
|
- sizeof(struct btrfs_root_ref);
|
|
|
|
read_extent_buffer(leaf, subvol_info->name,
|
|
|
|
item_off, item_len);
|
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-11-10 14:06:31 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
path = NULL;
|
2018-05-21 09:09:42 +08:00
|
|
|
if (copy_to_user(argp, subvol_info, sizeof(*subvol_info)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out:
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(root);
|
2020-01-24 22:32:36 +08:00
|
|
|
out_free:
|
2018-05-21 09:09:42 +08:00
|
|
|
btrfs_free_path(path);
|
2020-06-16 23:31:59 +08:00
|
|
|
kfree(subvol_info);
|
2018-05-21 09:09:42 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-05-21 09:09:43 +08:00
|
|
|
/*
|
|
|
|
* Return ROOT_REF information of the subvolume containing this inode
|
|
|
|
* except the subvolume name.
|
|
|
|
*/
|
2022-01-16 10:48:47 +08:00
|
|
|
static int btrfs_ioctl_get_subvol_rootref(struct btrfs_root *root,
|
2022-01-05 16:30:06 +08:00
|
|
|
void __user *argp)
|
2018-05-21 09:09:43 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_get_subvol_rootref_args *rootrefs;
|
|
|
|
struct btrfs_root_ref *rref;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
u64 objectid;
|
|
|
|
int slot;
|
|
|
|
int ret;
|
|
|
|
u8 found;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rootrefs = memdup_user(argp, sizeof(*rootrefs));
|
|
|
|
if (IS_ERR(rootrefs)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return PTR_ERR(rootrefs);
|
|
|
|
}
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
objectid = root->root_key.objectid;
|
2018-05-21 09:09:43 +08:00
|
|
|
key.objectid = objectid;
|
|
|
|
key.type = BTRFS_ROOT_REF_KEY;
|
|
|
|
key.offset = rootrefs->min_treeid;
|
|
|
|
found = 0;
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
root = root->fs_info->tree_root;
|
2018-05-21 09:09:43 +08:00
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (path->slots[0] >=
|
|
|
|
btrfs_header_nritems(path->nodes[0])) {
|
|
|
|
ret = btrfs_next_leaf(root, path);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
while (1) {
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
slot = path->slots[0];
|
|
|
|
|
|
|
|
btrfs_item_key_to_cpu(leaf, &key, slot);
|
|
|
|
if (key.objectid != objectid || key.type != BTRFS_ROOT_REF_KEY) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (found == BTRFS_MAX_ROOTREF_BUFFER_NUM) {
|
|
|
|
ret = -EOVERFLOW;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
rref = btrfs_item_ptr(leaf, slot, struct btrfs_root_ref);
|
|
|
|
rootrefs->rootref[found].treeid = key.offset;
|
|
|
|
rootrefs->rootref[found].dirid =
|
|
|
|
btrfs_root_ref_dirid(leaf, rref);
|
|
|
|
found++;
|
|
|
|
|
|
|
|
ret = btrfs_next_item(root, path);
|
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret > 0) {
|
|
|
|
ret = -EUCLEAN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
btrfs: free btrfs_path before copying root refs to userspace
Syzbot reported the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted
------------------------------------------------------
syz-executor307/3029 is trying to acquire lock:
ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576
but task is already holding lock:
ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (btrfs-root-00){++++}-{3:3}:
down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624
__btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637
btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944
btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132
commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459
btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343
flush_space+0x66c/0x738 fs/btrfs/space-info.c:786
btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
-> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603
__mutex_lock kernel/locking/mutex.c:747 [inline]
mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799
btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline]
start_transaction+0x248/0x944 fs/btrfs/transaction.c:752
btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781
btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651
btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697
lookup_open fs/namei.c:3413 [inline]
open_last_lookups fs/namei.c:3481 [inline]
path_openat+0x804/0x11c4 fs/namei.c:3688
do_filp_open+0xdc/0x1b8 fs/namei.c:3718
do_sys_openat2+0xb8/0x22c fs/open.c:1313
do_sys_open fs/open.c:1329 [inline]
__do_sys_openat fs/open.c:1345 [inline]
__se_sys_openat fs/open.c:1340 [inline]
__arm64_sys_openat+0xb0/0xe0 fs/open.c:1340
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
-> #1 (sb_internal#2){.+.+}-{0:0}:
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
__sb_start_write include/linux/fs.h:1826 [inline]
sb_start_intwrite include/linux/fs.h:1948 [inline]
start_transaction+0x360/0x944 fs/btrfs/transaction.c:683
btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795
btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103
btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145
inode_update_time fs/inode.c:1872 [inline]
touch_atime+0x1f0/0x4a8 fs/inode.c:1945
file_accessed include/linux/fs.h:2516 [inline]
btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407
call_mmap include/linux/fs.h:2192 [inline]
mmap_region+0x7fc/0xc14 mm/mmap.c:1752
do_mmap+0x644/0x97c mm/mmap.c:1540
vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552
ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586
__do_sys_mmap arch/arm64/kernel/sys.c:28 [inline]
__se_sys_mmap arch/arm64/kernel/sys.c:21 [inline]
__arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
-> #0 (&mm->mmap_lock){++++}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3095 [inline]
check_prevs_add kernel/locking/lockdep.c:3214 [inline]
validate_chain kernel/locking/lockdep.c:3829 [inline]
__lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053
lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666
__might_fault+0x7c/0xb4 mm/memory.c:5577
_copy_to_user include/linux/uaccess.h:134 [inline]
copy_to_user include/linux/uaccess.h:160 [inline]
btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203
btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
other info that might help us debug this:
Chain exists of:
&mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(btrfs-root-00);
                               lock(&fs_info->reloc_mutex);
                               lock(btrfs-root-00);
  lock(&mm->mmap_lock);
*** DEADLOCK ***
1 lock held by syz-executor307/3029:
#0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
#0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
#0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
stack backtrace:
CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022
Call trace:
dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156
show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106
dump_stack+0x1c/0x58 lib/dump_stack.c:113
print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053
check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175
check_prev_add kernel/locking/lockdep.c:3095 [inline]
check_prevs_add kernel/locking/lockdep.c:3214 [inline]
validate_chain kernel/locking/lockdep.c:3829 [inline]
__lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053
lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666
__might_fault+0x7c/0xb4 mm/memory.c:5577
_copy_to_user include/linux/uaccess.h:134 [inline]
copy_to_user include/linux/uaccess.h:160 [inline]
btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203
btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
We do generally the right thing here, copying the references into a
temporary buffer, however we are still holding the path when we do
copy_to_user from the temporary buffer. Fix this by freeing the path
before we copy to user space.
Reported-by: syzbot+4ef9e52e464c6ff47d9d@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
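The ordering this fix relies on is a general one: gather results into a private buffer while the lock is held, drop the lock, and only then perform the copy that may fault. Below is a minimal userspace sketch of that ordering, not the kernel code itself. It assumes a pthread mutex stands in for the tree lock held through the btrfs_path, and memcpy() stands in for copy_to_user(); `struct ref_table`, `snapshot_refs` and `MAX_REFS` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_REFS 8	/* hypothetical capacity, for illustration only */

/* Hypothetical shared table; the mutex plays the role of the tree lock. */
struct ref_table {
	pthread_mutex_t lock;
	unsigned long long ids[MAX_REFS];
	size_t count;
};

/*
 * Copy up to 'cap' ids into 'dst' (standing in for user memory).
 * Returns the number of ids copied.
 */
static size_t snapshot_refs(struct ref_table *t, unsigned long long *dst,
			    size_t cap)
{
	unsigned long long tmp[MAX_REFS];
	size_t n;

	assert(cap <= MAX_REFS);

	/* Step 1: gather under the lock into a private buffer. */
	pthread_mutex_lock(&t->lock);
	n = t->count < cap ? t->count : cap;
	memcpy(tmp, t->ids, n * sizeof(tmp[0]));
	pthread_mutex_unlock(&t->lock);

	/* Step 2: the lock is dropped; only now touch the destination. */
	memcpy(dst, tmp, n * sizeof(tmp[0]));
	return n;
}
```

A caller would initialise the table with PTHREAD_MUTEX_INITIALIZER and pass a destination array of at least `cap` entries. The point of the sketch is only the order of the two steps, which mirrors freeing the path before the copy to user space.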
2022-11-08 00:44:51 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
|
2018-05-21 09:09:43 +08:00
|
|
|
if (!ret || ret == -EOVERFLOW) {
|
|
|
|
rootrefs->num_items = found;
|
|
|
|
/* update min_treeid for next search */
|
|
|
|
if (found)
|
|
|
|
rootrefs->min_treeid =
|
|
|
|
rootrefs->rootref[found - 1].treeid + 1;
|
|
|
|
if (copy_to_user(argp, rootrefs, sizeof(*rootrefs)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
kfree(rootrefs);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
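The min_treeid update in the function above lets a caller resume where the previous call stopped whenever -EOVERFLOW is reported. The following sketch models that protocol in plain C without touching the kernel. `fake_get_rootref`, `struct fake_args`, `count_all_rootrefs` and the buffer size of 4 are hypothetical stand-ins for the ioctl, struct btrfs_ioctl_get_subvol_rootref_args and BTRFS_MAX_ROOTREF_BUFFER_NUM; the model is simplified and does not reproduce the kernel's tree walk or its other error paths.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FAKE_BUF_NUM 4	/* hypothetical; stands in for the real buffer limit */

struct fake_args {
	unsigned long long min_treeid;
	unsigned long long treeid[FAKE_BUF_NUM];
	unsigned char num_items;
};

/*
 * Model of one call: scan the sorted 'ids' for entries >= min_treeid,
 * fill the buffer, and report -EOVERFLOW if a further entry did not fit.
 */
static int fake_get_rootref(const unsigned long long *ids, size_t n,
			    struct fake_args *args)
{
	unsigned char found = 0;
	int ret = 0;
	size_t i;

	assert(ids && args);

	for (i = 0; i < n; i++) {
		if (ids[i] < args->min_treeid)
			continue;
		if (found == FAKE_BUF_NUM) {
			ret = -EOVERFLOW;
			break;
		}
		args->treeid[found++] = ids[i];
	}

	args->num_items = found;
	/* Advance min_treeid so the next call starts after the last entry. */
	if (found)
		args->min_treeid = args->treeid[found - 1] + 1;
	return ret;
}

/* Caller side: repeat while the buffer overflows; returns the total count. */
static long count_all_rootrefs(const unsigned long long *ids, size_t n)
{
	struct fake_args args = { .min_treeid = 0 };
	long total = 0;
	int ret;

	do {
		ret = fake_get_rootref(ids, n, &args);
		if (ret < 0 && ret != -EOVERFLOW)
			return ret;
		total += args.num_items;
	} while (ret == -EOVERFLOW);

	return total;
}
```

With nine ids and a buffer of four, the caller in this model needs three calls: two that return -EOVERFLOW with four items each, and a final one that returns 0 with the remaining item.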
|
|
|
|
|
2009-09-22 04:00:26 +08:00
|
|
|
static noinline int btrfs_ioctl_snap_destroy(struct file *file,
|
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
@subvol1/
@subvol2/
@subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. The current subvolume delete
ioctl takes a file handle as its argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl, BTRFS_IOC_SNAP_DESTROY_V2, that takes
the extended structure with flags so that a subvolume can be deleted by
its subvolid.
Now we can call this new ioctl with the subvolume id on the same mount
point. It doesn't matter which subvolume was mounted, since we can reach
the desired one using the subvolume id and then delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-07 21:05:46 +08:00
|
|
|
void __user *arg,
|
|
|
|
bool destroy_v2)
|
2009-09-22 04:00:26 +08:00
|
|
|
{
|
2013-09-02 03:57:51 +08:00
|
|
|
struct dentry *parent = file->f_path.dentry;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(parent->d_sb);
|
2009-09-22 04:00:26 +08:00
|
|
|
struct dentry *dentry;
|
2015-03-18 06:25:59 +08:00
|
|
|
struct inode *dir = d_inode(parent);
|
2009-09-22 04:00:26 +08:00
|
|
|
struct inode *inode;
|
|
|
|
struct btrfs_root *root = BTRFS_I(dir)->root;
|
|
|
|
struct btrfs_root *dest = NULL;
|
2020-02-07 21:05:46 +08:00
|
|
|
struct btrfs_ioctl_vol_args *vol_args = NULL;
|
|
|
|
struct btrfs_ioctl_vol_args_v2 *vol_args2 = NULL;
|
btrfs: allow idmapped SNAP_DESTROY ioctls
Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.
This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.
This isn't a huge problem: a root user wanting to delete arbitrary
subvolumes can always create another (even detached) mount without
an idmapping attached.
In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is directly located under inode referenced by the fd
passed for the ioctl() in a follow-up commit.
Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be deleted, but it
can still be deleted by using the subvolume id:
/* Compile the following program as "delete_by_spec". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
struct btrfs_ioctl_vol_args_v2 args = {};
int ret;
args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
args.subvolid = subvolid;
ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
if (ret < 0)
return -1;
return 0;
}
int main(int argc, char *argv[])
{
int subvolid = 0;
if (argc < 3)
exit(1);
fprintf(stderr, "Opening %s\n", argv[1]);
int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
if (fd < 0)
exit(2);
subvolid = atoi(argv[2]);
fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
int ret = rm_subvolume_by_id(fd, subvolid);
if (ret < 0)
exit(3);
exit(0);
}
truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
btrfs subvolume create /mnt/B
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}
With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.
The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-27 18:48:53 +08:00
|
|
|
struct user_namespace *mnt_userns = file_mnt_user_ns(file);
|
2020-02-07 21:05:46 +08:00
|
|
|
char *subvol_name, *subvol_name_ptr = NULL;
|
|
|
|
int subvol_namelen;
|
2009-09-22 04:00:26 +08:00
|
|
|
int err = 0;
|
2020-02-07 21:05:46 +08:00
|
|
|
bool destroy_parent = false;
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2021-12-16 04:40:03 +08:00
|
|
|
/* We don't support snapshots with extent tree v2 yet. */
|
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"extent tree v2 doesn't support snapshot deletion yet");
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
    @subvol1/
    @subvol2/
    @subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. The current subvolume delete
ioctl takes a file handle as its argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl, BTRFS_IOC_SNAP_DESTROY_V2, that takes
the extended structure with flags so that a subvolume can be deleted by
its subvolid.
Now we can use this new ioctl, specifying the subvolume id, through the
same mount point. It doesn't matter which subvolume was mounted, since we
can reach the desired one by its subvolume id and then delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
the subvolume id, which is valid only when BTRFS_SUBVOL_SPEC_BY_ID is
set.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
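The argument handling described in this changelog can be modeled in userspace. What follows is a minimal sketch, not the kernel implementation: every `sketch_`/`SKETCH_` name is hypothetical, and the constant values are assumptions standing in for the definitions in the btrfs UAPI headers. It shows only the order of the checks: unknown flags are rejected first, then the call is classified as by-name or by-id, and a by-id call must name an id at or above the first free object id.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed stand-ins for the btrfs UAPI constants; illustrative only. */
#define SKETCH_SUBVOL_SPEC_BY_ID	(1ULL << 4)
#define SKETCH_DELETE_ARGS_MASK		(SKETCH_SUBVOL_SPEC_BY_ID)
#define SKETCH_FIRST_FREE_OBJECTID	256ULL

enum sketch_lookup { SKETCH_BY_NAME, SKETCH_BY_ID };

/*
 * Classify a v2 destroy request from its flags and subvolid.
 * Returns 0 and stores the lookup mode in *mode, or a negative errno.
 */
static int sketch_classify_destroy_args(uint64_t flags, uint64_t subvolid,
					enum sketch_lookup *mode)
{
	/* Any flag outside the delete mask is unsupported. */
	if (flags & ~SKETCH_DELETE_ARGS_MASK)
		return -EOPNOTSUPP;

	/* Without SPEC_BY_ID the name member of the union is used, as in v1. */
	if (!(flags & SKETCH_SUBVOL_SPEC_BY_ID)) {
		*mode = SKETCH_BY_NAME;
		return 0;
	}

	/* Ids below the first free object id never name a user subvolume. */
	if (subvolid < SKETCH_FIRST_FREE_OBJECTID)
		return -EINVAL;

	*mode = SKETCH_BY_ID;
	return 0;
}
```

A caller would pass the `flags` and `subvolid` fields of the request and branch on the returned mode. The sketch deliberately leaves out the path resolution and permission checks that the real ioctl performs afterwards.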
2020-02-07 21:05:46 +08:00
|
|
|
if (destroy_v2) {
|
|
|
|
vol_args2 = memdup_user(arg, sizeof(*vol_args2));
|
|
|
|
if (IS_ERR(vol_args2))
|
|
|
|
return PTR_ERR(vol_args2);
|
2016-09-21 20:31:29 +08:00
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
if (vol_args2->flags & ~BTRFS_SUBVOL_DELETE_ARGS_MASK) {
|
|
|
|
err = -EOPNOTSUPP;
|
|
|
|
goto out;
|
|
|
|
}
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
/*
|
|
|
|
* If SPEC_BY_ID is not set, we are looking for the subvolume by
|
|
|
|
* name, same as v1 currently does.
|
|
|
|
*/
|
|
|
|
if (!(vol_args2->flags & BTRFS_SUBVOL_SPEC_BY_ID)) {
|
|
|
|
vol_args2->name[BTRFS_SUBVOL_NAME_MAX] = 0;
|
|
|
|
subvol_name = vol_args2->name;
|
|
|
|
|
|
|
|
err = mnt_want_write_file(file);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
|
|
|
} else {
|
2021-07-27 18:48:54 +08:00
|
|
|
struct inode *old_dir;
|
btrfs: allow idmapped SNAP_DESTROY ioctls
Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.
Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.
This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
The root user is already subject to permission checks in
btrfs_may_delete(), including whether the i_uid/i_gid of the directory
the subvolume is located in have a mapping in the caller's idmapping.
That check can't currently fail, since a btrfs filesystem can't be
mounted with a non-initial idmapping, but it shows that even the root
user would fail to delete a subvolume if the relevant inode isn't mapped
in their idmapping. The idmapped mount case is the same in principle.
This isn't a huge problem: a root user wanting to delete arbitrary
subvolumes can always create another (even detached) mount without an
idmapping attached.
In addition, a follow-up commit will allow BTRFS_SUBVOL_SPEC_BY_ID for
cases where the subvolume to delete is located directly under the inode
referenced by the fd passed for the ioctl().
Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be deleted, yet it
can still be deleted by using the subvolume id:
/* Compile the following program as "delete_by_spec". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <inttypes.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int rm_subvolume_by_id(int fd, uint64_t subvolid)
{
	struct btrfs_ioctl_vol_args_v2 args = {};
	int ret;

	/* Select the target by id; the fd only identifies the filesystem. */
	args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
	args.subvolid = subvolid;

	ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
	if (ret < 0)
		return -1;

	return 0;
}

int main(int argc, char *argv[])
{
	uint64_t subvolid;

	if (argc < 3)
		exit(1);

	fprintf(stderr, "Opening %s\n", argv[1]);
	int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
	if (fd < 0)
		exit(2);

	/* Subvolume ids are 64-bit, so parse with strtoull() rather than atoi(). */
	subvolid = strtoull(argv[2], NULL, 10);
	fprintf(stderr, "Deleting subvolume with subvolid %" PRIu64 "\n", subvolid);
	int ret = rm_subvolume_by_id(fd, subvolid);
	if (ret < 0)
		exit(3);

	exit(0);
}
truncate -s 10G btrfs.img
mkfs.btrfs btrfs.img
export LOOPDEV=$(sudo losetup -f --show btrfs.img)
mount ${LOOPDEV} /mnt
sudo chown $(id -u):$(id -g) /mnt
btrfs subvolume create /mnt/A
btrfs subvolume create /mnt/B/C
# Get subvolume id via:
sudo btrfs subvolume show /mnt/A
# Save subvolid
SUBVOLID=<nr>
sudo umount /mnt
sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
./delete_by_spec /mnt ${SUBVOLID}
With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.
The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
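The scope restriction for idmapped mounts can be condensed into one predicate. This is a minimal sketch under stated assumptions, not the kernel code: `sketch_check_by_id_scope` and its two boolean parameters are hypothetical names. The first parameter stands for "the mount carries a non-initial idmapping", the second for "the subvolume's parent directory is the directory the ioctl's fd refers to".

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the by-id scope rule: returns 0 when deletion by id
 * may proceed to the normal permission checks, or -EOPNOTSUPP when the mount
 * is idmapped and the target is not directly under the fd's directory.
 */
static int sketch_check_by_id_scope(bool mount_is_idmapped,
				    bool parent_is_fd_dir)
{
	if (mount_is_idmapped && !parent_is_fd_dir)
		return -EOPNOTSUPP;

	return 0;
}
```

On a mount without an idmapping the predicate never rejects, which matches the note above that root can fall back to a plain mount to delete arbitrary subvolumes.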
2021-07-27 18:48:53 +08:00
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
if (vol_args2->subvolid < BTRFS_FIRST_FREE_OBJECTID) {
|
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = mnt_want_write_file(file);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
dentry = btrfs_get_dentry(fs_info->sb,
|
|
|
|
BTRFS_FIRST_FREE_OBJECTID,
|
2022-10-18 22:06:38 +08:00
|
|
|
vol_args2->subvolid, 0);
|
2020-02-07 21:05:46 +08:00
|
|
|
if (IS_ERR(dentry)) {
|
|
|
|
err = PTR_ERR(dentry);
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Change the default parent since the subvolume being
|
|
|
|
* deleted can be outside of the current mount point.
|
|
|
|
*/
|
|
|
|
parent = btrfs_get_parent(dentry);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* At this point dentry->d_name can point to '/' if the
|
|
|
|
* subvolume we want to destroy is outside of the
|
|
|
|
* current mount point, so we need to release the
|
|
|
|
* current dentry and execute the lookup to return a new
|
|
|
|
* one with ->d_name pointing to the
|
|
|
|
* <mount point>/subvol_name.
|
|
|
|
*/
|
|
|
|
dput(dentry);
|
|
|
|
if (IS_ERR(parent)) {
|
|
|
|
err = PTR_ERR(parent);
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
2021-07-27 18:48:54 +08:00
|
|
|
old_dir = dir;
|
2020-02-07 21:05:46 +08:00
|
|
|
dir = d_inode(parent);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If v2 was used with SPEC_BY_ID, a new parent was
|
|
|
|
* allocated since the subvolume can be outside of the
|
|
|
|
* current mount point. Later on we need to release this
|
|
|
|
* new parent dentry.
|
|
|
|
*/
|
|
|
|
destroy_parent = true;
|
|
|
|
|
2021-07-27 18:48:54 +08:00
|
|
|
/*
|
|
|
|
* On idmapped mounts, deletion via subvolid is
|
|
|
|
* restricted to subvolumes that are immediate
|
|
|
|
* children of the inode referenced by the file
|
|
|
|
* descriptor in the ioctl. Otherwise the idmapping
|
|
|
|
* could potentially be abused to delete subvolumes
|
|
|
|
* anywhere in the filesystem the user wouldn't be able
|
|
|
|
* to delete without an idmapped mount.
|
|
|
|
*/
|
|
|
|
if (old_dir != dir && mnt_userns != &init_user_ns) {
|
|
|
|
err = -EOPNOTSUPP;
|
|
|
|
goto free_parent;
|
|
|
|
}
|
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
subvol_name_ptr = btrfs_get_subvol_name_from_objectid(
|
|
|
|
fs_info, vol_args2->subvolid);
|
|
|
|
if (IS_ERR(subvol_name_ptr)) {
|
|
|
|
err = PTR_ERR(subvol_name_ptr);
|
|
|
|
goto free_parent;
|
|
|
|
}
|
2021-05-21 23:42:23 +08:00
|
|
|
/* subvol_name_ptr is already nul terminated */
|
2020-02-07 21:05:46 +08:00
|
|
|
subvol_name = (char *)kbasename(subvol_name_ptr);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
|
|
|
if (IS_ERR(vol_args))
|
|
|
|
return PTR_ERR(vol_args);
|
|
|
|
|
|
|
|
vol_args->name[BTRFS_PATH_NAME_MAX] = 0;
|
|
|
|
subvol_name = vol_args->name;
|
|
|
|
|
|
|
|
err = mnt_want_write_file(file);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
2009-09-22 04:00:26 +08:00
|
|
|
}
|
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
subvol_namelen = strlen(subvol_name);
|
2009-09-22 04:00:26 +08:00
|
|
|
|
2020-02-07 21:05:46 +08:00
|
|
|
if (strchr(subvol_name, '/') ||
|
|
|
|
strncmp(subvol_name, "..", subvol_namelen) == 0) {
|
|
|
|
err = -EINVAL;
|
|
|
|
goto free_subvol_name;
|
|
|
|
}
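The name check in the rows just above has a subtlety worth spelling out: because the comparison length is the length of the supplied name, it rejects not only ".." but every prefix of "..", that is "." and the empty string as well. A minimal userspace sketch of the same two tests, assuming a hypothetical helper name `sketch_check_subvol_name`:

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical userspace mirror of the name validation: returns 0 for an
 * acceptable single path component, -EINVAL otherwise.
 */
static int sketch_check_subvol_name(const char *name)
{
	size_t len = strlen(name);

	/* A separator means the caller passed a path, not a single name. */
	if (strchr(name, '/'))
		return -EINVAL;

	/*
	 * Comparing only the first len bytes against ".." matches "..",
	 * ".", and "" alike, since each is a prefix of "..".
	 */
	if (strncmp(name, "..", len) == 0)
		return -EINVAL;

	return 0;
}
```

Names that merely begin with dots, such as "...", still pass, because the comparison runs into the terminating NUL of ".." and reports a mismatch.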
|
|
|
|
|
|
|
|
if (!S_ISDIR(dir->i_mode)) {
|
|
|
|
err = -ENOTDIR;
|
|
|
|
goto free_subvol_name;
|
|
|
|
}
|
2014-04-15 22:41:44 +08:00
|
|
|
|
2016-05-26 12:05:12 +08:00
|
|
|
err = down_write_killable_nested(&dir->i_rwsem, I_MUTEX_PARENT);
|
|
|
|
if (err == -EINTR)
|
2020-02-07 21:05:46 +08:00
|
|
|
goto free_subvol_name;
|
2021-07-27 18:48:53 +08:00
|
|
|
dentry = lookup_one(mnt_userns, subvol_name, parent, subvol_namelen);
|
2009-09-22 04:00:26 +08:00
|
|
|
if (IS_ERR(dentry)) {
|
|
|
|
err = PTR_ERR(dentry);
|
|
|
|
goto out_unlock_dir;
|
|
|
|
}
|
|
|
|
|
2015-03-18 06:25:59 +08:00
|
|
|
if (d_really_is_negative(dentry)) {
|
2009-09-22 04:00:26 +08:00
|
|
|
err = -ENOENT;
|
|
|
|
goto out_dput;
|
|
|
|
}
|
|
|
|
|
2015-03-18 06:25:59 +08:00
|
|
|
inode = d_inode(dentry);
|
2010-10-30 03:46:43 +08:00
|
|
|
dest = BTRFS_I(inode)->root;
|
2013-10-31 13:03:04 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
2010-10-30 03:46:43 +08:00
|
|
|
/*
|
|
|
|
* Regular user. Only allow this with a special mount
|
|
|
|
* option, when the user has write+exec access to the
|
|
|
|
* subvol root, and when rmdir(2) would have been
|
|
|
|
* allowed.
|
|
|
|
*
|
|
|
|
* Note that this is _not_ a check that the subvol is
|
|
|
|
* empty or doesn't contain data that we wouldn't
|
|
|
|
* otherwise be able to delete.
|
|
|
|
*
|
|
|
|
* Users who want to delete empty subvols should try
|
|
|
|
* rmdir(2).
|
|
|
|
*/
|
|
|
|
err = -EPERM;
|
2016-06-23 06:54:23 +08:00
|
|
|
if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
|
2010-10-30 03:46:43 +08:00
|
|
|
goto out_dput;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not allow deletion if the parent dir is the same
|
|
|
|
* as the dir to be deleted. That means the ioctl
|
|
|
|
* must be called on the dentry referencing the root
|
|
|
|
* of the subvol, not a random directory contained
|
|
|
|
* within it.
|
|
|
|
*/
|
|
|
|
err = -EINVAL;
|
|
|
|
if (root == dest)
|
|
|
|
goto out_dput;
|
|
|
|
|
2021-07-27 18:48:53 +08:00
|
|
|
err = inode_permission(mnt_userns, inode, MAY_WRITE | MAY_EXEC);
|
2010-10-30 03:46:43 +08:00
|
|
|
if (err)
|
|
|
|
goto out_dput;
|
|
|
|
}
|
|
|
|
|
2012-10-22 19:39:53 +08:00
|
|
|
/* check if subvolume may be deleted by a user */
|
2021-07-27 18:48:53 +08:00
|
|
|
err = btrfs_may_delete(mnt_userns, dir, dentry, 1);
|
2012-10-22 19:39:53 +08:00
|
|
|
if (err)
|
|
|
|
goto out_dput;
|
|
|
|
|
2017-01-11 02:35:31 +08:00
|
|
|
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
|
2009-09-22 04:00:26 +08:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out_dput;
|
|
|
|
}
|
|
|
|
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_lock(inode, 0);
|
2018-04-18 10:34:52 +08:00
|
|
|
err = btrfs_delete_subvolume(dir, dentry);
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(inode, 0);
|
2022-01-21 05:53:04 +08:00
|
|
|
if (!err)
|
|
|
|
d_delete_notify(dir, dentry);
|
Btrfs: fix cleaner thread not working with inode cache option
Right now the inode cache inode is treated the same as the space cache
inode, i.e. the inode is kept in memory until the super block is put.
But this leads to an awkward situation.
If we're going to delete a snapshot/subvolume, btrfs will not
actually delete it and return free space, but will add it to the dead
roots list only once the last inode on this snap/subvol is destroyed.
Then we'll fetch the deleted roots and clean them up via the cleaner thread.
So here is the problem: if we enable the inode cache option, each
snap/subvol has a cached inode which is used to store inode allocation
information. And this cache inode will be kept in memory, as said
above. So with inode cache, a snap/subvol can only be added to the
dead roots list during the freeing roots stage in umount, so that we can
ONLY get space back after another remount (we clean up dead roots on mount).
But the real thing is that we'll no longer use the snap/subvol once we mark
it deleted, so we can safely iput its cache inode when we delete the snap/subvol.
Another thing is that we need to change the rules for dropping inodes: we
don't keep a snap/subvol's cache inode in memory till the end, so that we can
add the snap/subvol to the dead roots list in time.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 22:10:23 +08:00
|
|
|
|
2009-09-22 04:00:26 +08:00
|
|
|
out_dput:
|
|
|
|
dput(dentry);
|
|
|
|
out_unlock_dir:
|
2021-02-11 06:14:34 +08:00
|
|
|
btrfs_inode_unlock(dir, 0);
|
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
  @subvol1/
  @subvol2/
  @subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. The current subvolume delete
ioctl takes a file handle as its argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow deleting a subvolume by its
subvolid.
Now we can use this new ioctl, specifying the subvolume id and referring
to the same mount point. It doesn't matter which subvolume was mounted,
since we can reach the desired one using the subvolume id, and then
delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
the subvolume id specification, which is valid only when the
BTRFS_SUBVOL_SPEC_BY_ID flag is set.
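As a rough illustration of the flag-selected union described above, the following stand-alone sketch models the argument layout. The names demo_vol_args_v2, DEMO_SPEC_BY_ID and DEMO_NAME_MAX are hypothetical and their values are chosen only for illustration; the real definitions live in linux/btrfs.h and carry more fields. The sketch only shows how a single flag bit tells the consumer which union member to read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the real UAPI definitions. */
#define DEMO_SPEC_BY_ID	(1ULL << 4)
#define DEMO_NAME_MAX	4039

struct demo_vol_args_v2 {
	int64_t fd;
	uint64_t flags;
	union {
		char name[DEMO_NAME_MAX + 1];	/* valid when the flag is clear */
		uint64_t subvolid;		/* valid when the flag is set */
	};
};

/* Prepare args for deletion by name: flag clear, name member filled. */
static void demo_args_by_name(struct demo_vol_args_v2 *args, const char *name)
{
	assert(name != NULL);
	memset(args, 0, sizeof(*args));
	/* The buffer was zeroed, so copying at most DEMO_NAME_MAX bytes
	 * always leaves a terminating NUL. */
	strncpy(args->name, name, DEMO_NAME_MAX);
}

/* Prepare args for deletion by id: flag set, subvolid member filled. */
static void demo_args_by_id(struct demo_vol_args_v2 *args, uint64_t subvolid)
{
	memset(args, 0, sizeof(*args));
	args->flags = DEMO_SPEC_BY_ID;
	args->subvolid = subvolid;
}

/* Report which union member a consumer should read. */
static const char *demo_args_kind(const struct demo_vol_args_v2 *args)
{
	return (args->flags & DEMO_SPEC_BY_ID) ? "id" : "name";
}
```

A caller would fill the structure with one of the two helpers and hand it to the ioctl; a consumer checks the flag before touching either member, since reading the name member after an id was stored would yield garbage.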
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-07 21:05:46 +08:00
|
|
|
free_subvol_name:
|
|
|
|
kfree(subvol_name_ptr);
|
|
|
|
free_parent:
|
|
|
|
if (destroy_parent)
|
|
|
|
dput(parent);
|
2016-05-26 12:05:12 +08:00
|
|
|
out_drop_write:
|
2011-12-09 21:06:57 +08:00
|
|
|
mnt_drop_write_file(file);
|
2009-09-22 04:00:26 +08:00
|
|
|
out:
|
2020-02-07 21:05:46 +08:00
|
|
|
kfree(vol_args2);
|
2009-09-22 04:00:26 +08:00
|
|
|
kfree(vol_args);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2010-03-11 22:42:04 +08:00
|
|
|
static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
2013-01-24 06:07:38 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
2008-06-12 09:53:53 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2021-07-28 05:17:30 +08:00
|
|
|
struct btrfs_ioctl_defrag_range_args range = {0};
|
2008-11-13 03:34:12 +08:00
|
|
|
int ret;
|
|
|
|
|
2013-01-20 21:57:57 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2010-12-20 16:04:08 +08:00
|
|
|
|
2013-01-20 21:57:57 +08:00
|
|
|
if (btrfs_root_readonly(root)) {
|
|
|
|
ret = -EROFS;
|
|
|
|
goto out;
|
2012-11-06 00:54:08 +08:00
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
switch (inode->i_mode & S_IFMT) {
|
|
|
|
case S_IFDIR:
|
2009-01-06 05:57:23 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out;
|
|
|
|
}
|
2013-02-01 02:21:12 +08:00
|
|
|
ret = btrfs_defrag_root(root);
|
2008-06-12 09:53:53 +08:00
|
|
|
break;
|
|
|
|
case S_IFREG:
|
2018-07-18 06:08:59 +08:00
|
|
|
/*
|
|
|
|
* Note that this does not check the file descriptor for write
|
|
|
|
* access. This prevents defragmenting executables that are
|
|
|
|
* running and allows defrag on files open in read-only mode.
|
|
|
|
*/
|
|
|
|
if (!capable(CAP_SYS_ADMIN) &&
|
2021-01-21 21:19:24 +08:00
|
|
|
inode_permission(&init_user_ns, inode, MAY_WRITE)) {
|
2018-07-18 06:08:59 +08:00
|
|
|
ret = -EPERM;
|
2009-01-06 05:57:23 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2010-03-11 22:42:04 +08:00
|
|
|
|
|
|
|
if (argp) {
|
2021-07-28 05:17:30 +08:00
|
|
|
if (copy_from_user(&range, argp, sizeof(range))) {
|
2010-03-11 22:42:04 +08:00
|
|
|
ret = -EFAULT;
|
2010-03-20 19:24:48 +08:00
|
|
|
goto out;
|
2010-03-11 22:42:04 +08:00
|
|
|
}
|
|
|
|
/* compression requires us to start the IO */
|
2021-07-28 05:17:30 +08:00
|
|
|
if ((range.flags & BTRFS_DEFRAG_RANGE_COMPRESS)) {
|
|
|
|
range.flags |= BTRFS_DEFRAG_RANGE_START_IO;
|
|
|
|
range.extent_thresh = (u32)-1;
|
2010-03-11 22:42:04 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/* the rest are all set to zero by the initializer of range */
|
2021-07-28 05:17:30 +08:00
|
|
|
range.len = (u64)-1;
|
2010-03-11 22:42:04 +08:00
|
|
|
}
|
2021-08-06 16:12:32 +08:00
|
|
|
ret = btrfs_defrag_file(file_inode(file), &file->f_ra,
|
2021-07-28 05:17:30 +08:00
|
|
|
&range, BTRFS_OLDEST_GENERATION, 0);
|
2011-05-25 03:35:30 +08:00
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
2008-06-12 09:53:53 +08:00
|
|
|
break;
|
2010-05-16 22:49:58 +08:00
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
2009-01-06 05:57:23 +08:00
|
|
|
out:
|
2013-01-20 21:57:57 +08:00
|
|
|
mnt_drop_write_file(file);
|
2009-01-06 05:57:23 +08:00
|
|
|
return ret;
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_add_dev(struct btrfs_fs_info *fs_info, void __user *arg)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_vol_args *vol_args;
|
2021-11-25 17:14:43 +08:00
|
|
|
bool restore_op = false;
|
2008-06-12 09:53:53 +08:00
|
|
|
int ret;
|
|
|
|
|
2009-01-06 05:57:23 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2021-12-16 04:40:00 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info, "device add not supported on extent tree v2 yet");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2021-11-25 17:14:43 +08:00
|
|
|
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_DEV_ADD)) {
|
|
|
|
if (!btrfs_exclop_start_try_lock(fs_info, BTRFS_EXCLOP_DEV_ADD))
|
|
|
|
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can do the device add because we have a paused balance,
|
|
|
|
* change the exclusive op type and remember we should bring
|
|
|
|
* back the paused balance
|
|
|
|
*/
|
|
|
|
fs_info->exclusive_operation = BTRFS_EXCLOP_DEV_ADD;
|
|
|
|
btrfs_exclop_start_unlock(fs_info);
|
|
|
|
restore_op = true;
|
|
|
|
}
|
2012-01-17 04:04:47 +08:00
|
|
|
|
2009-04-08 15:06:54 +08:00
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
2012-01-17 04:04:47 +08:00
|
|
|
if (IS_ERR(vol_args)) {
|
|
|
|
ret = PTR_ERR(vol_args);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2008-07-25 00:20:14 +08:00
|
|
|
vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_init_new_device(fs_info, vol_args->name);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2014-07-01 00:58:56 +08:00
|
|
|
if (!ret)
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_info(fs_info, "disk added %s", vol_args->name);
|
2014-07-01 00:58:56 +08:00
|
|
|
|
2008-06-12 09:53:53 +08:00
|
|
|
kfree(vol_args);
|
2012-01-17 04:04:47 +08:00
|
|
|
out:
|
2021-11-25 17:14:43 +08:00
|
|
|
if (restore_op)
|
|
|
|
btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
|
|
|
|
else
|
|
|
|
btrfs_exclop_finish(fs_info);
|
2008-06-12 09:53:53 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-02-13 10:01:39 +08:00
|
|
|
static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
|
2008-06-12 09:53:53 +08:00
|
|
|
{
|
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which, if we give it a device path and nothing else, will call
btrfs_get_dev_args_from_path, which opens the block device, reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
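The ordering change can be sketched in user space as follows. This is only an illustration of the "resolve first, lock second" idea, not the kernel implementation: the two pthread mutexes stand in for ->open_mutex and the sb write protection, and demo_lookup_args, demo_resolve_path and demo_remove_device are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the two lock classes named in the splat. */
static pthread_mutex_t demo_open_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t demo_sb_write = PTHREAD_MUTEX_INITIALIZER;

struct demo_lookup_args {
	uint64_t devid;
};

/*
 * Resolving a path needs demo_open_mutex, so it must be called while
 * no other lock is held. Only "/dev/demo0" is known in this sketch.
 */
static int demo_resolve_path(const char *path, struct demo_lookup_args *args)
{
	int ret = -1;

	pthread_mutex_lock(&demo_open_mutex);
	if (strcmp(path, "/dev/demo0") == 0) {
		args->devid = 1;
		ret = 0;
	}
	pthread_mutex_unlock(&demo_open_mutex);
	return ret;
}

/*
 * Populate the lookup args first, then take demo_sb_write. The two
 * locks are never nested, so no open_mutex-under-sb_write dependency
 * is created.
 */
static int demo_remove_device(const char *path)
{
	struct demo_lookup_args args = { 0 };
	int ret;

	ret = demo_resolve_path(path, &args);	/* step 1: no locks held */
	if (ret)
		return ret;

	pthread_mutex_lock(&demo_sb_write);	/* step 2: write protection */
	assert(args.devid != 0);
	/* ... removal would happen here, using only the prepared args ... */
	pthread_mutex_unlock(&demo_sb_write);
	return 0;
}
```

With this shape, a failed lookup returns before the write-side lock is ever taken, which also keeps the error path free of unlock bookkeeping.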
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-06 04:12:44 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2016-02-13 10:01:39 +08:00
|
|
|
struct btrfs_ioctl_vol_args_v2 *vol_args;
|
2021-07-28 05:01:17 +08:00
|
|
|
struct block_device *bdev = NULL;
|
|
|
|
fmode_t mode;
|
2008-06-12 09:53:53 +08:00
|
|
|
int ret;
|
2021-05-15 03:21:27 +08:00
|
|
|
bool cancel = false;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2009-01-06 05:57:23 +08:00
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2009-04-08 15:06:54 +08:00
|
|
|
vol_args = memdup_user(arg, sizeof(*vol_args));
|
2021-11-16 19:50:25 +08:00
|
|
|
if (IS_ERR(vol_args))
|
|
|
|
return PTR_ERR(vol_args);
|
2008-06-12 09:53:53 +08:00
|
|
|
|
2020-02-21 20:30:14 +08:00
|
|
|
if (vol_args->flags & ~BTRFS_DEVICE_REMOVE_ARGS_MASK) {
|
2018-05-23 06:44:01 +08:00
|
|
|
ret = -EOPNOTSUPP;
|
|
|
|
goto out;
|
|
|
|
}
|
2021-10-06 04:12:44 +08:00
|
|
|
|
2021-05-15 03:21:27 +08:00
|
|
|
vol_args->name[BTRFS_SUBVOL_NAME_MAX] = '\0';
|
2021-10-06 04:12:44 +08:00
|
|
|
if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID) {
|
|
|
|
args.devid = vol_args->devid;
|
|
|
|
} else if (!strcmp("cancel", vol_args->name)) {
|
2021-05-15 03:21:27 +08:00
|
|
|
cancel = true;
|
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
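The dependency chain in the report above can be read as a directed graph: each entry #N adds an edge from a lock already held to the lock being acquired. The following is a minimal userspace sketch, not kernel code and not how lockdep is implemented. The enum labels, `struct dep`, `reaches()`, `has_cycle()` and both edge tables are hypothetical names introduced only for illustration. It assumes the fix removes the single edge from sb_writers to open_mutex, because the block device is then opened before write access is taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Labels for the lock classes named in the report (no real locks here). */
enum lock_class {
	LO_MUTEX,        /* &lo->lo_mutex */
	WQ_COMPLETION,   /* (wq_completion)loop0 */
	WORK_COMPLETION, /* (work_completion)(&lo->rootcg_work) */
	SB_WRITERS,      /* sb_writers */
	OPEN_MUTEX,      /* &disk->open_mutex */
	NR_CLASSES
};

/* One recorded dependency: `wanted` was acquired while `held` was held. */
struct dep {
	enum lock_class held;
	enum lock_class wanted;
};

/* Depth-first search: can `target` be reached from `from` along the edges? */
static bool reaches(const struct dep *deps, size_t n, enum lock_class from,
		    enum lock_class target, bool *visited)
{
	if (from == target)
		return true;
	if (visited[from])
		return false;
	visited[from] = true;
	for (size_t i = 0; i < n; i++) {
		if (deps[i].held == from &&
		    reaches(deps, n, deps[i].wanted, target, visited))
			return true;
	}
	return false;
}

/* A cycle exists if, for some edge held -> wanted, wanted leads back to held. */
static bool has_cycle(const struct dep *deps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		bool visited[NR_CLASSES] = { false };

		if (reaches(deps, n, deps[i].wanted, deps[i].held, visited))
			return true;
	}
	return false;
}

/* Edges as listed in the report, entries #0 to #4. */
static const struct dep deps_before_fix[] = {
	{ LO_MUTEX, WQ_COMPLETION },        /* #0: __loop_clr_fd flushes the workqueue */
	{ WQ_COMPLETION, WORK_COMPLETION }, /* #1: process_one_work */
	{ WORK_COMPLETION, SB_WRITERS },    /* #2: lo_write_bvec */
	{ SB_WRITERS, OPEN_MUTEX },         /* #3: btrfs_rm_device opens the bdev */
	{ OPEN_MUTEX, LO_MUTEX },           /* #4: lo_open */
};

/* Assumed graph once the device is opened before write access is taken. */
static const struct dep deps_after_fix[] = {
	{ LO_MUTEX, WQ_COMPLETION },
	{ WQ_COMPLETION, WORK_COMPLETION },
	{ WORK_COMPLETION, SB_WRITERS },
	{ OPEN_MUTEX, LO_MUTEX },
};
```

Calling `has_cycle()` on the first table finds the loop that lockdep complains about; on the second table no lock leads back to itself, which is the effect the patch is after.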
2021-10-06 04:12:44 +08:00
	} else {
		ret = btrfs_get_dev_args_from_path(fs_info, &args, vol_args->name);
		if (ret)
			goto out;
	}

	ret = mnt_want_write_file(file);
	if (ret)
		goto out;
2008-06-12 09:53:53 +08:00

2021-05-15 03:21:27 +08:00
	ret = exclop_start_or_cancel_reloc(fs_info, BTRFS_EXCLOP_DEV_REMOVE,
					   cancel);
	if (ret)
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
For device removal and replace we call btrfs_find_device_by_devspec,
which if we give it a device path and nothing else will call
btrfs_get_dev_args_from_path, which opens the block device and reads the
super block and then looks up our device based on that.
However at this point we're holding the sb write "lock", so reading the
block device pulls in the dependency of ->open_mutex, which produces the
following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #405 Not tainted
------------------------------------------------------
losetup/11576 is trying to acquire lock:
ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_get_by_path+0x98/0xa0
btrfs_get_bdev_and_sb+0x1b/0xb0
btrfs_find_device_by_devspec+0x12b/0x1c0
btrfs_rm_device+0x127/0x610
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11576:
#0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f31b02404cb
Instead what we want to do is populate our device lookup args before we
grab any locks, and then pass these args into btrfs_rm_device(). From
there we can find the device and do the appropriate removal.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
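The reordering described in this message can be sketched in plain userspace C. This is only an analogy under stated assumptions: `struct lookup_args`, `want_write()`, `drop_write()`, `get_args_from_path()` and the two `remove_*` helpers are hypothetical stand-ins, not the btrfs functions. A flag records whether the "device open" step ever ran while write access was held, which is the nesting the patch wants to avoid.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the device lookup arguments. */
struct lookup_args {
	char path[64];
	bool resolved;
};

static bool write_access_held;  /* models holding the sb write "lock" */
static bool opened_under_write; /* set if the open step ran under it */

static int want_write(void)
{
	write_access_held = true;
	return 0;
}

static void drop_write(void)
{
	write_access_held = false;
}

/* Models the step that opens the block device and reads its super block. */
static int get_args_from_path(struct lookup_args *args, const char *path)
{
	if (write_access_held)
		opened_under_write = true;
	if (strlen(path) >= sizeof(args->path))
		return -1;
	strcpy(args->path, path);
	args->resolved = true;
	return 0;
}

/* Old ordering: the lookup happens inside the write-protected region. */
static int remove_lookup_inside(const char *path)
{
	struct lookup_args args = { .resolved = false };
	int ret = want_write();

	if (ret)
		return ret;
	ret = get_args_from_path(&args, path);
	drop_write();
	return ret;
}

/* New ordering: resolve the arguments first, then take write access. */
static int remove_lookup_first(const char *path)
{
	struct lookup_args args = { .resolved = false };
	int ret = get_args_from_path(&args, path);

	if (ret)
		return ret;
	ret = want_write();
	if (ret)
		return ret;
	/* The removal itself would use the pre-resolved args here. */
	drop_write();
	return 0;
}
```

With the lookup done up front, the open step never nests inside the write-protected region, so the flag stays clear; the old ordering sets it on the first call.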
2021-10-06 04:12:44 +08:00
		goto err_drop;
2013-05-17 18:52:45 +08:00

btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
	/* Exclusive operation is now claimed */
	ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
2021-05-15 03:21:27 +08:00

2020-08-25 23:02:32 +08:00
	btrfs_exclop_finish(fs_info);
2013-05-17 18:52:45 +08:00

2016-02-13 10:01:39 +08:00
	if (!ret) {
2016-02-16 01:15:21 +08:00
		if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
2016-06-23 06:54:23 +08:00
			btrfs_info(fs_info, "device deleted: id %llu",
2016-02-13 10:01:39 +08:00
					vol_args->devid);
		else
2016-06-23 06:54:23 +08:00
			btrfs_info(fs_info, "device deleted: %s",
2016-02-13 10:01:39 +08:00
					vol_args->name);
	}
2014-09-04 19:09:15 +08:00
err_drop:
2013-01-20 21:57:57 +08:00
	mnt_drop_write_file(file);
2021-07-28 05:01:17 +08:00
	if (bdev)
		blkdev_put(bdev, mode);
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
out:
	btrfs_put_dev_args_from_path(&args);
	kfree(vol_args);
2008-06-12 09:53:53 +08:00
	return ret;
}

2012-11-26 16:44:50 +08:00
static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
2008-06-12 09:53:53 +08:00
{
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
	BTRFS_DEV_LOOKUP_ARGS(args);
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
2008-06-12 09:53:53 +08:00
	struct btrfs_ioctl_vol_args *vol_args;
2021-07-28 05:01:17 +08:00
	struct block_device *bdev = NULL;
	fmode_t mode;
2008-06-12 09:53:53 +08:00
	int ret;
2022-01-21 21:45:22 +08:00
	bool cancel = false;
2008-06-12 09:53:53 +08:00

2009-01-06 05:57:23 +08:00
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2016-05-04 20:10:47 +08:00
	vol_args = memdup_user(arg, sizeof(*vol_args));
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
	if (IS_ERR(vol_args))
		return PTR_ERR(vol_args);

2016-05-04 20:10:47 +08:00
	vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
	if (!strcmp("cancel", vol_args->name)) {
		cancel = true;
	} else {
		ret = btrfs_get_dev_args_from_path(fs_info, &args, vol_args->name);
		if (ret)
			goto out;
	}

	ret = mnt_want_write_file(file);
	if (ret)
		goto out;
2021-05-15 03:21:27 +08:00

	ret = exclop_start_or_cancel_reloc(fs_info, BTRFS_EXCLOP_DEV_REMOVE,
					   cancel);
	if (ret == 0) {
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
2021-10-06 04:12:44 +08:00
		ret = btrfs_rm_device(fs_info, &args, &bdev, &mode);
2021-05-15 03:21:27 +08:00
		if (!ret)
			btrfs_info(fs_info, "disk deleted %s", vol_args->name);
		btrfs_exclop_finish(fs_info);
	}
2013-05-17 18:52:45 +08:00

2013-01-20 21:57:57 +08:00
	mnt_drop_write_file(file);
2021-07-28 05:01:17 +08:00
	if (bdev)
		blkdev_put(bdev, mode);
2021-10-06 04:12:44 +08:00
|
|
|
out:
|
|
|
|
btrfs_put_dev_args_from_path(&args);
|
|
|
|
kfree(vol_args);
|
2008-06-12 09:53:53 +08:00
|
|
|
return ret;
|
|
|
|
}
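The commit message above argues for populating the device lookup args before grabbing any locks. The ordering can be sketched as a small user-space model. Everything here is hypothetical and for illustration only: `sb_write_held`, `struct dev_lookup_args`, `resolve_args_from_path()` and `remove_device()` are stand-ins, not the kernel's types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock write "lock". */
static bool sb_write_held;

/* Hypothetical lookup key, resolved once and then passed around. */
struct dev_lookup_args {
	unsigned long long devid;
	bool resolved;
};

/*
 * Stand-in for the slow step that opens the block device and reads its
 * super block. It pulls in other locks, so it must run with nothing held.
 */
static int resolve_args_from_path(const char *path, struct dev_lookup_args *args)
{
	assert(!sb_write_held);		/* the ordering rule being modelled */
	if (path == NULL || path[0] == '\0')
		return -1;
	args->devid = 1;		/* placeholder value */
	args->resolved = true;
	return 0;
}

static int remove_device(const char *path)
{
	struct dev_lookup_args args = { 0 };
	int ret;

	/* Step 1: resolve the lookup args before any lock is taken. */
	ret = resolve_args_from_path(path, &args);
	if (ret)
		return ret;

	/* Step 2: take the lock; only in-memory lookups happen under it. */
	sb_write_held = true;
	ret = args.resolved ? 0 : -1;
	sb_write_held = false;

	return ret;
}
```

The point of the split is that the step which opens the device never runs under the write lock, so it cannot add a new edge to the dependency chain that lockdep reported.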
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2011-03-11 22:41:01 +08:00
|
|
|
{
|
2011-06-08 16:27:56 +08:00
|
|
|
struct btrfs_ioctl_fs_info_args *fi_args;
|
2011-03-11 22:41:01 +08:00
|
|
|
struct btrfs_device *device;
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
|
2020-07-13 20:28:58 +08:00
|
|
|
u64 flags_in;
|
2011-06-08 16:27:56 +08:00
|
|
|
int ret = 0;
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2020-07-13 20:28:58 +08:00
|
|
|
fi_args = memdup_user(arg, sizeof(*fi_args));
|
|
|
|
if (IS_ERR(fi_args))
|
|
|
|
return PTR_ERR(fi_args);
|
|
|
|
|
|
|
|
flags_in = fi_args->flags;
|
|
|
|
memset(fi_args, 0, sizeof(*fi_args));
|
2011-06-08 16:27:56 +08:00
|
|
|
|
2017-06-16 06:09:21 +08:00
|
|
|
rcu_read_lock();
|
2011-06-08 16:27:56 +08:00
|
|
|
fi_args->num_devices = fs_devices->num_devices;
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2017-06-16 06:09:21 +08:00
|
|
|
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
|
2011-06-08 16:27:56 +08:00
|
|
|
if (device->devid > fi_args->max_id)
|
|
|
|
fi_args->max_id = device->devid;
|
2011-03-11 22:41:01 +08:00
|
|
|
}
|
2017-06-16 06:09:21 +08:00
|
|
|
rcu_read_unlock();
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2018-10-30 22:43:24 +08:00
|
|
|
memcpy(&fi_args->fsid, fs_devices->fsid, sizeof(fi_args->fsid));
|
2017-08-23 14:46:00 +08:00
|
|
|
fi_args->nodesize = fs_info->nodesize;
|
|
|
|
fi_args->sectorsize = fs_info->sectorsize;
|
|
|
|
fi_args->clone_alignment = fs_info->sectorsize;
|
2014-05-08 00:17:06 +08:00
|
|
|
|
2020-07-13 20:28:58 +08:00
|
|
|
if (flags_in & BTRFS_FS_INFO_FLAG_CSUM_INFO) {
|
|
|
|
fi_args->csum_type = btrfs_super_csum_type(fs_info->super_copy);
|
|
|
|
fi_args->csum_size = btrfs_super_csum_size(fs_info->super_copy);
|
|
|
|
fi_args->flags |= BTRFS_FS_INFO_FLAG_CSUM_INFO;
|
|
|
|
}
|
|
|
|
|
2020-07-13 20:28:59 +08:00
|
|
|
if (flags_in & BTRFS_FS_INFO_FLAG_GENERATION) {
|
|
|
|
fi_args->generation = fs_info->generation;
|
|
|
|
fi_args->flags |= BTRFS_FS_INFO_FLAG_GENERATION;
|
|
|
|
}
|
|
|
|
|
2020-07-13 20:29:00 +08:00
|
|
|
if (flags_in & BTRFS_FS_INFO_FLAG_METADATA_UUID) {
|
|
|
|
memcpy(&fi_args->metadata_uuid, fs_devices->metadata_uuid,
|
|
|
|
sizeof(fi_args->metadata_uuid));
|
|
|
|
fi_args->flags |= BTRFS_FS_INFO_FLAG_METADATA_UUID;
|
|
|
|
}
|
|
|
|
|
2011-06-08 16:27:56 +08:00
|
|
|
if (copy_to_user(arg, fi_args, sizeof(*fi_args)))
|
|
|
|
ret = -EFAULT;
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2011-06-08 16:27:56 +08:00
|
|
|
kfree(fi_args);
|
|
|
|
return ret;
|
2011-03-11 22:41:01 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2011-03-11 22:41:01 +08:00
|
|
|
{
|
2021-10-06 04:12:42 +08:00
|
|
|
BTRFS_DEV_LOOKUP_ARGS(args);
|
2011-03-11 22:41:01 +08:00
|
|
|
struct btrfs_ioctl_dev_info_args *di_args;
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
di_args = memdup_user(arg, sizeof(*di_args));
|
|
|
|
if (IS_ERR(di_args))
|
|
|
|
return PTR_ERR(di_args);
|
|
|
|
|
2021-10-06 04:12:42 +08:00
|
|
|
args.devid = di_args->devid;
|
2013-08-15 23:11:20 +08:00
|
|
|
if (!btrfs_is_empty_uuid(di_args->uuid))
|
2021-10-06 04:12:42 +08:00
|
|
|
args.uuid = di_args->uuid;
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2017-06-16 06:09:21 +08:00
|
|
|
rcu_read_lock();
|
2021-10-06 04:12:42 +08:00
|
|
|
dev = btrfs_find_device(fs_info->fs_devices, &args);
|
2011-03-11 22:41:01 +08:00
|
|
|
if (!dev) {
|
|
|
|
ret = -ENODEV;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
di_args->devid = dev->devid;
|
2014-09-03 21:35:38 +08:00
|
|
|
di_args->bytes_used = btrfs_device_get_bytes_used(dev);
|
|
|
|
di_args->total_bytes = btrfs_device_get_total_bytes(dev);
|
2011-03-11 22:41:01 +08:00
|
|
|
memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
|
2012-04-27 00:36:56 +08:00
|
|
|
if (dev->name) {
|
2018-08-02 15:19:07 +08:00
|
|
|
strncpy(di_args->path, rcu_str_deref(dev->name),
|
|
|
|
sizeof(di_args->path) - 1);
|
2012-04-27 00:36:56 +08:00
|
|
|
di_args->path[sizeof(di_args->path) - 1] = 0;
|
|
|
|
} else {
|
2012-03-19 23:17:22 +08:00
|
|
|
di_args->path[0] = '\0';
|
2012-04-27 00:36:56 +08:00
|
|
|
}
|
2011-03-11 22:41:01 +08:00
|
|
|
|
|
|
|
out:
|
2017-06-16 06:09:21 +08:00
|
|
|
rcu_read_unlock();
|
2011-03-11 22:41:01 +08:00
|
|
|
if (ret == 0 && copy_to_user(arg, di_args, sizeof(*di_args)))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(di_args);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-12-12 05:11:29 +08:00
|
|
|
static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
|
|
|
|
{
|
2013-01-24 06:07:38 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2009-12-12 05:11:29 +08:00
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_root *new_root;
|
|
|
|
struct btrfs_dir_item *di;
|
|
|
|
struct btrfs_trans_handle *trans;
|
2020-01-24 22:32:37 +08:00
|
|
|
struct btrfs_path *path = NULL;
|
2009-12-12 05:11:29 +08:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
u64 objectid = 0;
|
|
|
|
u64 dir_id;
|
2012-11-26 16:43:07 +08:00
|
|
|
int ret;
|
2009-12-12 05:11:29 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2012-11-26 16:43:07 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (copy_from_user(&objectid, argp, sizeof(objectid))) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out;
|
|
|
|
}
|
2009-12-12 05:11:29 +08:00
|
|
|
|
|
|
|
if (!objectid)
|
2013-09-13 22:04:10 +08:00
|
|
|
objectid = BTRFS_FS_TREE_OBJECTID;
|
2009-12-12 05:11:29 +08:00
|
|
|
|
2020-05-16 01:35:55 +08:00
|
|
|
new_root = btrfs_get_fs_root(fs_info, objectid, true);
|
2012-11-26 16:43:07 +08:00
|
|
|
if (IS_ERR(new_root)) {
|
|
|
|
ret = PTR_ERR(new_root);
|
|
|
|
goto out;
|
|
|
|
}
|
2020-01-24 22:32:37 +08:00
|
|
|
if (!is_fstree(new_root->root_key.objectid)) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out_free;
|
|
|
|
}
|
2009-12-12 05:11:29 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
2012-11-26 16:43:07 +08:00
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
2020-01-24 22:32:37 +08:00
|
|
|
goto out_free;
|
2012-11-26 16:43:07 +08:00
|
|
|
}
|
2009-12-12 05:11:29 +08:00
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 1);
|
2011-01-20 14:19:37 +08:00
|
|
|
if (IS_ERR(trans)) {
|
2012-11-26 16:43:07 +08:00
|
|
|
ret = PTR_ERR(trans);
|
2020-01-24 22:32:37 +08:00
|
|
|
goto out_free;
|
2009-12-12 05:11:29 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
dir_id = btrfs_super_root_dir(fs_info->super_copy);
|
|
|
|
di = btrfs_lookup_dir_item(trans, fs_info->tree_root, path,
|
2009-12-12 05:11:29 +08:00
|
|
|
dir_id, "default", 7, 1);
|
2010-05-29 17:47:24 +08:00
|
|
|
if (IS_ERR_OR_NULL(di)) {
|
2020-01-24 22:32:37 +08:00
|
|
|
btrfs_release_path(path);
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info,
|
2016-09-20 22:05:00 +08:00
|
|
|
"Umm, you don't have the default diritem, this isn't going to work");
|
2012-11-26 16:43:07 +08:00
|
|
|
ret = -ENOENT;
|
2020-01-24 22:32:37 +08:00
|
|
|
goto out_free;
|
2009-12-12 05:11:29 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_cpu_key_to_disk(&disk_key, &new_root->root_key);
|
|
|
|
btrfs_set_dir_item_key(path->nodes[0], di, &disk_key);
|
|
|
|
btrfs_mark_buffer_dirty(path->nodes[0]);
|
2020-01-24 22:32:37 +08:00
|
|
|
btrfs_release_path(path);
|
2009-12-12 05:11:29 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_set_fs_incompat(fs_info, DEFAULT_SUBVOL);
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2020-01-24 22:32:37 +08:00
|
|
|
out_free:
|
2020-01-24 22:33:01 +08:00
|
|
|
btrfs_put_root(new_root);
|
2020-01-24 22:32:37 +08:00
|
|
|
btrfs_free_path(path);
|
2012-11-26 16:43:07 +08:00
|
|
|
out:
|
|
|
|
mnt_drop_write_file(file);
|
|
|
|
return ret;
|
2009-12-12 05:11:29 +08:00
|
|
|
}
|
|
|
|
|
2018-04-02 17:24:11 +08:00
|
|
|
static void get_block_group_info(struct list_head *groups_list,
|
|
|
|
struct btrfs_ioctl_space_info *space)
|
2010-09-29 23:22:36 +08:00
|
|
|
{
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *block_group;
|
2010-09-29 23:22:36 +08:00
|
|
|
|
|
|
|
space->total_bytes = 0;
|
|
|
|
space->used_bytes = 0;
|
|
|
|
space->flags = 0;
|
|
|
|
list_for_each_entry(block_group, groups_list, list) {
|
|
|
|
space->flags = block_group->flags;
|
2019-10-24 00:48:22 +08:00
|
|
|
space->total_bytes += block_group->length;
|
2019-10-24 00:48:11 +08:00
|
|
|
space->used_bytes += block_group->used;
|
2010-09-29 23:22:36 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_space_info(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2010-01-14 02:19:06 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_space_args space_args;
|
|
|
|
struct btrfs_ioctl_space_info space;
|
|
|
|
struct btrfs_ioctl_space_info *dest;
|
2010-03-17 03:40:10 +08:00
|
|
|
struct btrfs_ioctl_space_info *dest_orig;
|
2011-04-11 23:56:31 +08:00
|
|
|
struct btrfs_ioctl_space_info __user *user_dest;
|
2010-01-14 02:19:06 +08:00
|
|
|
struct btrfs_space_info *info;
|
2017-09-19 23:01:23 +08:00
|
|
|
static const u64 types[] = {
|
|
|
|
BTRFS_BLOCK_GROUP_DATA,
|
|
|
|
BTRFS_BLOCK_GROUP_SYSTEM,
|
|
|
|
BTRFS_BLOCK_GROUP_METADATA,
|
|
|
|
BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA
|
|
|
|
};
|
2010-09-29 23:22:36 +08:00
|
|
|
int num_types = 4;
|
2010-03-17 03:40:10 +08:00
|
|
|
int alloc_size;
|
2010-01-14 02:19:06 +08:00
|
|
|
int ret = 0;
|
2011-02-15 05:04:23 +08:00
|
|
|
u64 slot_count = 0;
|
2010-09-29 23:22:36 +08:00
|
|
|
int i, c;
|
2010-01-14 02:19:06 +08:00
|
|
|
|
|
|
|
if (copy_from_user(&space_args,
|
|
|
|
(struct btrfs_ioctl_space_args __user *)arg,
|
|
|
|
sizeof(space_args)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2010-09-29 23:22:36 +08:00
|
|
|
for (i = 0; i < num_types; i++) {
|
|
|
|
struct btrfs_space_info *tmp;
|
|
|
|
|
|
|
|
info = NULL;
|
2020-09-02 05:40:37 +08:00
|
|
|
list_for_each_entry(tmp, &fs_info->space_info, list) {
|
2010-09-29 23:22:36 +08:00
|
|
|
if (tmp->flags == types[i]) {
|
|
|
|
info = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!info)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
down_read(&info->groups_sem);
|
|
|
|
for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
|
|
|
|
if (!list_empty(&info->block_groups[c]))
|
|
|
|
slot_count++;
|
|
|
|
}
|
|
|
|
up_read(&info->groups_sem);
|
|
|
|
}
|
2010-03-17 03:40:10 +08:00
|
|
|
|
2014-02-07 21:34:12 +08:00
|
|
|
/*
|
|
|
|
* Global block reserve, exported as a space_info
|
|
|
|
*/
|
|
|
|
slot_count++;
|
|
|
|
|
2010-03-17 03:40:10 +08:00
|
|
|
/* space_slots == 0 means they are asking for a count */
|
|
|
|
if (space_args.space_slots == 0) {
|
|
|
|
space_args.total_spaces = slot_count;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-09-29 23:22:36 +08:00
|
|
|
|
2011-02-15 05:04:23 +08:00
|
|
|
slot_count = min_t(u64, space_args.space_slots, slot_count);
|
2010-09-29 23:22:36 +08:00
|
|
|
|
2010-03-17 03:40:10 +08:00
|
|
|
alloc_size = sizeof(*dest) * slot_count;
|
2010-09-29 23:22:36 +08:00
|
|
|
|
2010-03-17 03:40:10 +08:00
|
|
|
/* we generally have at most 6 or so space infos, one for each raid
|
|
|
|
* level. So, a whole page should be more than enough for everyone
|
|
|
|
*/
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
if (alloc_size > PAGE_SIZE)
|
2010-03-17 03:40:10 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-01-14 02:19:06 +08:00
|
|
|
space_args.total_spaces = 0;
|
2015-11-04 22:38:29 +08:00
|
|
|
dest = kmalloc(alloc_size, GFP_KERNEL);
|
2010-03-17 03:40:10 +08:00
|
|
|
if (!dest)
|
|
|
|
return -ENOMEM;
|
|
|
|
dest_orig = dest;
|
2010-01-14 02:19:06 +08:00
|
|
|
|
2010-03-17 03:40:10 +08:00
|
|
|
/* now we have a buffer to copy into */
|
2010-09-29 23:22:36 +08:00
|
|
|
for (i = 0; i < num_types; i++) {
|
|
|
|
struct btrfs_space_info *tmp;
|
|
|
|
|
2011-02-15 05:04:23 +08:00
|
|
|
if (!slot_count)
|
|
|
|
break;
|
|
|
|
|
2010-09-29 23:22:36 +08:00
|
|
|
info = NULL;
|
2020-09-02 05:40:37 +08:00
|
|
|
list_for_each_entry(tmp, &fs_info->space_info, list) {
|
2010-09-29 23:22:36 +08:00
|
|
|
if (tmp->flags == types[i]) {
|
|
|
|
info = tmp;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2010-03-17 03:40:10 +08:00
|
|
|
|
2010-09-29 23:22:36 +08:00
|
|
|
if (!info)
|
|
|
|
continue;
|
|
|
|
down_read(&info->groups_sem);
|
|
|
|
for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
|
|
|
|
if (!list_empty(&info->block_groups[c])) {
|
2018-04-02 17:24:11 +08:00
|
|
|
get_block_group_info(&info->block_groups[c],
|
|
|
|
&space);
|
2010-09-29 23:22:36 +08:00
|
|
|
memcpy(dest, &space, sizeof(space));
|
|
|
|
dest++;
|
|
|
|
space_args.total_spaces++;
|
2011-02-15 05:04:23 +08:00
|
|
|
slot_count--;
|
2010-09-29 23:22:36 +08:00
|
|
|
}
|
2011-02-15 05:04:23 +08:00
|
|
|
if (!slot_count)
|
|
|
|
break;
|
2010-09-29 23:22:36 +08:00
|
|
|
}
|
|
|
|
up_read(&info->groups_sem);
|
2010-01-14 02:19:06 +08:00
|
|
|
}
|
|
|
|
|
2014-02-07 21:34:12 +08:00
|
|
|
/*
|
|
|
|
* Add global block reserve
|
|
|
|
*/
|
|
|
|
if (slot_count) {
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
|
2014-02-07 21:34:12 +08:00
|
|
|
|
|
|
|
spin_lock(&block_rsv->lock);
|
|
|
|
space.total_bytes = block_rsv->size;
|
|
|
|
space.used_bytes = block_rsv->size - block_rsv->reserved;
|
|
|
|
spin_unlock(&block_rsv->lock);
|
|
|
|
space.flags = BTRFS_SPACE_INFO_GLOBAL_RSV;
|
|
|
|
memcpy(dest, &space, sizeof(space));
|
|
|
|
space_args.total_spaces++;
|
|
|
|
}
|
|
|
|
|
2012-04-26 00:37:14 +08:00
|
|
|
user_dest = (struct btrfs_ioctl_space_info __user *)
|
2010-03-17 03:40:10 +08:00
|
|
|
(arg + sizeof(struct btrfs_ioctl_space_args));
|
|
|
|
|
|
|
|
if (copy_to_user(user_dest, dest_orig, alloc_size))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(dest_orig);
|
|
|
|
out:
|
|
|
|
if (ret == 0 && copy_to_user(arg, &space_args, sizeof(space_args)))
|
2010-01-14 02:19:06 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
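From user space, the "space_slots == 0 means they are asking for a count" convention leads to a two-call pattern. A minimal sketch, assuming the UAPI header `<linux/btrfs.h>` is installed and `fd` refers to an open file or directory on a btrfs mount; the helper name `query_space_info()` is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Returns a malloc()ed btrfs_ioctl_space_args whose spaces[] array has been
 * filled in by the kernel, or NULL with errno set. The caller frees it.
 */
static struct btrfs_ioctl_space_args *query_space_info(int fd)
{
	struct btrfs_ioctl_space_args probe;
	struct btrfs_ioctl_space_args *full;
	__u64 slots;

	/* First call: space_slots == 0 only asks how many entries exist. */
	memset(&probe, 0, sizeof(probe));
	if (ioctl(fd, BTRFS_IOC_SPACE_INFO, &probe) < 0)
		return NULL;

	slots = probe.total_spaces;
	full = calloc(1, sizeof(*full) +
			 slots * sizeof(struct btrfs_ioctl_space_info));
	if (!full)
		return NULL;

	/* Second call: offer that many slots and let the kernel fill them. */
	full->space_slots = slots;
	if (ioctl(fd, BTRFS_IOC_SPACE_INFO, full) < 0) {
		int saved = errno;

		free(full);
		errno = saved;
		return NULL;
	}
	return full;
}
```

After the second call, `total_spaces` holds the number of entries actually written, which may be smaller than the number of slots offered because the kernel clamps the slot count to what it has.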
|
|
|
|
|
2012-11-26 16:40:43 +08:00
|
|
|
static noinline long btrfs_ioctl_start_sync(struct btrfs_root *root,
|
|
|
|
void __user *argp)
|
Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish its commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-30 03:41:32 +08:00
|
|
|
{
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 transid;
|
|
|
|
|
Btrfs: fix uncompleted transaction
In some cases we need to commit the current transaction but do not want
to start a new one if no transaction is running, so we introduced
btrfs_attach_transaction(), which attaches to the current transaction
and returns -ENOENT if there is no running transaction.
But "no running transaction" does not mean the current transaction has
completed, because the running transaction is removed before it completes.
In most cases this does not matter. In some special cases, such as freezing
the fs, we need the transaction to be fully on disk, otherwise bugs follow:
for example, we may freeze the fs and dump the data on the disk, and if the
transaction has not completed we would dump inconsistent data. So we need
to fix this problem for those cases.
We fix it by introducing a function:
btrfs_attach_transaction_barrier()
Use this function when all transactions must be fully on disk, even those
that are no longer running.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 17:17:06 +08:00
|
|
|
trans = btrfs_attach_transaction_barrier(root);
|
2012-11-26 16:41:29 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
if (PTR_ERR(trans) != -ENOENT)
|
|
|
|
return PTR_ERR(trans);
|
|
|
|
|
|
|
|
/* No running transaction, don't bother */
|
|
|
|
transid = root->fs_info->last_trans_committed;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-10-30 03:41:32 +08:00
|
|
|
transid = trans->transid;
|
2021-11-06 04:45:28 +08:00
|
|
|
btrfs_commit_transaction_async(trans);
|
2012-11-26 16:41:29 +08:00
|
|
|
out:
|
2010-10-30 03:41:32 +08:00
|
|
|
if (argp)
|
|
|
|
if (copy_to_user(argp, &transid, sizeof(transid)))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static noinline long btrfs_ioctl_wait_sync(struct btrfs_fs_info *fs_info,
|
2012-11-26 16:40:43 +08:00
|
|
|
void __user *argp)
|
2010-10-30 03:41:32 +08:00
|
|
|
{
|
|
|
|
u64 transid;
|
|
|
|
|
|
|
|
if (argp) {
|
|
|
|
if (copy_from_user(&transid, argp, sizeof(transid)))
|
|
|
|
return -EFAULT;
|
|
|
|
} else {
|
|
|
|
transid = 0; /* current trans */
|
|
|
|
}
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_wait_for_commit(fs_info, transid);
|
2010-10-30 03:41:32 +08:00
|
|
|
}
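The "picky caller" pattern described in the START_SYNC/WAIT_SYNC commit message can be sketched from user space as follows. This is a minimal sketch, assuming `<linux/btrfs.h>` is available and `fd` is an open descriptor on a btrfs mount; the helper name `start_and_wait_sync()` is hypothetical.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Start a commit, then wait only for that specific transaction.
 * Returns 0 on success or a negative errno value.
 */
static int start_and_wait_sync(int fd)
{
	__u64 transid = 0;

	/* START_SYNC returns immediately and reports the transaction id. */
	if (ioctl(fd, BTRFS_IOC_START_SYNC, &transid) < 0)
		return -errno;

	/* Other work could be done here while the commit proceeds. */

	/* WAIT_SYNC blocks until the given transaction has committed. */
	if (ioctl(fd, BTRFS_IOC_WAIT_SYNC, &transid) < 0)
		return -errno;

	return 0;
}
```

Passing the transaction id back to WAIT_SYNC is what keeps the wait bounded to the commit this caller started, rather than whichever commit happens to be in progress later.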
|
|
|
|
|
2012-11-26 16:48:01 +08:00
|
|
|
static long btrfs_ioctl_scrub(struct file *file, void __user *arg)
|
2011-03-11 22:41:01 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(file_inode(file)->i_sb);
|
2011-03-11 22:41:01 +08:00
|
|
|
struct btrfs_ioctl_scrub_args *sa;
|
2012-11-26 16:48:01 +08:00
|
|
|
int ret;
|
2011-03-11 22:41:01 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2021-12-16 04:40:02 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info, "scrub is not supported on extent tree v2 yet");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2011-03-11 22:41:01 +08:00
|
|
|
sa = memdup_user(arg, sizeof(*sa));
|
|
|
|
if (IS_ERR(sa))
|
|
|
|
return PTR_ERR(sa);
|
|
|
|
|
2012-11-26 16:48:01 +08:00
|
|
|
if (!(sa->flags & BTRFS_SCRUB_READONLY)) {
|
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_scrub_dev(fs_info, sa->devid, sa->start, sa->end,
|
2012-11-06 01:29:28 +08:00
|
|
|
&sa->progress, sa->flags & BTRFS_SCRUB_READONLY,
|
|
|
|
0);
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2020-01-16 19:29:20 +08:00
|
|
|
/*
|
|
|
|
* Copy scrub args to user space even if btrfs_scrub_dev() returned an
|
|
|
|
* error. This is important as it allows user space to know how much
|
|
|
|
* progress scrub has made. For example, if scrub is canceled we get
|
|
|
|
* -ECANCELED from btrfs_scrub_dev() and return that error back to user
|
|
|
|
* space. Later user space can inspect the progress from the structure
|
|
|
|
* btrfs_ioctl_scrub_args and resume scrub from where it left off
|
|
|
|
* previously (btrfs-progs does this).
|
|
|
|
* If we fail to copy the btrfs_ioctl_scrub_args structure to user space
|
|
|
|
* then return -EFAULT to signal the structure was not copied or it may
|
|
|
|
* be corrupt and unreliable due to a partial copy.
|
|
|
|
*/
|
|
|
|
if (copy_to_user(arg, sa, sizeof(*sa)))
|
2011-03-11 22:41:01 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
|
2012-11-26 16:48:01 +08:00
|
|
|
if (!(sa->flags & BTRFS_SCRUB_READONLY))
|
|
|
|
mnt_drop_write_file(file);
|
|
|
|
out:
|
2011-03-11 22:41:01 +08:00
|
|
|
kfree(sa);
|
|
|
|
return ret;
|
|
|
|
}
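The comment in btrfs_ioctl_scrub() explains why the arguments are copied back even on error: the caller can read the progress and resume. A minimal user-space model of that pattern might look as follows. It assumes a simplified, hypothetical `toy_scrub_args` with a single `last_done` progress field; the real structure and the real scrub code are more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the scrub argument structure. */
struct toy_scrub_args {
	uint64_t start;
	uint64_t end;
	uint64_t last_done;	/* progress: first byte not yet scrubbed */
};

/*
 * Scrub [start, end). If cancel_at falls inside the range, stop there and
 * return -ECANCELED. Progress is written back in both cases, mirroring the
 * "copy to user space even on error" rule from the comment.
 */
static int toy_scrub(struct toy_scrub_args *sa, uint64_t cancel_at)
{
	if (cancel_at >= sa->start && cancel_at < sa->end) {
		sa->last_done = cancel_at;
		return -ECANCELED;
	}
	sa->last_done = sa->end;
	return 0;
}

/* Resume from the recorded progress; UINT64_MAX means "never cancel". */
static int toy_scrub_resume(struct toy_scrub_args *sa)
{
	sa->start = sa->last_done;
	return toy_scrub(sa, UINT64_MAX);
}

/* Cancel once at cancel_at, then resume; returns the final progress. */
static uint64_t toy_cancel_then_resume(uint64_t end, uint64_t cancel_at)
{
	struct toy_scrub_args sa = { .start = 0, .end = end };

	if (toy_scrub(&sa, cancel_at) == -ECANCELED)
		toy_scrub_resume(&sa);
	return sa.last_done;
}
```

Had the progress been dropped on -ECANCELED, the resume step in this sketch would have no choice but to start again from 0.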
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_scrub_cancel(struct btrfs_fs_info *fs_info)
|
2011-03-11 22:41:01 +08:00
|
|
|
{
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_scrub_cancel(fs_info);
|
2011-03-11 22:41:01 +08:00
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_scrub_progress(struct btrfs_fs_info *fs_info,
|
2011-03-11 22:41:01 +08:00
|
|
|
void __user *arg)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_scrub_args *sa;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
sa = memdup_user(arg, sizeof(*sa));
|
|
|
|
if (IS_ERR(sa))
|
|
|
|
return PTR_ERR(sa);
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_scrub_progress(fs_info, sa->devid, &sa->progress);
|
2011-03-11 22:41:01 +08:00
|
|
|
|
2018-12-15 03:45:13 +08:00
|
|
|
if (ret == 0 && copy_to_user(arg, sa, sizeof(*sa)))
|
2011-03-11 22:41:01 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(sa);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_get_dev_stats(struct btrfs_fs_info *fs_info,
|
2012-06-22 20:30:39 +08:00
|
|
|
void __user *arg)
|
2012-05-25 22:06:09 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_get_dev_stats *sa;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
sa = memdup_user(arg, sizeof(*sa));
|
|
|
|
if (IS_ERR(sa))
|
|
|
|
return PTR_ERR(sa);
|
|
|
|
|
2012-06-22 20:30:39 +08:00
|
|
|
if ((sa->flags & BTRFS_DEV_STATS_RESET) && !capable(CAP_SYS_ADMIN)) {
|
|
|
|
kfree(sa);
|
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_get_dev_stats(fs_info, sa);
|
2012-05-25 22:06:09 +08:00
|
|
|
|
2018-12-15 03:45:22 +08:00
|
|
|
if (ret == 0 && copy_to_user(arg, sa, sizeof(*sa)))
|
2012-05-25 22:06:09 +08:00
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
kfree(sa);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_dev_replace(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2012-11-06 22:08:53 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_dev_replace_args *p;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
2021-12-16 04:40:00 +08:00
|
|
|
if (btrfs_fs_incompat(fs_info, EXTENT_TREE_V2)) {
|
|
|
|
btrfs_err(fs_info, "device replace not supported on extent tree v2 yet");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2012-11-06 22:08:53 +08:00
|
|
|
p = memdup_user(arg, sizeof(*p));
|
|
|
|
if (IS_ERR(p))
|
|
|
|
return PTR_ERR(p);
|
|
|
|
|
|
|
|
switch (p->cmd) {
|
|
|
|
case BTRFS_IOCTL_DEV_REPLACE_CMD_START:
|
2017-07-17 15:45:34 +08:00
|
|
|
if (sb_rdonly(fs_info->sb)) {
|
2013-10-11 01:39:28 +08:00
|
|
|
ret = -EROFS;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-08-25 23:02:32 +08:00
|
|
|
if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_DEV_REPLACE)) {
|
2013-08-21 11:44:48 +08:00
|
|
|
ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
|
2012-11-06 22:08:53 +08:00
|
|
|
} else {
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = btrfs_dev_replace_by_ioctl(fs_info, p);
|
2020-08-25 23:02:32 +08:00
|
|
|
btrfs_exclop_finish(fs_info);
|
2012-11-06 22:08:53 +08:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
case BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS:
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_dev_replace_status(fs_info, p);
|
2012-11-06 22:08:53 +08:00
|
|
|
ret = 0;
|
|
|
|
break;
|
|
|
|
case BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL:
|
2018-02-12 23:33:30 +08:00
|
|
|
p->result = btrfs_dev_replace_cancel(fs_info);
|
2018-02-12 23:33:29 +08:00
|
|
|
ret = 0;
|
2012-11-06 22:08:53 +08:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2019-01-08 19:42:09 +08:00
|
|
|
if ((ret == 0 || ret == -ECANCELED) && copy_to_user(arg, p, sizeof(*p)))
|
2012-11-06 22:08:53 +08:00
|
|
|
ret = -EFAULT;
|
2013-10-11 01:39:28 +08:00
|
|
|
out:
|
2012-11-06 22:08:53 +08:00
|
|
|
kfree(p);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-07-07 22:48:38 +08:00
|
|
|
static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
int i;
|
2011-11-03 03:48:34 +08:00
|
|
|
u64 rel_ptr;
|
2011-07-07 22:48:38 +08:00
|
|
|
int size;
|
2011-11-06 16:07:10 +08:00
|
|
|
struct btrfs_ioctl_ino_path_args *ipa = NULL;
|
2011-07-07 22:48:38 +08:00
|
|
|
struct inode_fs_paths *ipath = NULL;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
|
2013-01-28 19:33:31 +08:00
|
|
|
if (!capable(CAP_DAC_READ_SEARCH))
|
2011-07-07 22:48:38 +08:00
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ipa = memdup_user(arg, sizeof(*ipa));
|
|
|
|
if (IS_ERR(ipa)) {
|
|
|
|
ret = PTR_ERR(ipa);
|
|
|
|
ipa = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
size = min_t(u32, ipa->size, 4096);
|
|
|
|
ipath = init_ipath(size, root, path);
|
|
|
|
if (IS_ERR(ipath)) {
|
|
|
|
ret = PTR_ERR(ipath);
|
|
|
|
ipath = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = paths_from_inode(ipa->inum, ipath);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
|
2011-11-20 20:31:57 +08:00
|
|
|
rel_ptr = ipath->fspath->val[i] -
|
|
|
|
(u64)(unsigned long)ipath->fspath->val;
|
2011-11-03 03:48:34 +08:00
|
|
|
ipath->fspath->val[i] = rel_ptr;
|
2011-07-07 22:48:38 +08:00
|
|
|
}
|
|
|
|
|
2022-11-10 14:06:29 +08:00
|
|
|
btrfs_free_path(path);
|
|
|
|
path = NULL;
|
2017-08-23 14:46:05 +08:00
|
|
|
ret = copy_to_user((void __user *)(unsigned long)ipa->fspath,
|
|
|
|
ipath->fspath, size);
|
2011-07-07 22:48:38 +08:00
|
|
|
if (ret) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
free_ipath(ipath);
|
|
|
|
kfree(ipa);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
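The loop in btrfs_ioctl_ino_to_path() rewrites each entry of `val[]` from a kernel address into an offset relative to `val` itself, so the buffer stays meaningful after it is copied to another address space. The sketch below shows the idea with a hypothetical `toy_container`; the layout and the helper names are assumptions for illustration, not the real `btrfs_data_container`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the container copied to user space. */
struct toy_container {
	uint32_t elem_cnt;
	uint64_t val[2];	/* offsets relative to the address of val */
	char strings[32];	/* path bytes stored after the offset table */
};

/* Producer side: record each string as an offset from val. */
static void toy_fill(struct toy_container *c)
{
	memcpy(c->strings, "dir/a\0dir/b", 12);	/* two NUL-terminated paths */
	c->elem_cnt = 2;
	c->val[0] = (uint64_t)(uintptr_t)&c->strings[0] -
		    (uint64_t)(uintptr_t)c->val;
	c->val[1] = (uint64_t)(uintptr_t)&c->strings[6] -
		    (uint64_t)(uintptr_t)c->val;
}

/* Consumer side: rebase offset i on the local address of val. */
static const char *toy_path(const struct toy_container *c, uint32_t i)
{
	return (const char *)(uintptr_t)((uint64_t)(uintptr_t)c->val + c->val[i]);
}

/* Copy the buffer elsewhere (as copy_to_user would) and resolve both paths. */
static int toy_roundtrip_ok(void)
{
	struct toy_container producer_copy, consumer_copy;

	toy_fill(&producer_copy);
	memcpy(&consumer_copy, &producer_copy, sizeof(consumer_copy));
	return strcmp(toy_path(&consumer_copy, 0), "dir/a") == 0 &&
	       strcmp(toy_path(&consumer_copy, 1), "dir/b") == 0;
}
```

Absolute pointers would dangle after the copy; offsets relative to `val` survive it because the strings move together with the table.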
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
|
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.
Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.
Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values. Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized. The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.
To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field. The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.
Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different). A version parameter and an 'if' statement will suffice.
Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.
Motivation and background, copied from the patchset cover letter:
Suppose we have a file with one extent:
root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync
Split the extent by overwriting it in the middle:
root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a
We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]
and the ref tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]
There are two references to the same extent with different, non-overlapping
byte offsets:
    [------------------72K extent at 1103101952----------------------]
    [--8K----------------|--4K unreachable----|--60K-----------------]
    ^                                         ^
    |                                         |
    [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                         |
                         v
                         [-----4K extent-----] at 1103175680
We want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5
root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5 <- same extent ref as offset 0
(offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5 \
inode 261 offset 20480 root 5 |
inode 261 offset 24576 root 5 |
inode 261 offset 28672 root 5 |
inode 261 offset 32768 root 5 |
inode 261 offset 36864 root 5 \
inode 261 offset 40960 root 5 > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5 / More processing required in userspace
inode 261 offset 49152 root 5 | to figure out these are all duplicates.
inode 261 offset 53248 root 5 |
inode 261 offset 57344 root 5 |
inode 261 offset 61440 root 5 |
inode 261 offset 65536 root 5 |
inode 261 offset 69632 root 5 /
In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.
With the patch, we just use one call to map all refs to the extent at once:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5
The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references. Userspace can use this information to make
better choices to dedup or defrag.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
[ copy background and motivation from cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
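The V2 rules from the message above (reserved fields must be zero, unknown flag bits are rejected) and the call counts it quotes can be sketched in plain C. `toy_logical_ino_args`, `TOY_IGNORE_OFFSET`, `toy_check_v2` and `toy_v1_calls` are hypothetical stand-ins for illustration; the field layout is an assumption modeled on the description, not copied from the UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the one flag bit the message defines. */
#define TOY_IGNORE_OFFSET (1ULL << 0)

/* Assumed layout: one former reserved[] slot now carries 'flags'. */
struct toy_logical_ino_args {
	uint64_t logical;
	uint64_t size;
	uint64_t reserved[3];
	uint64_t flags;
	uint64_t inodes;
};

/* V2-style validation: reject non-zero reserved fields and unknown flags. */
static int toy_check_v2(const struct toy_logical_ino_args *a)
{
	size_t i;

	for (i = 0; i < sizeof(a->reserved) / sizeof(a->reserved[0]); i++)
		if (a->reserved[i])
			return -EINVAL;
	if (a->flags & ~TOY_IGNORE_OFFSET)
		return -EINVAL;
	return 0;
}

/* Without the flag, one lookup per block is needed to cover an extent. */
static uint64_t toy_v1_calls(uint64_t extent_bytes, uint64_t block_bytes)
{
	return extent_bytes / block_bytes;
}
```

With 4K blocks this reproduces the figures in the message: 18 lookups for the 72K extent and 32768 for a 128MB one, against a single lookup once offset filtering is turned off.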
2017-09-23 01:58:46 +08:00
|
|
|
void __user *arg, int version)
|
2011-07-07 22:48:38 +08:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
int size;
|
|
|
|
struct btrfs_ioctl_logical_ino_args *loi;
|
|
|
|
struct btrfs_data_container *inodes = NULL;
|
|
|
|
struct btrfs_path *path = NULL;
|
2017-09-23 01:58:46 +08:00
|
|
|
bool ignore_offset;
|
2011-07-07 22:48:38 +08:00
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
loi = memdup_user(arg, sizeof(*loi));
|
2016-11-10 17:47:41 +08:00
|
|
|
if (IS_ERR(loi))
|
|
|
|
return PTR_ERR(loi);
|
2011-07-07 22:48:38 +08:00
|
|
|
|
2017-09-23 01:58:46 +08:00
|
|
|
if (version == 1) {
|
|
|
|
ignore_offset = false;
|
2017-09-23 01:58:47 +08:00
|
|
|
size = min_t(u32, loi->size, SZ_64K);
|
2017-09-23 01:58:46 +08:00
|
|
|
} else {
|
|
|
|
/* All reserved bits must be 0 for now */
|
|
|
|
if (memchr_inv(loi->reserved, 0, sizeof(loi->reserved))) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out_loi;
|
|
|
|
}
|
|
|
|
/* Only accept flags we have defined so far */
|
|
|
|
if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out_loi;
|
|
|
|
}
|
|
|
|
ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
|
2017-09-23 01:58:47 +08:00
|
|
|
size = min_t(u32, loi->size, SZ_16M);
|
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.
Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.
Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values. Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized. The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.
To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field. The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.
Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different). A version parameter and an 'if' statement will suffice.
Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.
Motivation and background, copied from the patchset cover letter:
Suppose we have a file with one extent:
root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync
Split the extent by overwriting it in the middle:
root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a
We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]
and the ref tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]
There are two references to the same extent with different, non-overlapping
byte offsets:
[------------------72K extent at 1103101952----------------------]
[--8K----------------|--4K unreachable----|--60K-----------------]
^ ^
| |
[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
|
v
[-----4K extent-----] at 1103175680
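The effect of the extent offset filter on the layout drawn above can be sketched as a small standalone model. The type and function names are hypothetical, and the range test is a simplification of what the backref walk does, not the kernel code itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one file extent reference into a data extent. */
struct model_ref {
	uint64_t file_offset;	/* key offset of the EXTENT_DATA item */
	uint64_t data_offset;	/* "extent data offset" within the extent */
	uint64_t data_len;	/* "nr" bytes referenced */
};

/*
 * Count the refs that a lookup at byte position pos inside the extent
 * would report. With ignore_offset set, every ref to the extent matches.
 */
static size_t model_count_matches(const struct model_ref *refs, size_t n,
				  uint64_t pos, bool ignore_offset)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		if (ignore_offset ||
		    (pos >= refs[i].data_offset &&
		     pos < refs[i].data_offset + refs[i].data_len))
			hits++;
	}
	return hits;
}
```

With the two refs to the 72K extent (offset 0 nr 8192, offset 12288 nr 61440), a filtered lookup at position 8192 matches nothing, while an unfiltered lookup at any position matches both.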
We want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5
root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5 <- same extent ref as offset 0
(offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5 \
inode 261 offset 20480 root 5 |
inode 261 offset 24576 root 5 |
inode 261 offset 28672 root 5 |
inode 261 offset 32768 root 5 |
inode 261 offset 36864 root 5 \
inode 261 offset 40960 root 5 > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5 / More processing required in userspace
inode 261 offset 49152 root 5 | to figure out these are all duplicates.
inode 261 offset 53248 root 5 |
inode 261 offset 57344 root 5 |
inode 261 offset 61440 root 5 |
inode 261 offset 65536 root 5 |
inode 261 offset 69632 root 5 /
In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.
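The call counts quoted above follow from dividing the extent length by the block size. A minimal sketch of that arithmetic, using a hypothetical helper name and assuming 4K blocks:

```c
#include <stdint.h>

/*
 * Number of per-block LOGICAL_INO lookups needed to cover an extent when
 * each lookup only reports refs that include the probed block.
 */
static uint64_t model_probes_needed(uint64_t extent_bytes, uint64_t block_bytes)
{
	return (extent_bytes + block_bytes - 1) / block_bytes;
}
```

For the 72K extent in the example this gives 18 lookups, and for a 128MB extent it gives 32768.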
With the patch, we just use one call to map all refs to the extent at once:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5
The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references. Userspace can use this information to make
better choices to dedup or defrag.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
[ copy background and motivation from cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-23 01:58:46 +08:00
	}

2011-07-07 22:48:38 +08:00
	inodes = init_data_container(size);
	if (IS_ERR(inodes)) {
		ret = PTR_ERR(inodes);
2022-11-10 14:06:28 +08:00
		goto out_loi;
2011-07-07 22:48:38 +08:00
	}

2022-11-10 14:06:28 +08:00
	path = btrfs_alloc_path();
	if (!path) {
		ret = -ENOMEM;
		goto out;
	}
2016-06-23 06:54:24 +08:00
	ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
2022-06-07 01:32:59 +08:00
					  inodes, ignore_offset);
2022-11-10 14:06:28 +08:00
	btrfs_free_path(path);
2012-09-08 10:01:29 +08:00
	if (ret == -EINVAL)
2011-07-07 22:48:38 +08:00
		ret = -ENOENT;
	if (ret < 0)
		goto out;

2017-08-23 14:46:05 +08:00
	ret = copy_to_user((void __user *)(unsigned long)loi->inodes, inodes,
			   size);
2011-07-07 22:48:38 +08:00
	if (ret)
		ret = -EFAULT;

out:
2017-06-01 01:32:09 +08:00
	kvfree(inodes);
2017-09-23 01:58:46 +08:00
out_loi:
2011-07-07 22:48:38 +08:00
	kfree(loi);

	return ret;
}

2018-03-21 09:05:27 +08:00
void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
2012-01-17 04:04:47 +08:00
				     struct btrfs_ioctl_balance_args *bargs)
{
	struct btrfs_balance_control *bctl = fs_info->balance_ctl;

	bargs->flags = bctl->flags;

2018-03-21 08:31:04 +08:00
	if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags))
2012-01-17 04:04:49 +08:00
		bargs->state |= BTRFS_BALANCE_STATE_RUNNING;
	if (atomic_read(&fs_info->balance_pause_req))
		bargs->state |= BTRFS_BALANCE_STATE_PAUSE_REQ;
2012-01-17 04:04:49 +08:00
	if (atomic_read(&fs_info->balance_cancel_req))
		bargs->state |= BTRFS_BALANCE_STATE_CANCEL_REQ;
2012-01-17 04:04:49 +08:00

2012-01-17 04:04:47 +08:00
	memcpy(&bargs->data, &bctl->data, sizeof(bargs->data));
	memcpy(&bargs->meta, &bctl->meta, sizeof(bargs->meta));
	memcpy(&bargs->sys, &bctl->sys, sizeof(bargs->sys));
2012-01-17 04:04:49 +08:00

2018-03-21 09:05:27 +08:00
	spin_lock(&fs_info->balance_lock);
	memcpy(&bargs->stat, &bctl->stat, sizeof(bargs->stat));
	spin_unlock(&fs_info->balance_lock);
2012-01-17 04:04:47 +08:00
}

2022-05-03 16:36:36 +08:00
/**
 * Try to acquire fs_info::balance_mutex as well as set BTRFS_EXCLOP_BALANCE as
 * required.
 *
 * @fs_info:       the filesystem
 * @excl_acquired: ptr to boolean value which is set to false in case balance
 *                 is being resumed
 *
 * Return 0 on success, in which case fs_info::balance_mutex is held and
 * exclusive ops are blocked. In case of failure return an error code.
 */
static int btrfs_try_lock_balance(struct btrfs_fs_info *fs_info, bool *excl_acquired)
{
	int ret;

	/*
	 * Exclusive operation is locked. Three possibilities:
	 *   (1) some other op is running
	 *   (2) balance is running
	 *   (3) balance is paused -- special case (think resume)
	 */
	while (1) {
		if (btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
			*excl_acquired = true;
			mutex_lock(&fs_info->balance_mutex);
			return 0;
		}

		mutex_lock(&fs_info->balance_mutex);
		if (fs_info->balance_ctl) {
			/* This is either (2) or (3) */
			if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
				/* This is (2) */
				ret = -EINPROGRESS;
				goto out_failure;

			} else {
				mutex_unlock(&fs_info->balance_mutex);
				/*
				 * Lock released to allow other waiters to
				 * continue, we'll reexamine the status again.
				 */
				mutex_lock(&fs_info->balance_mutex);

				if (fs_info->balance_ctl &&
				    !test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
					/* This is (3) */
					*excl_acquired = false;
					return 0;
				}
			}
		} else {
			/* This is (1) */
			ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
			goto out_failure;
		}

		mutex_unlock(&fs_info->balance_mutex);
	}

out_failure:
	mutex_unlock(&fs_info->balance_mutex);
	*excl_acquired = false;
	return ret;
}
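The three cases handled by btrfs_try_lock_balance() can be summarised as a decision table. The sketch below is a single-threaded model over a snapshot of the state: all names are hypothetical, the error constant is a placeholder rather than the real uapi value, and the retry loop and mutex handling of the real function are deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder for BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS, not the real value. */
#define MODEL_ERR_EXCL_RUN_IN_PROGRESS 1000

/* Hypothetical snapshot of the state the real function inspects. */
struct model_balance_state {
	bool exclop_busy;	/* another exclusive operation holds the slot */
	bool balance_ctl;	/* a balance control structure exists */
	bool balance_running;	/* BTRFS_FS_BALANCE_RUNNING is set */
};

/* Return 0 when the caller may proceed, otherwise an error code. */
static int model_try_lock_balance(const struct model_balance_state *s,
				  bool *excl_acquired)
{
	if (!s->exclop_busy) {
		/* The exclusive op slot was free: we own it now. */
		*excl_acquired = true;
		return 0;
	}

	*excl_acquired = false;
	if (!s->balance_ctl)
		return MODEL_ERR_EXCL_RUN_IN_PROGRESS;	/* case (1) */
	if (s->balance_running)
		return -EINPROGRESS;			/* case (2) */
	return 0;					/* case (3): paused */
}
```

In case (3) the caller proceeds without owning the exclusive op, which is why excl_acquired is reported as false even though the return value is 0.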

2012-05-11 18:11:26 +08:00
static long btrfs_ioctl_balance(struct file *file, void __user *arg)
2012-01-17 04:04:47 +08:00
{
2013-01-24 06:07:38 +08:00
	struct btrfs_root *root = BTRFS_I(file_inode(file))->root;
2012-01-17 04:04:47 +08:00
	struct btrfs_fs_info *fs_info = root->fs_info;
	struct btrfs_ioctl_balance_args *bargs;
	struct btrfs_balance_control *bctl;
2022-05-05 15:08:25 +08:00
	bool need_unlock = true;
2012-01-17 04:04:47 +08:00
	int ret;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2012-06-29 17:58:48 +08:00
	ret = mnt_want_write_file(file);
2012-05-11 18:11:26 +08:00
	if (ret)
		return ret;

2022-03-30 17:14:07 +08:00
	bargs = memdup_user(arg, sizeof(*bargs));
	if (IS_ERR(bargs)) {
		ret = PTR_ERR(bargs);
		bargs = NULL;
		goto out;
	}

2022-05-05 15:08:25 +08:00
	ret = btrfs_try_lock_balance(fs_info, &need_unlock);
	if (ret)
2013-01-20 21:57:57 +08:00
		goto out;

2022-05-05 15:08:25 +08:00
	lockdep_assert_held(&fs_info->balance_mutex);

2022-03-30 17:14:06 +08:00
	if (bargs->flags & BTRFS_BALANCE_RESUME) {
		if (!fs_info->balance_ctl) {
			ret = -ENOTCONN;
2013-01-20 21:57:57 +08:00
			goto out_unlock;
2012-01-17 04:04:47 +08:00
		}
2012-01-17 04:04:49 +08:00

2022-03-30 17:14:06 +08:00
		bctl = fs_info->balance_ctl;
		spin_lock(&fs_info->balance_lock);
		bctl->flags |= BTRFS_BALANCE_RESUME;
		spin_unlock(&fs_info->balance_lock);
		btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE);
2012-01-17 04:04:49 +08:00

2022-03-30 17:14:06 +08:00
		goto do_balance;
2012-01-17 04:04:47 +08:00
	}
2012-01-17 04:04:49 +08:00

2022-03-30 17:14:07 +08:00
	if (bargs->flags & ~(BTRFS_BALANCE_ARGS_MASK | BTRFS_BALANCE_TYPE_MASK)) {
		ret = -EINVAL;
		goto out_unlock;
2012-01-17 04:04:47 +08:00
	}

2013-01-20 21:57:57 +08:00
	if (fs_info->balance_ctl) {
2012-01-17 04:04:49 +08:00
		ret = -EINPROGRESS;
2022-03-30 17:14:07 +08:00
		goto out_unlock;
2012-01-17 04:04:49 +08:00
	}

2015-11-04 22:38:29 +08:00
	bctl = kzalloc(sizeof(*bctl), GFP_KERNEL);
2012-01-17 04:04:47 +08:00
	if (!bctl) {
		ret = -ENOMEM;
2022-03-30 17:14:07 +08:00
		goto out_unlock;
2012-01-17 04:04:47 +08:00
	}

2022-03-30 17:14:06 +08:00
	memcpy(&bctl->data, &bargs->data, sizeof(bctl->data));
	memcpy(&bctl->meta, &bargs->meta, sizeof(bctl->meta));
	memcpy(&bctl->sys, &bargs->sys, sizeof(bctl->sys));
2015-10-12 22:55:54 +08:00

2022-03-30 17:14:06 +08:00
	bctl->flags = bargs->flags;
2012-01-17 04:04:49 +08:00
do_balance:
2012-01-17 04:04:47 +08:00
	/*
2020-08-25 23:02:32 +08:00
	 * Ownership of bctl and exclusive operation goes to btrfs_balance.
	 * bctl is freed in reset_balance_state, or, if restriper was paused
	 * all the way until unmount, in free_fs_info.  The flag should be
	 * cleared after reset_balance_state.
2012-01-17 04:04:47 +08:00
	 */
2013-01-20 21:57:57 +08:00
	need_unlock = false;

2018-05-07 23:44:03 +08:00
	ret = btrfs_balance(fs_info, bctl, bargs);
2015-10-21 06:50:06 +08:00
	bctl = NULL;
2013-01-20 21:57:57 +08:00

2022-03-30 17:14:06 +08:00
	if (ret == 0 || ret == -ECANCELED) {
2012-01-17 04:04:47 +08:00
		if (copy_to_user(arg, bargs, sizeof(*bargs)))
			ret = -EFAULT;
	}

2015-10-21 06:50:06 +08:00
	kfree(bctl);
2013-01-20 21:57:57 +08:00
out_unlock:
2012-01-17 04:04:47 +08:00
	mutex_unlock(&fs_info->balance_mutex);
2013-01-20 21:57:57 +08:00
	if (need_unlock)
2020-08-25 23:02:32 +08:00
		btrfs_exclop_finish(fs_info);
2013-01-20 21:57:57 +08:00
out:
2012-06-29 17:58:48 +08:00
	mnt_drop_write_file(file);
2022-03-30 17:14:07 +08:00
	kfree(bargs);
2012-01-17 04:04:47 +08:00
	return ret;
}

2016-06-23 06:54:24 +08:00
static long btrfs_ioctl_balance_ctl(struct btrfs_fs_info *fs_info, int cmd)
2012-01-17 04:04:49 +08:00
{
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	switch (cmd) {
	case BTRFS_BALANCE_CTL_PAUSE:
2016-06-23 06:54:23 +08:00
		return btrfs_pause_balance(fs_info);
2012-01-17 04:04:49 +08:00
	case BTRFS_BALANCE_CTL_CANCEL:
2016-06-23 06:54:23 +08:00
		return btrfs_cancel_balance(fs_info);
2012-01-17 04:04:49 +08:00
	}

	return -EINVAL;
}

2016-06-23 06:54:24 +08:00
static long btrfs_ioctl_balance_progress(struct btrfs_fs_info *fs_info,
2012-01-17 04:04:49 +08:00
					 void __user *arg)
{
	struct btrfs_ioctl_balance_args *bargs;
	int ret = 0;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	mutex_lock(&fs_info->balance_mutex);
	if (!fs_info->balance_ctl) {
		ret = -ENOTCONN;
		goto out;
	}

2015-11-04 22:38:29 +08:00
	bargs = kzalloc(sizeof(*bargs), GFP_KERNEL);
2012-01-17 04:04:49 +08:00
	if (!bargs) {
		ret = -ENOMEM;
		goto out;
	}

2018-03-21 09:05:27 +08:00
	btrfs_update_ioctl_balance_args(fs_info, bargs);
2012-01-17 04:04:49 +08:00

	if (copy_to_user(arg, bargs, sizeof(*bargs)))
		ret = -EFAULT;

	kfree(bargs);
out:
	mutex_unlock(&fs_info->balance_mutex);
	return ret;
}

2012-11-26 16:50:11 +08:00
static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
2011-09-14 21:53:51 +08:00
{
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
2011-09-14 21:53:51 +08:00
	struct btrfs_ioctl_quota_ctl_args *sa;
	int ret;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2012-11-26 16:50:11 +08:00
	ret = mnt_want_write_file(file);
	if (ret)
		return ret;
2011-09-14 21:53:51 +08:00

	sa = memdup_user(arg, sizeof(*sa));
2012-11-26 16:50:11 +08:00
	if (IS_ERR(sa)) {
		ret = PTR_ERR(sa);
		goto drop_write;
	}
2011-09-14 21:53:51 +08:00

2016-06-23 06:54:23 +08:00
	down_write(&fs_info->subvol_sem);
2011-09-14 21:53:51 +08:00

	switch (sa->cmd) {
	case BTRFS_QUOTA_CTL_ENABLE:
2018-07-05 19:50:48 +08:00
		ret = btrfs_quota_enable(fs_info);
2011-09-14 21:53:51 +08:00
		break;
	case BTRFS_QUOTA_CTL_DISABLE:
2018-07-05 19:50:48 +08:00
		ret = btrfs_quota_disable(fs_info);
2011-09-14 21:53:51 +08:00
		break;
	default:
		ret = -EINVAL;
		break;
	}

	kfree(sa);
2016-06-23 06:54:23 +08:00
	up_write(&fs_info->subvol_sem);
2012-11-26 16:50:11 +08:00
drop_write:
	mnt_drop_write_file(file);
2011-09-14 21:53:51 +08:00
	return ret;
}

2012-11-26 16:50:11 +08:00
static long btrfs_ioctl_qgroup_assign(struct file *file, void __user *arg)
2011-09-14 21:53:51 +08:00
{
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
	struct btrfs_root *root = BTRFS_I(inode)->root;
2011-09-14 21:53:51 +08:00
	struct btrfs_ioctl_qgroup_assign_args *sa;
	struct btrfs_trans_handle *trans;
	int ret;
	int err;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2012-11-26 16:50:11 +08:00
	ret = mnt_want_write_file(file);
	if (ret)
		return ret;
2011-09-14 21:53:51 +08:00

	sa = memdup_user(arg, sizeof(*sa));
2012-11-26 16:50:11 +08:00
	if (IS_ERR(sa)) {
		ret = PTR_ERR(sa);
		goto drop_write;
	}
2011-09-14 21:53:51 +08:00

	trans = btrfs_join_transaction(root);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		goto out;
	}

	if (sa->assign) {
2018-07-18 14:45:30 +08:00
		ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst);
2011-09-14 21:53:51 +08:00
	} else {
2018-07-18 14:45:32 +08:00
		ret = btrfs_del_qgroup_relation(trans, sa->src, sa->dst);
2011-09-14 21:53:51 +08:00
	}

2015-02-27 16:24:28 +08:00
	/* update qgroup status and info */
2018-07-18 14:45:40 +08:00
	err = btrfs_run_qgroups(trans);
2015-02-27 16:24:28 +08:00
	if (err < 0)
2016-06-23 06:54:23 +08:00
		btrfs_handle_fs_error(fs_info, err,
				      "failed to update qgroup status and info");
2016-09-10 09:39:03 +08:00
	err = btrfs_end_transaction(trans);
2011-09-14 21:53:51 +08:00
	if (err && !ret)
		ret = err;

out:
	kfree(sa);
2012-11-26 16:50:11 +08:00
drop_write:
	mnt_drop_write_file(file);
2011-09-14 21:53:51 +08:00
	return ret;
}

2012-11-26 16:50:11 +08:00
static long btrfs_ioctl_qgroup_create(struct file *file, void __user *arg)
2011-09-14 21:53:51 +08:00
{
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_root *root = BTRFS_I(inode)->root;
2011-09-14 21:53:51 +08:00
	struct btrfs_ioctl_qgroup_create_args *sa;
	struct btrfs_trans_handle *trans;
	int ret;
	int err;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2012-11-26 16:50:11 +08:00
	ret = mnt_want_write_file(file);
	if (ret)
		return ret;
2011-09-14 21:53:51 +08:00

	sa = memdup_user(arg, sizeof(*sa));
2012-11-26 16:50:11 +08:00
	if (IS_ERR(sa)) {
		ret = PTR_ERR(sa);
		goto drop_write;
	}
2011-09-14 21:53:51 +08:00

2012-11-15 19:35:41 +08:00
	if (!sa->qgroupid) {
		ret = -EINVAL;
		goto out;
	}

2011-09-14 21:53:51 +08:00
	trans = btrfs_join_transaction(root);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		goto out;
	}

	if (sa->create) {
2018-07-18 14:45:33 +08:00
		ret = btrfs_create_qgroup(trans, sa->qgroupid);
2011-09-14 21:53:51 +08:00
	} else {
2018-07-18 14:45:34 +08:00
		ret = btrfs_remove_qgroup(trans, sa->qgroupid);
2011-09-14 21:53:51 +08:00
	}

2016-09-10 09:39:03 +08:00
	err = btrfs_end_transaction(trans);
2011-09-14 21:53:51 +08:00
	if (err && !ret)
		ret = err;

out:
	kfree(sa);
2012-11-26 16:50:11 +08:00
drop_write:
	mnt_drop_write_file(file);
2011-09-14 21:53:51 +08:00
	return ret;
}

2012-11-26 16:50:11 +08:00
static long btrfs_ioctl_qgroup_limit(struct file *file, void __user *arg)
2011-09-14 21:53:51 +08:00
{
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_root *root = BTRFS_I(inode)->root;
2011-09-14 21:53:51 +08:00
	struct btrfs_ioctl_qgroup_limit_args *sa;
	struct btrfs_trans_handle *trans;
	int ret;
	int err;
	u64 qgroupid;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2012-11-26 16:50:11 +08:00
	ret = mnt_want_write_file(file);
	if (ret)
		return ret;
2011-09-14 21:53:51 +08:00

	sa = memdup_user(arg, sizeof(*sa));
2012-11-26 16:50:11 +08:00
	if (IS_ERR(sa)) {
		ret = PTR_ERR(sa);
		goto drop_write;
	}
2011-09-14 21:53:51 +08:00

	trans = btrfs_join_transaction(root);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		goto out;
	}

	qgroupid = sa->qgroupid;
	if (!qgroupid) {
		/* take the current subvol as qgroup */
		qgroupid = root->root_key.objectid;
	}

2018-07-18 14:45:35 +08:00
	ret = btrfs_limit_qgroup(trans, qgroupid, &sa->lim);
2011-09-14 21:53:51 +08:00

2016-09-10 09:39:03 +08:00
	err = btrfs_end_transaction(trans);
2011-09-14 21:53:51 +08:00
	if (err && !ret)
		ret = err;

out:
	kfree(sa);
2012-11-26 16:50:11 +08:00
drop_write:
	mnt_drop_write_file(file);
2011-09-14 21:53:51 +08:00
	return ret;
}

2013-04-26 00:04:51 +08:00
static long btrfs_ioctl_quota_rescan(struct file *file, void __user *arg)
{
2016-06-23 06:54:23 +08:00
	struct inode *inode = file_inode(file);
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
2013-04-26 00:04:51 +08:00
	struct btrfs_ioctl_quota_rescan_args *qsa;
	int ret;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	ret = mnt_want_write_file(file);
	if (ret)
		return ret;

	qsa = memdup_user(arg, sizeof(*qsa));
	if (IS_ERR(qsa)) {
		ret = PTR_ERR(qsa);
		goto drop_write;
	}

	if (qsa->flags) {
		ret = -EINVAL;
		goto out;
	}

2016-06-23 06:54:23 +08:00
	ret = btrfs_qgroup_rescan(fs_info);
2013-04-26 00:04:51 +08:00

out:
	kfree(qsa);
drop_write:
	mnt_drop_write_file(file);
	return ret;
}

2019-10-11 08:23:11 +08:00
static long btrfs_ioctl_quota_rescan_status(struct btrfs_fs_info *fs_info,
						void __user *arg)
2013-04-26 00:04:51 +08:00
{
2021-07-28 05:17:29 +08:00
	struct btrfs_ioctl_quota_rescan_args qsa = {0};
2013-04-26 00:04:51 +08:00

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2016-06-23 06:54:23 +08:00
	if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_RESCAN) {
2021-07-28 05:17:29 +08:00
		qsa.flags = 1;
		qsa.progress = fs_info->qgroup_rescan_progress.objectid;
2013-04-26 00:04:51 +08:00
	}

2021-07-28 05:17:29 +08:00
	if (copy_to_user(arg, &qsa, sizeof(qsa)))
2021-07-28 14:20:41 +08:00
		return -EFAULT;
2013-04-26 00:04:51 +08:00

2021-07-28 14:20:41 +08:00
	return 0;
2013-04-26 00:04:51 +08:00
}

2019-10-11 08:23:11 +08:00
static long btrfs_ioctl_quota_rescan_wait(struct btrfs_fs_info *fs_info,
						void __user *arg)
2013-05-07 03:14:17 +08:00
{
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

2016-06-23 06:54:23 +08:00
	return btrfs_qgroup_wait_for_completion(fs_info, true);
2013-05-07 03:14:17 +08:00
}

2014-01-31 04:17:00 +08:00
static long _btrfs_ioctl_set_received_subvol(struct file *file,
2021-07-27 18:48:55 +08:00
					    struct user_namespace *mnt_userns,
2014-01-31 04:17:00 +08:00
					    struct btrfs_ioctl_received_subvol_args *sa)
2012-07-25 23:35:53 +08:00
{
2013-01-24 06:07:38 +08:00
	struct inode *inode = file_inode(file);
2016-06-23 06:54:23 +08:00
	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
2012-07-25 23:35:53 +08:00
	struct btrfs_root *root = BTRFS_I(inode)->root;
	struct btrfs_root_item *root_item = &root->root_item;
	struct btrfs_trans_handle *trans;
vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.
The change was made with the help of the following Coccinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.
The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.
virtual patch
@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
current_time ( ... )
{
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
...
- return timespec_trunc(
+ return timespec64_trunc(
... );
}
@ depends on patch @
identifier xtime;
@@
struct \( iattr \| inode \| kstat \) {
...
- struct timespec xtime;
+ struct timespec64 xtime;
...
}
@ depends on patch @
identifier t;
@@
struct inode_operations {
...
int (*update_time) (...,
- struct timespec t,
+ struct timespec64 t,
...);
...
}
@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
...) { ... }
@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
) { ... }
@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)
<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)
@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}
@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}
@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
|
node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
stat->xtime = node2->i_xtime1;
|
stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-05-09 10:36:02 +08:00
|
|
|
struct timespec64 ct = current_time(inode);
|
2012-07-25 23:35:53 +08:00
|
|
|
int ret = 0;
|
2013-08-15 23:11:20 +08:00
|
|
|
int received_uuid_changed;
|
2012-07-25 23:35:53 +08:00
|
|
|
|
2021-07-27 18:48:55 +08:00
|
|
|
if (!inode_owner_or_capable(mnt_userns, inode))
|
2014-01-16 22:50:22 +08:00
|
|
|
return -EPERM;
|
|
|
|
|
2012-07-25 23:35:53 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
down_write(&fs_info->subvol_sem);
|
2012-07-25 23:35:53 +08:00
|
|
|
|
2017-01-11 02:35:31 +08:00
|
|
|
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
|
2012-07-25 23:35:53 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (btrfs_root_readonly(root)) {
|
|
|
|
ret = -EROFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2013-08-15 23:11:20 +08:00
|
|
|
/*
|
|
|
|
* 1 - root item
|
|
|
|
* 2 - uuid items (received uuid + subvol uuid)
|
|
|
|
*/
|
|
|
|
trans = btrfs_start_transaction(root, 3);
|
2012-07-25 23:35:53 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
trans = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
sa->rtransid = trans->transid;
|
|
|
|
sa->rtime.sec = ct.tv_sec;
|
|
|
|
sa->rtime.nsec = ct.tv_nsec;
|
|
|
|
|
2013-08-15 23:11:20 +08:00
|
|
|
received_uuid_changed = memcmp(root_item->received_uuid, sa->uuid,
|
|
|
|
BTRFS_UUID_SIZE);
|
|
|
|
if (received_uuid_changed &&
|
2018-03-12 20:48:09 +08:00
|
|
|
!btrfs_is_empty_uuid(root_item->received_uuid)) {
|
2018-05-29 15:01:54 +08:00
|
|
|
ret = btrfs_uuid_tree_remove(trans, root_item->received_uuid,
|
2018-03-12 20:48:09 +08:00
|
|
|
BTRFS_UUID_KEY_RECEIVED_SUBVOL,
|
|
|
|
root->root_key.objectid);
|
|
|
|
if (ret && ret != -ENOENT) {
|
|
|
|
btrfs_abort_transaction(trans, ret);
|
|
|
|
btrfs_end_transaction(trans);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2012-07-25 23:35:53 +08:00
|
|
|
memcpy(root_item->received_uuid, sa->uuid, BTRFS_UUID_SIZE);
|
|
|
|
btrfs_set_root_stransid(root_item, sa->stransid);
|
|
|
|
btrfs_set_root_rtransid(root_item, sa->rtransid);
|
2013-07-16 11:19:18 +08:00
|
|
|
btrfs_set_stack_timespec_sec(&root_item->stime, sa->stime.sec);
|
|
|
|
btrfs_set_stack_timespec_nsec(&root_item->stime, sa->stime.nsec);
|
|
|
|
btrfs_set_stack_timespec_sec(&root_item->rtime, sa->rtime.sec);
|
|
|
|
btrfs_set_stack_timespec_nsec(&root_item->rtime, sa->rtime.nsec);
|
2012-07-25 23:35:53 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_update_root(trans, fs_info->tree_root,
|
2012-07-25 23:35:53 +08:00
|
|
|
&root->root_key, &root->root_item);
|
|
|
|
if (ret < 0) {
|
2016-09-10 09:39:03 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2012-07-25 23:35:53 +08:00
|
|
|
goto out;
|
2013-08-15 23:11:20 +08:00
|
|
|
}
|
|
|
|
if (received_uuid_changed && !btrfs_is_empty_uuid(sa->uuid)) {
|
2018-05-29 15:01:53 +08:00
|
|
|
ret = btrfs_uuid_tree_add(trans, sa->uuid,
|
2013-08-15 23:11:20 +08:00
|
|
|
BTRFS_UUID_KEY_RECEIVED_SUBVOL,
|
|
|
|
root->root_key.objectid);
|
|
|
|
if (ret < 0 && ret != -EEXIST) {
|
2016-06-11 06:19:25 +08:00
|
|
|
btrfs_abort_transaction(trans, ret);
|
2017-09-28 16:45:26 +08:00
|
|
|
btrfs_end_transaction(trans);
|
2012-07-25 23:35:53 +08:00
|
|
|
goto out;
|
2013-08-15 23:11:20 +08:00
|
|
|
}
|
|
|
|
}
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2014-01-31 04:17:00 +08:00
|
|
|
out:
|
2016-06-23 06:54:23 +08:00
|
|
|
up_write(&fs_info->subvol_sem);
|
2014-01-31 04:17:00 +08:00
|
|
|
mnt_drop_write_file(file);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_64BIT
|
|
|
|
static long btrfs_ioctl_set_received_subvol_32(struct file *file,
|
|
|
|
void __user *arg)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_received_subvol_args_32 *args32 = NULL;
|
|
|
|
struct btrfs_ioctl_received_subvol_args *args64 = NULL;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
args32 = memdup_user(arg, sizeof(*args32));
|
2016-11-10 17:47:41 +08:00
|
|
|
if (IS_ERR(args32))
|
|
|
|
return PTR_ERR(args32);
|
2014-01-31 04:17:00 +08:00
|
|
|
|
2015-11-04 22:38:29 +08:00
|
|
|
args64 = kmalloc(sizeof(*args64), GFP_KERNEL);
|
2014-03-28 16:06:00 +08:00
|
|
|
if (!args64) {
|
|
|
|
ret = -ENOMEM;
|
2014-01-31 04:17:00 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(args64->uuid, args32->uuid, BTRFS_UUID_SIZE);
|
|
|
|
args64->stransid = args32->stransid;
|
|
|
|
args64->rtransid = args32->rtransid;
|
|
|
|
args64->stime.sec = args32->stime.sec;
|
|
|
|
args64->stime.nsec = args32->stime.nsec;
|
|
|
|
args64->rtime.sec = args32->rtime.sec;
|
|
|
|
args64->rtime.nsec = args32->rtime.nsec;
|
|
|
|
args64->flags = args32->flags;
|
|
|
|
|
2021-07-27 18:48:55 +08:00
|
|
|
ret = _btrfs_ioctl_set_received_subvol(file, file_mnt_user_ns(file), args64);
|
2014-01-31 04:17:00 +08:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
memcpy(args32->uuid, args64->uuid, BTRFS_UUID_SIZE);
|
|
|
|
args32->stransid = args64->stransid;
|
|
|
|
args32->rtransid = args64->rtransid;
|
|
|
|
args32->stime.sec = args64->stime.sec;
|
|
|
|
args32->stime.nsec = args64->stime.nsec;
|
|
|
|
args32->rtime.sec = args64->rtime.sec;
|
|
|
|
args32->rtime.nsec = args64->rtime.nsec;
|
|
|
|
args32->flags = args64->flags;
|
|
|
|
|
|
|
|
ret = copy_to_user(arg, args32, sizeof(*args32));
|
|
|
|
if (ret)
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out:
|
|
|
|
kfree(args32);
|
|
|
|
kfree(args64);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static long btrfs_ioctl_set_received_subvol(struct file *file,
|
|
|
|
void __user *arg)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_received_subvol_args *sa = NULL;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
sa = memdup_user(arg, sizeof(*sa));
|
2016-11-10 17:47:41 +08:00
|
|
|
if (IS_ERR(sa))
|
|
|
|
return PTR_ERR(sa);
|
2014-01-31 04:17:00 +08:00
|
|
|
|
2021-07-27 18:48:55 +08:00
|
|
|
ret = _btrfs_ioctl_set_received_subvol(file, file_mnt_user_ns(file), sa);
|
2014-01-31 04:17:00 +08:00
|
|
|
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2012-07-25 23:35:53 +08:00
|
|
|
ret = copy_to_user(arg, sa, sizeof(*sa));
|
|
|
|
if (ret)
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out:
|
|
|
|
kfree(sa);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-10-11 08:23:11 +08:00
|
|
|
static int btrfs_ioctl_get_fslabel(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2013-01-05 10:48:01 +08:00
|
|
|
{
|
2013-07-19 17:39:32 +08:00
|
|
|
size_t len;
|
2013-01-05 10:48:01 +08:00
|
|
|
int ret;
|
2013-07-19 17:39:32 +08:00
|
|
|
char label[BTRFS_LABEL_SIZE];
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
spin_lock(&fs_info->super_lock);
|
|
|
|
memcpy(label, fs_info->super_copy->label, BTRFS_LABEL_SIZE);
|
|
|
|
spin_unlock(&fs_info->super_lock);
|
2013-07-19 17:39:32 +08:00
|
|
|
|
|
|
|
len = strnlen(label, BTRFS_LABEL_SIZE);
|
2013-01-05 10:48:01 +08:00
|
|
|
|
|
|
|
if (len == BTRFS_LABEL_SIZE) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"label is too long, return the first %zu bytes",
|
|
|
|
--len);
|
2013-01-05 10:48:01 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
ret = copy_to_user(arg, label, len);
|
|
|
|
|
|
|
|
return ret ? -EFAULT : 0;
|
|
|
|
}
|
|
|
|
|
2013-01-05 10:48:08 +08:00
|
|
|
static int btrfs_ioctl_set_fslabel(struct file *file, void __user *arg)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_super_block *super_block = fs_info->super_copy;
|
2013-01-05 10:48:08 +08:00
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
char label[BTRFS_LABEL_SIZE];
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
if (copy_from_user(label, arg, sizeof(label)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (strnlen(label, BTRFS_LABEL_SIZE) == BTRFS_LABEL_SIZE) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_err(fs_info,
|
2016-09-20 22:05:00 +08:00
|
|
|
"unable to set label with more than %d bytes",
|
|
|
|
BTRFS_LABEL_SIZE - 1);
|
2013-01-05 10:48:08 +08:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
trans = btrfs_start_transaction(root, 0);
|
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
spin_lock(&fs_info->super_lock);
|
2013-01-05 10:48:08 +08:00
|
|
|
strcpy(super_block->label, label);
|
2016-06-23 06:54:23 +08:00
|
|
|
spin_unlock(&fs_info->super_lock);
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2013-01-05 10:48:08 +08:00
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
mnt_drop_write_file(file);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-11-16 04:33:55 +08:00
|
|
|
#define INIT_FEATURE_FLAGS(suffix) \
|
|
|
|
{ .compat_flags = BTRFS_FEATURE_COMPAT_##suffix, \
|
|
|
|
.compat_ro_flags = BTRFS_FEATURE_COMPAT_RO_##suffix, \
|
|
|
|
.incompat_flags = BTRFS_FEATURE_INCOMPAT_##suffix }
|
|
|
|
|
2016-02-17 22:26:27 +08:00
|
|
|
int btrfs_ioctl_get_supported_features(void __user *arg)
|
2013-11-16 04:33:55 +08:00
|
|
|
{
|
2015-11-19 18:42:31 +08:00
|
|
|
static const struct btrfs_ioctl_feature_flags features[3] = {
|
2013-11-16 04:33:55 +08:00
|
|
|
INIT_FEATURE_FLAGS(SUPP),
|
|
|
|
INIT_FEATURE_FLAGS(SAFE_SET),
|
|
|
|
INIT_FEATURE_FLAGS(SAFE_CLEAR)
|
|
|
|
};
|
|
|
|
|
|
|
|
if (copy_to_user(arg, &features, sizeof(features)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-10-11 08:23:11 +08:00
|
|
|
static int btrfs_ioctl_get_features(struct btrfs_fs_info *fs_info,
|
|
|
|
void __user *arg)
|
2013-11-16 04:33:55 +08:00
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct btrfs_super_block *super_block = fs_info->super_copy;
|
2013-11-16 04:33:55 +08:00
|
|
|
struct btrfs_ioctl_feature_flags features;
|
|
|
|
|
|
|
|
features.compat_flags = btrfs_super_compat_flags(super_block);
|
|
|
|
features.compat_ro_flags = btrfs_super_compat_ro_flags(super_block);
|
|
|
|
features.incompat_flags = btrfs_super_incompat_flags(super_block);
|
|
|
|
|
|
|
|
if (copy_to_user(arg, &features, sizeof(features)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
static int check_feature_bits(struct btrfs_fs_info *fs_info,
|
2013-11-02 01:07:02 +08:00
|
|
|
enum btrfs_feature_set set,
|
2013-11-16 04:33:55 +08:00
|
|
|
u64 change_mask, u64 flags, u64 supported_flags,
|
|
|
|
u64 safe_set, u64 safe_clear)
|
|
|
|
{
|
2019-08-02 01:07:55 +08:00
|
|
|
const char *type = btrfs_feature_set_name(set);
|
2013-11-02 01:07:02 +08:00
|
|
|
char *names;
|
2013-11-16 04:33:55 +08:00
|
|
|
u64 disallowed, unsupported;
|
|
|
|
u64 set_mask = flags & change_mask;
|
|
|
|
u64 clear_mask = ~flags & change_mask;
|
|
|
|
|
|
|
|
unsupported = set_mask & ~supported_flags;
|
|
|
|
if (unsupported) {
|
2013-11-02 01:07:02 +08:00
|
|
|
names = btrfs_printable_features(set, unsupported);
|
|
|
|
if (names) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"this kernel does not support the %s feature bit%s",
|
|
|
|
names, strchr(names, ',') ? "s" : "");
|
2013-11-02 01:07:02 +08:00
|
|
|
kfree(names);
|
|
|
|
} else
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"this kernel does not support %s bits 0x%llx",
|
|
|
|
type, unsupported);
|
2013-11-16 04:33:55 +08:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
|
|
|
|
disallowed = set_mask & ~safe_set;
|
|
|
|
if (disallowed) {
|
2013-11-02 01:07:02 +08:00
|
|
|
names = btrfs_printable_features(set, disallowed);
|
|
|
|
if (names) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"can't set the %s feature bit%s while mounted",
|
|
|
|
names, strchr(names, ',') ? "s" : "");
|
2013-11-02 01:07:02 +08:00
|
|
|
kfree(names);
|
|
|
|
} else
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"can't set %s bits 0x%llx while mounted",
|
|
|
|
type, disallowed);
|
2013-11-16 04:33:55 +08:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
|
|
|
disallowed = clear_mask & ~safe_clear;
|
|
|
|
if (disallowed) {
|
2013-11-02 01:07:02 +08:00
|
|
|
names = btrfs_printable_features(set, disallowed);
|
|
|
|
if (names) {
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"can't clear the %s feature bit%s while mounted",
|
|
|
|
names, strchr(names, ',') ? "s" : "");
|
2013-11-02 01:07:02 +08:00
|
|
|
kfree(names);
|
|
|
|
} else
|
2016-06-23 06:54:23 +08:00
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"can't clear %s bits 0x%llx while mounted",
|
|
|
|
type, disallowed);
|
2013-11-16 04:33:55 +08:00
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
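The three checks in check_feature_bits() reduce to a few mask operations. The following is a minimal standalone sketch in userspace C, assuming the same order of tests and the same error codes as the function above and leaving out the warning messages. The demo_* names are illustrative only and are not part of btrfs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Simplified model of check_feature_bits(): no logging, same order of
 * tests. All names here are hypothetical stand-ins.
 */
static int demo_check_feature_bits(uint64_t change_mask, uint64_t flags,
				   uint64_t supported, uint64_t safe_set,
				   uint64_t safe_clear)
{
	uint64_t set_mask = flags & change_mask;    /* bits being turned on */
	uint64_t clear_mask = ~flags & change_mask; /* bits being turned off */

	if (set_mask & ~supported)
		return -EOPNOTSUPP; /* bit unknown to this kernel */
	if (set_mask & ~safe_set)
		return -EPERM;      /* known, but not settable while mounted */
	if (clear_mask & ~safe_clear)
		return -EPERM;      /* not clearable while mounted */
	return 0;
}

/* Example values chosen only for illustration. */
void demo_usage(void)
{
	const uint64_t supported = 0x7, safe_set = 0x3, safe_clear = 0x1;

	/* Setting bit 3, which is outside the supported set. */
	assert(demo_check_feature_bits(0x8, 0x8, supported, safe_set,
				       safe_clear) == -EOPNOTSUPP);
	/* Clearing bit 0 is allowed by safe_clear. */
	assert(demo_check_feature_bits(0x1, 0x0, supported, safe_set,
				       safe_clear) == 0);
}
```

Bits outside change_mask never reach either mask, so they cannot cause a failure regardless of their value in flags.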
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
#define check_feature(fs_info, change_mask, flags, mask_base) \
|
|
|
|
check_feature_bits(fs_info, FEAT_##mask_base, change_mask, flags, \
|
2013-11-16 04:33:55 +08:00
|
|
|
BTRFS_FEATURE_ ## mask_base ## _SUPP, \
|
|
|
|
BTRFS_FEATURE_ ## mask_base ## _SAFE_SET, \
|
|
|
|
BTRFS_FEATURE_ ## mask_base ## _SAFE_CLEAR)
|
|
|
|
|
|
|
|
static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
|
|
|
struct btrfs_super_block *super_block = fs_info->super_copy;
|
2013-11-16 04:33:55 +08:00
|
|
|
struct btrfs_ioctl_feature_flags flags[2];
|
|
|
|
struct btrfs_trans_handle *trans;
|
|
|
|
u64 newflags;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
if (copy_from_user(flags, arg, sizeof(flags)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/* Nothing to do */
|
|
|
|
if (!flags[0].compat_flags && !flags[0].compat_ro_flags &&
|
|
|
|
!flags[0].incompat_flags)
|
|
|
|
return 0;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = check_feature(fs_info, flags[0].compat_flags,
|
2013-11-16 04:33:55 +08:00
|
|
|
flags[1].compat_flags, COMPAT);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = check_feature(fs_info, flags[0].compat_ro_flags,
|
2013-11-16 04:33:55 +08:00
|
|
|
flags[1].compat_ro_flags, COMPAT_RO);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2016-06-23 06:54:24 +08:00
|
|
|
ret = check_feature(fs_info, flags[0].incompat_flags,
|
2013-11-16 04:33:55 +08:00
|
|
|
flags[1].incompat_flags, INCOMPAT);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2016-05-04 17:32:00 +08:00
|
|
|
ret = mnt_want_write_file(file);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2014-02-07 21:34:04 +08:00
|
|
|
trans = btrfs_start_transaction(root, 0);
|
2016-05-04 17:32:00 +08:00
|
|
|
if (IS_ERR(trans)) {
|
|
|
|
ret = PTR_ERR(trans);
|
|
|
|
goto out_drop_write;
|
|
|
|
}
|
2013-11-16 04:33:55 +08:00
|
|
|
|
2016-06-23 06:54:23 +08:00
|
|
|
spin_lock(&fs_info->super_lock);
|
2013-11-16 04:33:55 +08:00
|
|
|
newflags = btrfs_super_compat_flags(super_block);
|
|
|
|
newflags |= flags[0].compat_flags & flags[1].compat_flags;
|
|
|
|
newflags &= ~(flags[0].compat_flags & ~flags[1].compat_flags);
|
|
|
|
btrfs_set_super_compat_flags(super_block, newflags);
|
|
|
|
|
|
|
|
newflags = btrfs_super_compat_ro_flags(super_block);
|
|
|
|
newflags |= flags[0].compat_ro_flags & flags[1].compat_ro_flags;
|
|
|
|
newflags &= ~(flags[0].compat_ro_flags & ~flags[1].compat_ro_flags);
|
|
|
|
btrfs_set_super_compat_ro_flags(super_block, newflags);
|
|
|
|
|
|
|
|
newflags = btrfs_super_incompat_flags(super_block);
|
|
|
|
newflags |= flags[0].incompat_flags & flags[1].incompat_flags;
|
|
|
|
newflags &= ~(flags[0].incompat_flags & ~flags[1].incompat_flags);
|
|
|
|
btrfs_set_super_incompat_flags(super_block, newflags);
|
2016-06-23 06:54:23 +08:00
|
|
|
spin_unlock(&fs_info->super_lock);
|
2013-11-16 04:33:55 +08:00
|
|
|
|
2016-09-10 09:39:03 +08:00
|
|
|
ret = btrfs_commit_transaction(trans);
|
2016-05-04 17:32:00 +08:00
|
|
|
out_drop_write:
|
|
|
|
mnt_drop_write_file(file);
|
|
|
|
|
|
|
|
return ret;
|
2013-11-16 04:33:55 +08:00
|
|
|
}
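In btrfs_ioctl_set_features() above, flags[0] acts as a change mask and flags[1] carries the desired values for the masked bits. The same arithmetic is shown below as a standalone sketch; demo_apply_feature_change() is a hypothetical helper written for illustration, not a btrfs function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Apply a masked update: bits selected by mask take their value from
 * want, all other bits of cur are preserved.
 */
static uint64_t demo_apply_feature_change(uint64_t cur, uint64_t mask,
					  uint64_t want)
{
	cur |= mask & want;     /* turn on the masked bits that are wanted */
	cur &= ~(mask & ~want); /* turn off the masked bits that are not */
	return cur;
}

/* Example values chosen only for illustration. */
void demo_usage(void)
{
	/* cur = 1010b, mask = 0110b, want = 0100b gives 1100b */
	assert(demo_apply_feature_change(0xA, 0x6, 0x4) == 0xC);
	/* An empty mask changes nothing. */
	assert(demo_apply_feature_change(0xA, 0x0, 0xF) == 0xA);
}
```

Doing the set and clear steps separately keeps bits outside the mask untouched even when want has them set.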
|
|
|
|
|
2022-01-16 10:48:47 +08:00
|
|
|
static int _btrfs_ioctl_send(struct inode *inode, void __user *argp, bool compat)
|
2017-09-27 22:43:13 +08:00
|
|
|
{
|
|
|
|
struct btrfs_ioctl_send_args *arg;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (compat) {
|
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
|
|
|
|
struct btrfs_ioctl_send_args_32 args32;
|
|
|
|
|
|
|
|
ret = copy_from_user(&args32, argp, sizeof(args32));
|
|
|
|
if (ret)
|
|
|
|
return -EFAULT;
|
|
|
|
arg = kzalloc(sizeof(*arg), GFP_KERNEL);
|
|
|
|
if (!arg)
|
|
|
|
return -ENOMEM;
|
|
|
|
arg->send_fd = args32.send_fd;
|
|
|
|
arg->clone_sources_count = args32.clone_sources_count;
|
|
|
|
arg->clone_sources = compat_ptr(args32.clone_sources);
|
|
|
|
arg->parent_root = args32.parent_root;
|
|
|
|
arg->flags = args32.flags;
|
|
|
|
memcpy(arg->reserved, args32.reserved,
|
|
|
|
sizeof(args32.reserved));
|
|
|
|
#else
|
|
|
|
return -ENOTTY;
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
arg = memdup_user(argp, sizeof(*arg));
|
|
|
|
if (IS_ERR(arg))
|
|
|
|
return PTR_ERR(arg);
|
|
|
|
}
|
2022-01-16 10:48:47 +08:00
|
|
|
ret = btrfs_ioctl_send(inode, arg);
|
2017-09-27 22:43:13 +08:00
|
|
|
kfree(arg);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite         btrfs_encoded_read
---------------------------------------------------
(enter)                    (enter)
                           btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
                           lock_extent_bits
                           read extent (dirty page hasn't been flushed,
                                        so this is the old data)
                           unlock_extent_cached
                           (exit)
In that interleaving we read the old data from before the page was
dirtied. But that is true even if we were to hold the mmap lock:
btrfs_page_mkwrite               btrfs_encoded_read
-------------------------------------------------------------------
(enter)                          (enter)
                                 btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
                                 btrfs_wait_ordered_range
                                 lock_extent_bits
                                 read extent (page hasn't been dirtied,
                                              so this is the old data)
                                 unlock_extent_cached
                                 btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
|
|
|
static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
|
|
|
|
bool compat)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_encoded_io_args args = { 0 };
|
|
|
|
size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
|
|
|
|
flags);
|
|
|
|
size_t copy_end;
|
|
|
|
struct iovec iovstack[UIO_FASTIOV];
|
|
|
|
struct iovec *iov = iovstack;
|
|
|
|
struct iov_iter iter;
|
|
|
|
loff_t pos;
|
|
|
|
struct kiocb kiocb;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (compat) {
|
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
|
|
|
|
struct btrfs_ioctl_encoded_io_args_32 args32;
|
|
|
|
|
|
|
|
copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
|
|
|
|
flags);
|
|
|
|
if (copy_from_user(&args32, argp, copy_end)) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
args.iov = compat_ptr(args32.iov);
|
|
|
|
args.iovcnt = args32.iovcnt;
|
|
|
|
args.offset = args32.offset;
|
|
|
|
args.flags = args32.flags;
|
|
|
|
#else
|
|
|
|
return -ENOTTY;
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
copy_end = copy_end_kernel;
|
|
|
|
if (copy_from_user(&args, argp, copy_end)) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (args.flags != 0) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = import_iovec(READ, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
|
|
|
|
&iov, &iter);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_acct;
|
|
|
|
|
|
|
|
if (iov_iter_count(&iter) == 0) {
|
|
|
|
ret = 0;
|
|
|
|
goto out_iov;
|
|
|
|
}
|
|
|
|
pos = args.offset;
|
|
|
|
ret = rw_verify_area(READ, file, &pos, args.len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_iov;
|
|
|
|
|
|
|
|
init_sync_kiocb(&kiocb, file);
|
|
|
|
kiocb.ki_pos = pos;
|
|
|
|
|
|
|
|
ret = btrfs_encoded_read(&kiocb, &iter, &args);
|
|
|
|
if (ret >= 0) {
|
|
|
|
fsnotify_access(file);
|
|
|
|
if (copy_to_user(argp + copy_end,
|
|
|
|
(char *)&args + copy_end_kernel,
|
|
|
|
sizeof(args) - copy_end_kernel))
|
|
|
|
ret = -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
out_iov:
|
|
|
|
kfree(iov);
|
|
|
|
out_acct:
|
|
|
|
if (ret > 0)
|
|
|
|
add_rchar(current, ret);
|
|
|
|
inc_syscr(current);
|
|
|
|
return ret;
|
|
|
|
}
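btrfs_ioctl_encoded_read() above copies in only the leading input fields (everything up to and including flags) and, on success, copies back only the fields that follow. The sketch below models that split in plain userspace C, with memcpy() standing in for copy_from_user() and copy_to_user(). The struct layout, the OFFSETOFEND macro and the demo_* names are simplified assumptions made for illustration; they are not the real btrfs_ioctl_encoded_io_args definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical layout: input fields first, output fields last. */
struct demo_io_args {
	uint64_t offset;      /* input */
	uint64_t flags;       /* input, last input field */
	uint64_t len;         /* output */
	uint32_t compression; /* output */
	uint32_t pad;
};

/* Copy in only the input prefix; the output tail starts out zeroed. */
static void demo_copy_in(struct demo_io_args *k, const void *user)
{
	memset(k, 0, sizeof(*k));
	memcpy(k, user, OFFSETOFEND(struct demo_io_args, flags));
}

/* Copy out only the output tail; the caller's inputs are not rewritten. */
static void demo_copy_out(void *user, const struct demo_io_args *k)
{
	size_t start = OFFSETOFEND(struct demo_io_args, flags);

	memcpy((char *)user + start, (const char *)k + start,
	       sizeof(*k) - start);
}

/* Example values chosen only for illustration. */
void demo_usage(void)
{
	struct demo_io_args user = { .offset = 4096, .len = 999 };
	struct demo_io_args k;

	demo_copy_in(&k, &user);
	assert(k.offset == 4096 && k.len == 0); /* stale output ignored */

	k.len = 123;
	demo_copy_out(&user, &k);
	assert(user.offset == 4096 && user.len == 123);
}
```

Keeping two end offsets, as the real function does with copy_end and copy_end_kernel, lets the 32-bit compat layout and the native layout share the same copy-out call even though their input prefixes differ in size.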
|
|
|
|
|
2019-08-14 07:00:02 +08:00
|
|
|
static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool compat)
|
|
|
|
{
|
|
|
|
struct btrfs_ioctl_encoded_io_args args;
|
|
|
|
struct iovec iovstack[UIO_FASTIOV];
|
|
|
|
struct iovec *iov = iovstack;
|
|
|
|
struct iov_iter iter;
|
|
|
|
loff_t pos;
|
|
|
|
struct kiocb kiocb;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
if (!capable(CAP_SYS_ADMIN)) {
|
|
|
|
ret = -EPERM;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(file->f_mode & FMODE_WRITE)) {
|
|
|
|
ret = -EBADF;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (compat) {
|
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
|
|
|
|
struct btrfs_ioctl_encoded_io_args_32 args32;
|
|
|
|
|
|
|
|
if (copy_from_user(&args32, argp, sizeof(args32))) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
args.iov = compat_ptr(args32.iov);
|
|
|
|
args.iovcnt = args32.iovcnt;
|
|
|
|
args.offset = args32.offset;
|
|
|
|
args.flags = args32.flags;
|
|
|
|
args.len = args32.len;
|
|
|
|
args.unencoded_len = args32.unencoded_len;
|
|
|
|
args.unencoded_offset = args32.unencoded_offset;
|
|
|
|
args.compression = args32.compression;
|
|
|
|
args.encryption = args32.encryption;
|
|
|
|
memcpy(args.reserved, args32.reserved, sizeof(args.reserved));
|
|
|
|
#else
|
|
|
|
return -ENOTTY;
|
|
|
|
#endif
|
|
|
|
} else {
|
|
|
|
if (copy_from_user(&args, argp, sizeof(args))) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto out_acct;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (args.flags != 0)
|
|
|
|
goto out_acct;
|
|
|
|
if (memchr_inv(args.reserved, 0, sizeof(args.reserved)))
|
|
|
|
goto out_acct;
|
|
|
|
if (args.compression == BTRFS_ENCODED_IO_COMPRESSION_NONE &&
|
|
|
|
args.encryption == BTRFS_ENCODED_IO_ENCRYPTION_NONE)
|
|
|
|
goto out_acct;
|
|
|
|
if (args.compression >= BTRFS_ENCODED_IO_COMPRESSION_TYPES ||
|
|
|
|
args.encryption >= BTRFS_ENCODED_IO_ENCRYPTION_TYPES)
|
|
|
|
goto out_acct;
|
|
|
|
if (args.unencoded_offset > args.unencoded_len)
|
|
|
|
goto out_acct;
|
|
|
|
if (args.len > args.unencoded_len - args.unencoded_offset)
|
|
|
|
goto out_acct;
|
|
|
|
|
|
|
|
ret = import_iovec(WRITE, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
|
|
|
|
&iov, &iter);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_acct;
|
|
|
|
|
|
|
|
file_start_write(file);
|
|
|
|
|
|
|
|
if (iov_iter_count(&iter) == 0) {
|
|
|
|
ret = 0;
|
|
|
|
goto out_end_write;
|
|
|
|
}
|
|
|
|
pos = args.offset;
|
|
|
|
ret = rw_verify_area(WRITE, file, &pos, args.len);
|
|
|
|
if (ret < 0)
|
|
|
|
goto out_end_write;
|
|
|
|
|
|
|
|
init_sync_kiocb(&kiocb, file);
|
|
|
|
ret = kiocb_set_rw_flags(&kiocb, 0);
|
|
|
|
if (ret)
|
|
|
|
goto out_end_write;
|
|
|
|
kiocb.ki_pos = pos;
|
|
|
|
|
|
|
|
ret = btrfs_do_write_iter(&kiocb, &iter, &args);
|
|
|
|
if (ret > 0)
|
|
|
|
fsnotify_modify(file);
|
|
|
|
|
|
|
|
out_end_write:
|
|
|
|
file_end_write(file);
|
|
|
|
kfree(iov);
|
|
|
|
out_acct:
|
|
|
|
if (ret > 0)
|
|
|
|
add_wchar(current, ret);
|
|
|
|
inc_syscw(current);
|
|
|
|
return ret;
|
|
|
|
}
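The two range checks in btrfs_ioctl_encoded_write() above require that the bytes placed in the file, args.len of them starting at args.unencoded_offset, fit inside the unencoded data. A minimal standalone sketch of that test follows; demo_encoded_range_ok() is a hypothetical helper, and the interpretation of the fields is an assumption based on their names and on the checks themselves.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if [unencoded_offset, unencoded_offset + len) lies within
 * [0, unencoded_len), 0 otherwise. The subtraction form mirrors the
 * checks above and cannot overflow, unlike unencoded_offset + len.
 */
static int demo_encoded_range_ok(uint64_t len, uint64_t unencoded_len,
				 uint64_t unencoded_offset)
{
	if (unencoded_offset > unencoded_len)
		return 0;
	if (len > unencoded_len - unencoded_offset)
		return 0;
	return 1;
}

/* Example values chosen only for illustration. */
void demo_usage(void)
{
	assert(demo_encoded_range_ok(100, 4096, 0) == 1);
	/* One byte too many once the offset is taken into account. */
	assert(demo_encoded_range_ok(4096, 4096, 1) == 0);
}
```

Checking the offset first guarantees that the subtraction in the second test does not wrap around.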
|
|
|
|
|
2008-06-12 09:53:53 +08:00
|
|
|
long btrfs_ioctl(struct file *file, unsigned int
|
|
|
|
cmd, unsigned long arg)
|
|
|
|
{
|
2016-06-23 06:54:23 +08:00
|
|
|
struct inode *inode = file_inode(file);
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
|
|
|
struct btrfs_root *root = BTRFS_I(inode)->root;
|
2008-12-02 19:36:08 +08:00
|
|
|
void __user *argp = (void __user *)arg;
|
2008-06-12 09:53:53 +08:00
|
|
|
|
|
|
|
switch (cmd) {
|
2009-04-17 16:37:41 +08:00
|
|
|
case FS_IOC_GETVERSION:
|
2022-01-05 16:30:06 +08:00
|
|
|
return btrfs_ioctl_getversion(inode, argp);
|
2019-07-18 01:39:20 +08:00
|
|
|
case FS_IOC_GETFSLABEL:
|
2019-10-11 08:23:11 +08:00
|
|
|
return btrfs_ioctl_get_fslabel(fs_info, argp);
|
2019-07-18 01:39:20 +08:00
|
|
|
case FS_IOC_SETFSLABEL:
|
|
|
|
return btrfs_ioctl_set_fslabel(file, argp);
|
2011-03-24 18:24:28 +08:00
|
|
|
case FITRIM:
|
2019-10-11 08:23:11 +08:00
|
|
|
return btrfs_ioctl_fitrim(fs_info, argp);
|
2008-06-12 09:53:53 +08:00
|
|
|
case BTRFS_IOC_SNAP_CREATE:
|
2010-12-20 15:53:28 +08:00
|
|
|
return btrfs_ioctl_snap_create(file, argp, 0);
|
2010-12-10 14:41:56 +08:00
|
|
|
case BTRFS_IOC_SNAP_CREATE_V2:
|
2010-12-20 15:53:28 +08:00
|
|
|
return btrfs_ioctl_snap_create_v2(file, argp, 0);
|
2008-11-18 10:02:50 +08:00
|
|
|
case BTRFS_IOC_SUBVOL_CREATE:
|
2010-12-20 15:53:28 +08:00
|
|
|
return btrfs_ioctl_snap_create(file, argp, 1);
|
2011-09-14 21:58:21 +08:00
|
|
|
case BTRFS_IOC_SUBVOL_CREATE_V2:
|
|
|
|
return btrfs_ioctl_snap_create_v2(file, argp, 1);
|
2009-09-22 04:00:26 +08:00
|
|
|
case BTRFS_IOC_SNAP_DESTROY:
|
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
    @subvol1/
    @subvol2/
    @subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, and thus no way to delete them. The current subvolume
delete ioctl takes a file handle as its argument, and if @subvol_default
is mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags, which allows deleting a subvolume by
its subvolid.
Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach the desired one using the subvolume id, and then
delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
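From userspace, deleting by id as the changelog above describes could look roughly like the sketch below. It assumes UAPI headers recent enough to define BTRFS_IOC_SNAP_DESTROY_V2, BTRFS_SUBVOL_SPEC_BY_ID and the subvolid member of struct btrfs_ioctl_vol_args_v2. destroy_subvol_by_id() is a hypothetical wrapper, and mount_fd is assumed to be an open descriptor on the mounted btrfs filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/*
 * Ask btrfs to delete the subvolume with the given id, using any open
 * descriptor on the mounted filesystem. Returns 0 or a negative errno.
 */
static int destroy_subvol_by_id(int mount_fd, unsigned long long subvolid)
{
	struct btrfs_ioctl_vol_args_v2 args;

	memset(&args, 0, sizeof(args));
	args.flags = BTRFS_SUBVOL_SPEC_BY_ID; /* use subvolid, not name */
	args.subvolid = subvolid;

	if (ioctl(mount_fd, BTRFS_IOC_SNAP_DESTROY_V2, &args) < 0)
		return -errno;
	return 0;
}

/* An invalid descriptor fails before btrfs is involved at all. */
void demo_usage(void)
{
	assert(destroy_subvol_by_id(-1, 256) == -EBADF);
}
```

The id 256 in the usage function is only a placeholder value; a real caller would pass an id obtained from the filesystem, for example from a subvolume listing.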
2020-02-07 21:05:46 +08:00
|
|
|
return btrfs_ioctl_snap_destroy(file, argp, false);
|
|
|
|
case BTRFS_IOC_SNAP_DESTROY_V2:
|
|
|
|
return btrfs_ioctl_snap_destroy(file, argp, true);
|
2010-12-20 16:30:25 +08:00
|
|
|
case BTRFS_IOC_SUBVOL_GETFLAGS:
|
2022-01-16 10:48:47 +08:00
|
|
|
return btrfs_ioctl_subvol_getflags(inode, argp);
|
2010-12-20 16:30:25 +08:00
|
|
|
case BTRFS_IOC_SUBVOL_SETFLAGS:
|
|
|
|
return btrfs_ioctl_subvol_setflags(file, argp);
|
2009-12-12 05:11:29 +08:00
|
|
|
case BTRFS_IOC_DEFAULT_SUBVOL:
|
|
|
|
return btrfs_ioctl_default_subvol(file, argp);
|
2008-06-12 09:53:53 +08:00
|
|
|
case BTRFS_IOC_DEFRAG:
|
2010-03-11 22:42:04 +08:00
|
|
|
return btrfs_ioctl_defrag(file, NULL);
|
|
|
|
case BTRFS_IOC_DEFRAG_RANGE:
|
|
|
|
return btrfs_ioctl_defrag(file, argp);
|
2008-06-12 09:53:53 +08:00
|
|
|
case BTRFS_IOC_RESIZE:
|
2012-11-26 16:43:45 +08:00
|
|
|
return btrfs_ioctl_resize(file, argp);
|
2008-06-12 09:53:53 +08:00
|
|
|
case BTRFS_IOC_ADD_DEV:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_add_dev(fs_info, argp);
|
2008-06-12 09:53:53 +08:00
|
|
|
case BTRFS_IOC_RM_DEV:
|
2012-11-26 16:44:50 +08:00
|
|
|
return btrfs_ioctl_rm_dev(file, argp);
|
2016-02-13 10:01:39 +08:00
|
|
|
case BTRFS_IOC_RM_DEV_V2:
|
|
|
|
return btrfs_ioctl_rm_dev_v2(file, argp);
|
2011-03-11 22:41:01 +08:00
|
|
|
case BTRFS_IOC_FS_INFO:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_fs_info(fs_info, argp);
|
2011-03-11 22:41:01 +08:00
|
|
|
case BTRFS_IOC_DEV_INFO:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_dev_info(fs_info, argp);
|
2010-03-01 04:39:26 +08:00
|
|
|
case BTRFS_IOC_TREE_SEARCH:
|
2022-01-16 10:48:47 +08:00
|
|
|
return btrfs_ioctl_tree_search(inode, argp);
|
2014-01-30 23:24:03 +08:00
|
|
|
case BTRFS_IOC_TREE_SEARCH_V2:
|
2022-01-16 10:48:47 +08:00
|
|
|
return btrfs_ioctl_tree_search_v2(inode, argp);
|
2010-03-01 04:39:26 +08:00
|
|
|
case BTRFS_IOC_INO_LOOKUP:
|
2022-01-05 16:30:06 +08:00
|
|
|
return btrfs_ioctl_ino_lookup(root, argp);
|
2011-07-07 22:48:38 +08:00
|
|
|
case BTRFS_IOC_INO_PATHS:
|
|
|
|
return btrfs_ioctl_ino_to_path(root, argp);
|
|
|
|
case BTRFS_IOC_LOGICAL_INO:
|
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.
Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.
Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values. Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized. The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.
To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field. The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.
Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different). A version parameter and an 'if' statement will suffice.
Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.
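The version check described above can be modelled in plain C. This is only a sketch and not the kernel code: struct logical_ino_args_model, MODEL_IGNORE_OFFSET and check_logical_ino_args() are hypothetical stand-ins, and placing flags directly after the shortened reserved[] array is an assumption about which slot was reused.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the argument layout described above. */
struct logical_ino_args_model {
	uint64_t logical;
	uint64_t size;
	uint64_t reserved[3];	/* one element shorter than in V1 */
	uint64_t flags;		/* assumed to reuse the last reserved slot */
	uint64_t inodes;
};

/* Stand-in for BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET. */
#define MODEL_IGNORE_OFFSET (1ULL << 0)

/*
 * Returns 0 and stores the offset-filter decision in *ignore_offset,
 * or -EINVAL if a V2 caller set reserved fields or unknown flag bits.
 */
static int check_logical_ino_args(const struct logical_ino_args_model *a,
				  int version, int *ignore_offset)
{
	size_t i;

	assert(a != NULL && ignore_offset != NULL);

	if (version == 1) {
		/* V1 never inspected these fields, so they stay ignored. */
		*ignore_offset = 0;
		return 0;
	}

	for (i = 0; i < sizeof(a->reserved) / sizeof(a->reserved[0]); i++)
		if (a->reserved[i] != 0)
			return -EINVAL;

	if (a->flags & ~MODEL_IGNORE_OFFSET)
		return -EINVAL;

	*ignore_offset = (a->flags & MODEL_IGNORE_OFFSET) != 0;
	return 0;
}
```

Rejecting unknown bits with -EINVAL is what lets a new program probe whether the running kernel honors a given flag.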
Motivation and background, copied from the patchset cover letter:
Suppose we have a file with one extent:
root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync
Split the extent by overwriting it in the middle:
root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a
We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]
and the ref tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]
There are two references to the same extent with different, non-overlapping
byte offsets:
[------------------72K extent at 1103101952----------------------]
[--8K----------------|--4K unreachable----|--60K-----------------]
^                                         ^
|                                         |
[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                     |
                     v
                     [-----4K extent-----] at 1103175680
We want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5
root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5 <- same extent ref as offset 0
(offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5 \
inode 261 offset 20480 root 5 |
inode 261 offset 24576 root 5 |
inode 261 offset 28672 root 5 |
inode 261 offset 32768 root 5 |
inode 261 offset 36864 root 5 \
inode 261 offset 40960 root 5 > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5 / More processing required in userspace
inode 261 offset 49152 root 5 | to figure out these are all duplicates.
inode 261 offset 53248 root 5 |
inode 261 offset 57344 root 5 |
inode 261 offset 61440 root 5 |
inode 261 offset 65536 root 5 |
inode 261 offset 69632 root 5 /
In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.
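The call counts quoted above follow from dividing the extent length by the 4K block size. A minimal sketch of that arithmetic, using a hypothetical helper probes_needed() that is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Number of block-sized probes needed to cover an extent, rounding up. */
static uint64_t probes_needed(uint64_t extent_bytes, uint64_t block_bytes)
{
	assert(block_bytes > 0);
	return (extent_bytes + block_bytes - 1) / block_bytes;
}
```

For the 73728-byte extent in the example this gives 18 probes, and for a 128MB extent it gives 32768.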
With the patch, we just use one call to map all refs to the extent at once:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5
The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references. Userspace can use this information to make
better choices to dedup or defrag.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
[ copy background and motivation from cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-23 01:58:46 +08:00
|
|
|
return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
|
|
|
|
case BTRFS_IOC_LOGICAL_INO_V2:
|
|
|
|
return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
|
2010-01-14 02:19:06 +08:00
|
|
|
case BTRFS_IOC_SPACE_INFO:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_space_info(fs_info, argp);
|
2013-09-23 18:35:11 +08:00
|
|
|
case BTRFS_IOC_SYNC: {
|
|
|
|
int ret;
|
|
|
|
|
2021-01-11 18:58:11 +08:00
|
|
|
ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
|
2013-09-23 18:35:11 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2016-06-23 06:54:23 +08:00
|
|
|
ret = btrfs_sync_fs(inode->i_sb, 1);
|
2014-07-23 20:39:35 +08:00
|
|
|
/*
|
|
|
|
* The transaction thread may want to do more work,
|
2016-05-20 09:18:45 +08:00
|
|
|
* namely it pokes the cleaner kthread that will start
|
2014-07-23 20:39:35 +08:00
|
|
|
* processing uncleaned subvols.
|
|
|
|
*/
|
2016-06-23 06:54:23 +08:00
|
|
|
wake_up_process(fs_info->transaction_kthread);
|
2013-09-23 18:35:11 +08:00
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish its commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-30 03:41:32 +08:00
|
|
|
case BTRFS_IOC_START_SYNC:
|
2012-11-26 16:40:43 +08:00
|
|
|
return btrfs_ioctl_start_sync(root, argp);
|
2010-10-30 03:41:32 +08:00
|
|
|
case BTRFS_IOC_WAIT_SYNC:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_wait_sync(fs_info, argp);
|
2011-03-11 22:41:01 +08:00
|
|
|
case BTRFS_IOC_SCRUB:
|
2012-11-26 16:48:01 +08:00
|
|
|
return btrfs_ioctl_scrub(file, argp);
|
2011-03-11 22:41:01 +08:00
|
|
|
case BTRFS_IOC_SCRUB_CANCEL:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_scrub_cancel(fs_info);
|
2011-03-11 22:41:01 +08:00
|
|
|
case BTRFS_IOC_SCRUB_PROGRESS:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_scrub_progress(fs_info, argp);
|
2012-01-17 04:04:47 +08:00
|
|
|
case BTRFS_IOC_BALANCE_V2:
|
2012-05-11 18:11:26 +08:00
|
|
|
return btrfs_ioctl_balance(file, argp);
|
2012-01-17 04:04:49 +08:00
|
|
|
case BTRFS_IOC_BALANCE_CTL:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_balance_ctl(fs_info, arg);
|
2012-01-17 04:04:49 +08:00
|
|
|
case BTRFS_IOC_BALANCE_PROGRESS:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_balance_progress(fs_info, argp);
|
2012-07-25 23:35:53 +08:00
|
|
|
case BTRFS_IOC_SET_RECEIVED_SUBVOL:
|
|
|
|
return btrfs_ioctl_set_received_subvol(file, argp);
|
2014-01-31 04:17:00 +08:00
|
|
|
#ifdef CONFIG_64BIT
|
|
|
|
case BTRFS_IOC_SET_RECEIVED_SUBVOL_32:
|
|
|
|
return btrfs_ioctl_set_received_subvol_32(file, argp);
|
|
|
|
#endif
|
2012-07-26 05:19:24 +08:00
|
|
|
case BTRFS_IOC_SEND:
|
2022-01-16 10:48:47 +08:00
|
|
|
return _btrfs_ioctl_send(inode, argp, false);
|
2017-09-27 22:43:13 +08:00
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
|
|
|
|
case BTRFS_IOC_SEND_32:
|
2022-01-16 10:48:47 +08:00
|
|
|
return _btrfs_ioctl_send(inode, argp, true);
|
2017-09-27 22:43:13 +08:00
|
|
|
#endif
|
2012-05-25 22:06:09 +08:00
|
|
|
case BTRFS_IOC_GET_DEV_STATS:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_get_dev_stats(fs_info, argp);
|
2011-09-14 21:53:51 +08:00
|
|
|
case BTRFS_IOC_QUOTA_CTL:
|
2012-11-26 16:50:11 +08:00
|
|
|
return btrfs_ioctl_quota_ctl(file, argp);
|
2011-09-14 21:53:51 +08:00
|
|
|
case BTRFS_IOC_QGROUP_ASSIGN:
|
2012-11-26 16:50:11 +08:00
|
|
|
return btrfs_ioctl_qgroup_assign(file, argp);
|
2011-09-14 21:53:51 +08:00
|
|
|
case BTRFS_IOC_QGROUP_CREATE:
|
2012-11-26 16:50:11 +08:00
|
|
|
return btrfs_ioctl_qgroup_create(file, argp);
|
2011-09-14 21:53:51 +08:00
|
|
|
case BTRFS_IOC_QGROUP_LIMIT:
|
2012-11-26 16:50:11 +08:00
|
|
|
return btrfs_ioctl_qgroup_limit(file, argp);
|
2013-04-26 00:04:51 +08:00
|
|
|
case BTRFS_IOC_QUOTA_RESCAN:
|
|
|
|
return btrfs_ioctl_quota_rescan(file, argp);
|
|
|
|
case BTRFS_IOC_QUOTA_RESCAN_STATUS:
|
2019-10-11 08:23:11 +08:00
|
|
|
return btrfs_ioctl_quota_rescan_status(fs_info, argp);
|
2013-05-07 03:14:17 +08:00
|
|
|
case BTRFS_IOC_QUOTA_RESCAN_WAIT:
|
2019-10-11 08:23:11 +08:00
|
|
|
return btrfs_ioctl_quota_rescan_wait(fs_info, argp);
|
2012-11-06 22:08:53 +08:00
|
|
|
case BTRFS_IOC_DEV_REPLACE:
|
2016-06-23 06:54:24 +08:00
|
|
|
return btrfs_ioctl_dev_replace(fs_info, argp);
|
2013-11-16 04:33:55 +08:00
|
|
|
case BTRFS_IOC_GET_SUPPORTED_FEATURES:
|
2016-02-17 22:26:27 +08:00
|
|
|
return btrfs_ioctl_get_supported_features(argp);
|
2013-11-16 04:33:55 +08:00
|
|
|
case BTRFS_IOC_GET_FEATURES:
|
2019-10-11 08:23:11 +08:00
|
|
|
return btrfs_ioctl_get_features(fs_info, argp);
|
2013-11-16 04:33:55 +08:00
|
|
|
case BTRFS_IOC_SET_FEATURES:
|
|
|
|
return btrfs_ioctl_set_features(file, argp);
|
2018-05-21 09:09:42 +08:00
|
|
|
case BTRFS_IOC_GET_SUBVOL_INFO:
|
2022-01-16 10:48:47 +08:00
|
|
|
return btrfs_ioctl_get_subvol_info(inode, argp);
|
2018-05-21 09:09:43 +08:00
|
|
|
case BTRFS_IOC_GET_SUBVOL_ROOTREF:
|
2022-01-05 16:30:06 +08:00
|
|
|
return btrfs_ioctl_get_subvol_rootref(root, argp);
|
2018-05-21 09:09:44 +08:00
|
|
|
case BTRFS_IOC_INO_LOOKUP_USER:
|
|
|
|
return btrfs_ioctl_ino_lookup_user(file, argp);
|
2021-07-01 04:01:49 +08:00
|
|
|
case FS_IOC_ENABLE_VERITY:
|
|
|
|
return fsverity_ioctl_enable(file, (const void __user *)argp);
|
|
|
|
case FS_IOC_MEASURE_VERITY:
|
|
|
|
return fsverity_ioctl_measure(file, argp);
|
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite          btrfs_encoded_read
---------------------------------------------------
(enter)                     (enter)
                            btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
                            lock_extent_bits
                            read extent (dirty page hasn't been flushed,
                                         so this is the old data)
                            unlock_extent_cached
                            (exit)
In this interleaving we read the old data from before the page was
dirtied. But that's true even if we were to hold the mmap lock:
btrfs_page_mkwrite               btrfs_encoded_read
-------------------------------------------------------------------
(enter)                          (enter)
                                 btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
                                 btrfs_wait_ordered_range
                                 lock_extent_bits
                                 read extent (page hasn't been dirtied,
                                              so this is the old data)
                                 unlock_extent_cached
                                 btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-10 08:59:07 +08:00
|
|
|
case BTRFS_IOC_ENCODED_READ:
|
|
|
|
return btrfs_ioctl_encoded_read(file, argp, false);
|
2019-08-14 07:00:02 +08:00
|
|
|
case BTRFS_IOC_ENCODED_WRITE:
|
|
|
|
return btrfs_ioctl_encoded_write(file, argp, false);
|
2019-10-10 08:59:07 +08:00
|
|
|
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
|
|
|
|
case BTRFS_IOC_ENCODED_READ_32:
|
|
|
|
return btrfs_ioctl_encoded_read(file, argp, true);
|
2019-08-14 07:00:02 +08:00
|
|
|
case BTRFS_IOC_ENCODED_WRITE_32:
|
|
|
|
return btrfs_ioctl_encoded_write(file, argp, true);
|
2019-10-10 08:59:07 +08:00
|
|
|
#endif
|
2008-06-12 09:53:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return -ENOTTY;
|
|
|
|
}
|
2015-10-29 16:22:21 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
long btrfs_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
2017-02-07 08:39:09 +08:00
|
|
|
/*
|
|
|
|
* These all access 32-bit values anyway so no further
|
|
|
|
* handling is necessary.
|
|
|
|
*/
|
2015-10-29 16:22:21 +08:00
|
|
|
switch (cmd) {
|
|
|
|
case FS_IOC32_GETVERSION:
|
|
|
|
cmd = FS_IOC_GETVERSION;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return btrfs_ioctl(file, cmd, (unsigned long) compat_ptr(arg));
|
|
|
|
}
|
|
|
|
#endif
|