When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of other inode that
has the same number as the current progress, we were always telling our
caller that the inode's reference was overwritten, which is incorrect
because the other inode might be a new inode (different generation number)
in which case we must return false from did_overwrite_ref() so that its
callers don't use an orphanized path for the inode (as it will never be
orphanized, instead it will be unlinked and the new inode created later).
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single extent of 64K.
mkdir -p $SCRATCH_MNT/foo
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
| _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/mysnap1
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
$SCRATCH_MNT/mysnap2
echo "File digest before being replaced:"
md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch
# Remove the file and then create a new one in the same location with
# the same name but with different content. This new file ends up
# getting the same inode number as the previous one, because that inode
# number was the highest inode number used by the snapshot's root and
# therefore when attempting to find the a new inode number for the new
# file, we end up reusing the same inode number. This happens because
# currently btrfs uses the highest inode number summed by 1 for the
# first inode created once a snapshot's root is loaded (done at
# fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
# tree).
# Having these two different files in the snapshots with the same inode
# number (but different generation numbers) caused the btrfs send code
# to emit an incorrect path for the file when issuing an unlink
# operation because it failed to realize they were different files.
rm -f $SCRATCH_MNT/mysnap2/foo/bar
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
$SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
$SCRATCH_MNT/mysnap2_ro
_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
$SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap
echo "File digest in the original filesystem after being replaced:"
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
# Now recreate the filesystem by receiving both send streams and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap
echo "File digest in the new filesystem:"
# Must match the digest from the new file.
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
status=0
exit
Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: 8b191a6849 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Neil Horman pointed out a problem where if he did something like this
receive A
snap A B
change B
send -p A B
and then on another box do
recieve A
receive B
the receive B would fail because we use the UUID of A for the clone sources for
B. This makes sense most of the time because normally you are sending from the
original sources, not a received source. However when you use a recieved subvol
its UUID is going to be something completely different, so if you then try to
receive the diff on a different volume it won't find the UUID because the new A
will be something else. The only constant is the received uuid. So instead
check to see if we have received_uuid set on the root, and if so use that as the
clone source, as btrfs receive looks for matches either in received_uuid or
uuid. Thanks,
Reported-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
Marc reported a problem where the receiving end of an incremental send
was performing clone operations that failed with -EINVAL. This happened
because, unlike for uncompressed extents, we were not checking if the
source clone offset and length, after summing the data offset, falls
within the source file's boundaries.
So make sure we do such checks when attempting to issue clone operations
for compressed extents.
Problem reproducible with the following steps:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount -o compress /dev/sdc /mnt2
# Create the file with a single extent of 128K. This creates a metadata file
# extent item with a data start offset of 0 and a logical length of 128K.
$ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
# Now rewrite the range 64K to 112K of our file. This will make the inode's
# metadata continue to point to the 128K extent we created before, but now
# with an extent item that points to the extent with a data start offset of
# 112K and a logical length of 16K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 192K.
$ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
# Now rewrite the range 180K to 12K. This will make the inode's metadata
# continue to point the the 128K extent we created earlier, with a single
# extent item that points to it with a start offset of 112K and a logical
# length of 4K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 180K.
$ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ touch /mnt/bar
# Calls the btrfs clone ioctl.
$ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
-l $((4 * 1024)) /mnt/foo /mnt/bar
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
At subvol /mnt/snap1
At subvol snap1
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
At subvol /mnt/snap2
At snapshot snap2
ERROR: failed to clone extents to bar
Invalid argument
A test case for fstests follows soon.
Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.cz>
Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir -p /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/t6/t7
$ mkdir /mnt/data/t5
$ mkdir /mnt/data/t7
$ mkdir /mnt/data/n4/t2
$ mkdir /mnt/data/t4
$ mkdir /mnt/data/t3
$ mv /mnt/data/t7 /mnt/data/n4/t2
$ mv /mnt/data/t4 /mnt/data/n4/t2/t7
$ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
$ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
$ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
$ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
$ mv /mnt/data/n4/t2 /mnt/data/n4/n1
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
$ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
$ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
$ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
Where the parent snapshot directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- t2/ (ino 265)
|-- t7/ (ino 264)
|-- t4/ (ino 266)
|-- t5/ (ino 263)
|-- t6/ (ino 261)
|-- n1/ (ino 258)
|-- n2/ (ino 259)
|-- t7/ (ino 262)
|-- t3/ (ino 267)
And the send snapshot's directory hierarchy is the following:
. (ino 256)
|-- data/ (ino 257)
|-- n4/ (ino 260)
|-- n1/ (ino 258)
|-- t2/ (ino 265)
|-- n2/ (ino 259)
|-- t3/ (ino 267)
| |-- t7 (ino 264)
|
|-- t6/ (ino 261)
| |-- t4/ (ino 266)
| |-- t5/ (ino 263)
|
|-- t7/ (ino 262)
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Even though we delay the rename of directories when they become
descendents of other directories that were also renamed in the send
root to prevent infinite path build loops, we were doing it in cases
where this was not needed and was actually harmful resulting in
infinite path build loops as we ended up with a circular dependency
of delayed directory renames.
Consider the following reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ mkdir /mnt/data
$ mkdir /mnt/data/n1
$ mkdir /mnt/data/n1/n2
$ mkdir /mnt/data/n4
$ mkdir /mnt/data/n1/n2/p1
$ mkdir /mnt/data/n1/n2/p1/p2
$ mkdir /mnt/data/t6
$ mkdir /mnt/data/t7
$ mkdir -p /mnt/data/t5/t7
$ mkdir /mnt/data/t2
$ mkdir /mnt/data/t4
$ mkdir -p /mnt/data/t1/t3
$ mkdir /mnt/data/p1
$ mv /mnt/data/t1 /mnt/data/p1
$ mkdir -p /mnt/data/p1/p2
$ mv /mnt/data/t4 /mnt/data/p1/p2/t1
$ mv /mnt/data/t5 /mnt/data/n4/t5
$ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2
$ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7
$ mv /mnt/data/t2 /mnt/data/n4/t1
$ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1
$ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2
$ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1
$ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7
$ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3
$ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1
$ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5
$ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1
$ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1
$ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/
$ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2
$ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1
$ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1
$ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3
$ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2
$ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7
$ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1
$ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5
$ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5
$ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2
$ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2
ERROR: send ioctl failed with -12: Cannot allocate memory
This reproducer resulted in an infinite path build loop when building the
path for inode 266 because the following circular dependency of delayed
directory renames was created:
ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261
Where the notation "X <- Y" means the rename of inode X is delayed by the
rename of inode Y (X will be renamed after Y is renamed). This resulted
in an infinite path build loop of inode 266 because that inode has inode
261 as an ancestor in the send root and inode 261 is in the circular
dependency of delayed renames listed above.
Fix this by not delaying the rename of a directory inode if an ancestor of
the inode in the send root, which has a delayed rename operation, is not
also a descendent of the inode in the parent root.
Thanks to Robbie Ko for sending the reproducer example.
A test case for xfstests follows soon.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
The logic to detect path loops when attempting to apply a pending
directory rename, introduced in commit
f959492fc1 (Btrfs: send, fix more issues related to directory renames)
is no longer needed, and the respective fstests test case for that commit,
btrfs/045, now passes without this code (as well as all the other test
cases for send/receive).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If a directory's reference ends up being orphanized, because the inode
currently being processed has a new path that matches that directory's
path, make sure we evict the name of the directory from the name cache.
This is because there might be descendent inodes (either directories or
regular files) that will be orphanized later too, and therefore the
orphan name of the ancestor must be used, otherwise we send issue rename
operations with a wrong path in the send stream.
Reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir -p /mnt/data/n1/n2/p1/p2
$ mkdir /mnt/data/n4
$ mkdir -p /mnt/data/p1/p2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/data/p1/p2 /mnt/data
$ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
$ mv /mnt/data/p2 /mnt/data/n1/n2/p1
$ mv /mnt/data/n1/n2 /mnt/data/p1
$ mv /mnt/data/p1 /mnt/data/n4
$ mv /mnt/data/n4/p1/n2/p1 /mnt/data
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.send
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ btrfs receive /mnt2 -f /tmp/1.send
$ btrfs receive /mnt2 -f /tmp/2.send
ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory
Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent
snapshot are both orphanized during the incremental send, and as soon as
data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2
we use a source path of o263-6-o/p2 for the rename operation instead of
the old path data/p1/p2 (the one before the orphanization of inode 263).
A test case for xfstests follows soon.
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the clone root was not readonly or the dead flag was set on it, we were
leaving without decrementing the root's send_progress counter (and before
we just incremented it). If a concurrent snapshot deletion was in progress
and ended up being aborted, it would be impossible to later attempt to
delete again the snapshot, since the root's send_in_progress counter could
never go back to 0.
We were also setting clone_sources_to_rollback to i + 1 too early - if we
bailed out because the clone root we got is not readonly or flagged as dead
we ended up later derreferencing a null pointer because we didn't assign
the clone root to sctx->clone_roots[i].root:
for (i = 0; sctx && i < clone_sources_to_rollback; i++)
btrfs_root_dec_send_in_progress(
sctx->clone_roots[i].root);
So just don't increment the send_in_progress counter if the root is readonly
or flagged as dead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
After we locked the root's root item, a concurrent snapshot deletion
call might have set the dead flag on it. So check if the dead flag
is set and abort if it is, just like we do for the parent root.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/a
$ mkdir /mnt/b
$ mkdir /mnt/c
$ touch /mnt/a/file
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/c /mnt/x
$ mv /mnt/a /mnt/x/y
$ mv /mnt/b /mnt/a
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.send
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ btrfs receive /mnt2 -f /tmp/1.send
$ btrfs receive /mnt2 -f /tmp/2.send
ERROR: rename b -> a failed. Directory not empty
A test case for xfstests follows soon.
Reported-by: Ames Cornish <ames@cornishes.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If between two snapshots we rename an existing directory named X to Y and
make it a child (direct or not) of a new inode named X, we were delaying
the move/rename of the former directory unnecessarily, which would result
in attempting to rename the new directory from its orphan name to name X
prematurely.
Minimal reproducer:
$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ mkdir -p /mnt/merlin/RC/OSD/Source
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ mkdir /mnt/OSD
$ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
$ mv /mnt/OSD /mnt/merlin/RC
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
$ btrfs send /mnt/mysnap1 -f /tmp/1.snap
$ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
$ mkfs.btrfs -f /dev/vdc
$ mount /dev/vdc /mnt2
$ btrfs receive /mnt2 -f /tmp/1.snap
$ btrfs receive /mnt2 -f /tmp/2.snap
The second receive (from an incremental send) failed with the following
error message: "rename o261-7-0 -> merlin/RC/OSD failed".
This is a regression introduced in the 3.16 kernel.
A test case for xfstests follows.
Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Maximum xattr size can be up to nearly the leaf size. For an fs with a
leaf size larger than the page size, using kmalloc requires allocating
multiple pages that are contiguous, which might not be possible if
there's heavy memory fragmentation. Therefore fallback to vmalloc if
we fail to allocate with kmalloc. Also start with a smaller buffer size,
since xattr values typically are smaller than a page.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Fix the following sparse warning:
fs/btrfs/send.c:518:51: warning: incorrect type in argument 2 (different address spaces)
fs/btrfs/send.c:518:51: expected char const [noderef] <asn:1>*<noident>
fs/btrfs/send.c:518:51: got char *
We can safely use (const char __user *) with set_fs(KERNEL_DS)
__force added to avoid sparse-all warning:
fs/btrfs/send.c:518:40: warning: cast adds address space to expression (<asn:1>)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Chris Mason <clm@fb.com>
We were limiting the sum of the xattr name and value lengths to PATH_MAX,
which is not correct, specially on filesystems created with btrfs-progs
v3.12 or higher, where the default leaf size is max(16384, PAGE_SIZE), or
systems with page sizes larger than 4096 bytes.
Xattrs have their own specific maximum name and value lengths, which depend
on the leaf size, therefore use these limits to be able to send xattrs with
sizes larger than PATH_MAX.
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we are doing an incremental send and the base snapshot has a
directory with name X that doesn't exist anymore in the second
snapshot and a new subvolume/snapshot exists in the second snapshot
that has the same name as the directory (name X), the incremental
send would fail with -ENOENT error. This is because it attempts
to lookup for an inode with a number matching the objectid of a
root, which doesn't exist.
Steps to reproduce:
mkfs.btrfs -f /dev/sdd
mount /dev/sdd /mnt
mkdir /mnt/testdir
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
rmdir /mnt/testdir
btrfs subvolume create /mnt/testdir
btrfs subvolume snapshot -r /mnt /mnt/mysnap2
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
A test case for xfstests follows.
Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
I've noticed an extra line after "use no compression", but search
revealed much more in messages of more critical levels and rare errors.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This is a continuation of the previous changes titled:
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
There's a few more cases where a directory rename/move must be delayed which was
previously overlooked. If our immediate ancestor has a lower inode number than
ours and it doesn't have a delayed rename/move operation associated to it, it
doesn't mean there isn't any non-direct ancestor of our current inode that needs
to be renamed/moved before our current inode (i.e. with a higher inode number
than ours).
So we can't stop the search if our immediate ancestor has a lower inode number than
ours, we need to navigate the directory hierarchy upwards until we hit the root or:
1) find an ancestor with an higher inode number that was renamed/moved in the send
root too (or already has a pending rename/move registered);
2) find an ancestor that is a new directory (higher inode number than ours and
exists only in the send root).
Reproducer for case 1)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir -p /mnt/a/c/d
$ mkdir /mnt/a/b/e
$ mkdir /mnt/a/c/d/f
$ mv /mnt/a/b /mnt/a/c/d/2b
$ mkdir /mnt/a/x
$ mkdir /mnt/a/y
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x /mnt/a/y
$ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
$ mv /mnt/a/c/d /mnt/a/h/2d
$ mv /mnt/a/c /mnt/a/h/2d/2b/2c
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
Simple reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/a/c
$ mv /mnt/a/b /mnt/a/c/b2
$ mkdir /mnt/a/e
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/c/b2 /mnt/a/e/b3
$ mkdir /mnt/a/e/b3/f
$ mkdir /mnt/a/h
$ mv /mnt/a/c /mnt/a/e/b3/f/c2
$ mv /mnt/a/e /mnt/a/h/e2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
Another simple reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/a/c
$ mkdir /mnt/a/b/d
$ mkdir /mnt/a/c/e
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/b/d/f
$ mkdir /mnt/a/b/g
$ mv /mnt/a/c/e /mnt/a/b/g/e2
$ mv /mnt/a/c /mnt/a/b/d/f/c2
$ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
More complex reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir -p /mnt/a/c/d
$ mkdir /mnt/a/b/e
$ mkdir /mnt/a/c/d/f
$ mv /mnt/a/b /mnt/a/c/d/2b
$ mkdir /mnt/a/x
$ mkdir /mnt/a/y
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x /mnt/a/y
$ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
$ mv /mnt/a/c/d /mnt/a/h/2d
$ mv /mnt/a/c /mnt/a/h/2d/2b/2c
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
For both cases the incremental send would enter an infinite loop when building
path strings.
While solving these cases, this change also re-implements the code to detect
when directory moves/renames should be delayed. Instead of dealing with several
specific cases separately, it's now more generic handling all cases with a simple
detection algorithm and if when applying a delayed move/rename there's a path loop
detected, it further delays the move/rename registering a new ancestor inode as
the dependency inode (so our rename happens after that ancestor is renamed).
Tests for these cases is being added to xfstests too.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we have directories with a pending move/rename operation, we must take into
account any orphan directories that got created before executing the pending
move/rename. Those orphan directories are directories with an inode number higher
then the current send progress and that don't exist in the parent snapshot, they
are created before current progress reaches their inode number, with a generated
name of the form oN-M-I and at the root of the filesystem tree, and later when
progress matches their inode number, moved/renamed to their final location.
Reproducer:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/e
$ mv /mnt/a/b/c /mnt/a/b/e/CC
$ mkdir /mnt/a/b/e/CC/d/f
$ mkdir /mnt/a/g
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/g/h
$ mv /mnt/a/b/e /mnt/a/g/h/EE
$ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The second receive command failed with the following error:
ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Regardless of whether the caller is interested or not in knowing the inode's
generation (dir_gen != NULL), get_first_ref always does a btree lookup to get
the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which
is in some cases).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The patch "Btrfs: fix protection between send and root deletion"
(18f687d538) does not actually prevent to delete the snapshot
and just takes care during background cleaning, but this seems rather
user unfriendly, this patch implements the idea presented in
http://www.spinics.net/lists/linux-btrfs/msg30813.html
- add an internal root_item flag to denote a dead root
- check if the send_in_progress is set and refuse to delete, otherwise
set the flag and proceed
- check the flag in send similar to the btrfs_root_readonly checks, for
all involved roots
The root lookup in send via btrfs_read_fs_root_no_name will check if the
root is really dead or not. If it is, ENOENT, aborted send. If it's
alive, it's protected by send_in_progress, send can continue.
CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If a path has more than 230 characters, we allocate a new buffer to
use for the path, but we were forgotting to copy the contents of the
previous buffer into the new one, which has random content from the
kmalloc call.
Test:
mkfs.btrfs -f /dev/sdd
mount /dev/sdd /mnt
TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#"
mkdir -p $TEST_PATH
echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
btrfs send /mnt/mysnap1 -f /tmp/1.snap
A test for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Cc: Marc Merlin <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Chris Mason <clm@fb.com>
fs_path_ensure_buf is used to make sure our path buffers for
send are big enough for the path names as we construct them.
The buffer size is limited to 32K by the length field in
the struct.
But bugs in the path construction can end up trying to build
a huge buffer, and we'll do invalid memmmoves when the
buffer length field wraps.
This patch is step one, preventing the overflows.
Signed-off-by: Chris Mason <clm@fb.com>
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation delayed.
We were ignoring the fact that if the inode's new immediate ancestor has a higher
inode number than ours but wasn't renamed/moved, we might still need to delay our
move/rename, because some other ancestor directory higher in the hierarchy might
have an inode number higher than ours *and* was renamed/moved too - in this case
we have to wait for rename/move of that ancestor to happen before our current
directory's rename/move operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/x1/x2
$ mkdir /mnt/a/Z
$ mkdir -p /mnt/a/x1/x2/x3/x4/x5
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
$ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.
A more complex scenario:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/c/d/e
$ mkdir /mnt/a/b/c/d/f
$ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
$ mkdir /mmt/a/b/c/g
$ mv /mnt/a/b/c/d /mnt/a/b/D2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/o
$ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
$ mv /mnt/a/b/D2 /mnt/a/b/dd
$ mv /mnt/a/b/c /mnt/a/C2
$ mv /mnt/a/b/dd/f /mnt/a/o/FF
$ mv /mnt/a/b /mnt/a/o/FF/E2/BB
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily means the child rename/move operation
can be performed immediately. The parent migth have its own rename/move
operation delayed, therefore in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.
Steps to reproduce the issue:
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C
$ mv /mnt/C /mnt/A
$ mv /mnt/B /mnt/A/C
$ mkdir /mnt/A/C/D
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/A/C/D /mnt/A/D2
$ mv /mnt/A/C/B /mnt/A/D2/B2
$ mv /mnt/A/C /mnt/A/D2/B2/C2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.
The necessary conditions here are that C has an inode number higher than both
A and B, and B as an higher inode number higher than A, and D has the highest
inode number, that is:
inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between A2 and B2, and between B2 and C2.
A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
No need to search in the send tree for the generation number of the inode,
we already have it in the recorded_ref structure passed to us.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs send reads data from disk and then writes to a stream via pipe or
a file via flush.
Currently we're going to read each page a time, so every page results
in a disk read, which is not friendly to disks, esp. HDD. Given that,
the performance can be gained by adding readahead for those pages.
Here is a quick test:
$ btrfs subvolume create send
$ xfs_io -f -c "pwrite 0 1G" send/foobar
$ btrfs subvolume snap -r send ro
$ time "btrfs send ro -f /dev/null"
w/o w
real 1m37.527s 0m9.097s
user 0m0.122s 0m0.086s
sys 0m53.191s 0m12.857s
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This has no functional change, only picks out the same part of two functions,
and makes it shared.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we're finishing processing of an inode, if we're dealing with a
directory inode that has a pending move/rename operation, we don't
need to send a utimes update instruction to the send stream, as we'll
do it later after doing the move/rename operation. Therefore we save
some time here building paths and doing btree lookups.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
It is really unnecessary to search tree again for @gen, @mode and @rdev
in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
and @gen, @mode and @rdev can easily be fetched.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
In "btrfs: send: lower memory requirements in common case" the code to
save the old_buf_len was incorrectly moved to a wrong place and broke
the original logic.
Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When doing an incremental send, if we had a directory pending a move/rename
operation and none of its parents, except for the immediate parent, were
pending a move/rename, after processing the directory's references, we would
be issuing utimes, chown and chmod intructions against am outdated path - a
path which matched the one in the parent root.
This change also simplifies a bit the code that deals with building a path
for a directory which has a move/rename operation delayed.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d/e
$ mkdir /mnt/btrfs/a/b/c/f
$ chmod 0777 /mnt/btrfs/a/b/c/d/e
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
$ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
$ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
$ chmod 0700 /mnt/btrfs/a/b/f2/e2
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: chmod a/b/c/d/e failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The incremental send algorithm assumed that it was possible to issue
a directory remove (rmdir) if the the inode number it was currently
processing was greater than (or equal) to any inode that referenced
the directory's inode. This wasn't a valid assumption because any such
inode might be a child directory that is pending a move/rename operation,
because it was moved into a directory that has a higher inode number and
was moved/renamed too - in other words, the case the following commit
addressed:
9f03740a95
(Btrfs: fix infinite path build loops in incremental send)
This made an incremental send issue an rmdir operation before the
target directory was actually empty, which made btrfs receive fail.
Therefore it needs to wait for all pending child directory inodes to
be moved/renamed before sending an rmdir operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/x
$ mkdir /mnt/btrfs/a/b/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
$ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. Directory not empty
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When doing an incremental send, if we delete a directory that has N > 1
hardlinks for the same file and that file has the highest inode number
inside the directory contents, an incremental send would send N times an
rmdir operation against the directory. This made the btrfs receive command
fail on the second rmdir instruction, as the target directory didn't exist
anymore.
Steps to reproduce the issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c
$ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
$ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ rm -f /mnt/btrfs/a/b/c/foo.txt
$ rm -f /mnt/btrfs/a/b/c/bar.txt
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This fixes yet one more case not caught by the commit titled:
Btrfs: fix infinite path build loops in incremental send
In this case, even before the initial full send, we have a directory
which is a child of a directory with a higher inode number. Then we
perform the initial send, and after we rename both the child and the
parent, without moving them around. After doing these 2 renames, an
incremental send sent a rename instruction for the child directory
which contained an invalid "from" path (referenced the parent's old
name, not the new one), which made the btrfs receive command fail.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b
$ mkdir /mnt/btrfs/d
$ mkdir /mnt/btrfs/a/b/c
$ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
$ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umout /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
"ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This reverts commit 41ce9970a8.
Previously i was thinking we can use readonly root's commit root
safely while it is not true, readonly root may be cowed with the
following cases.
1.snapshot send root will cow source root.
2.balance,device operations will also cow readonly send root
to relocate.
So i have two ideas to make us safe to use commit root.
-->approach 1:
make it protected by transaction and end transaction properly and we research
next item from root node(see btrfs_search_slot_for_read()).
-->approach 2:
add another counter to local root structure to sync snapshot with send.
and add a global counter to sync send with exclusive device operations.
So with approach 2, send can use commit root safely, because we make sure
send root can not be cowed during send. Unfortunately, it make codes *ugly*
and more complex to maintain.
To make snapshot and send exclusively, device operations and send operation
exclusively with each other is a little confusing for common users.
So why not drop into previous way.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The fs_path structure uses an inline buffer and falls back to a chain of
allocations, but vmalloc is not necessary because PATH_MAX fits into
PAGE_SIZE.
The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
usually 4k. Experimental measurements show that most paths on a single
filesystem do not exceed 200 bytes, and these get stored into the inline
buffer directly, which is now 230 bytes. Longer paths are kmalloced when
needed.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We have this pattern where we do search for a contiguous group of
items in a tree and everytime we find an item, we process it, then
we release our path, increment the offset of the search key, do
another full tree search and repeat these steps until a tree search
can't find more items we're interested in.
Instead of doing these full tree searches after processing each item,
just process the next item/slot in our leaf and don't release the path.
Since all these trees are read only and we always use the commit root
for a search and skip node/leaf locks, we're not affecting concurrency
on the trees.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This was a leftover from the commit:
74dd17fbe3
(Btrfs: fix btrfs send for inline items and compression)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If cleaning the name cache fails, we could try to proceed at the cost of
some memory leak. This is not expected to happen often.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
There are only 2 static callers, the BUG would normally be never
reached, but let's be nice.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We know that buf_len is at most PATH_MAX, 4k, and can merge it with the
reversed member. This saves 3 bytes in favor of inline_buf.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to keep track of that, it's available via is_vmalloc_addr.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The member is used only to return value back from
fs_path_prepare_for_add, we can do it locally and save 8 bytes for the
inline_buf path.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The buffer passed to snprintf can hold the fully expanded format string,
64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the
check entirely is a good idea, hence the ASSERT.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The commit titled "Btrfs: fix infinite path build loops in incremental send"
didn't cover a particular case where the parent-child relationship inversion
of directories doesn't imply a rename of the new parent directory. This was
due to a simple logic mistake, a logical and instead of a logical or.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44
$ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44
$ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3
$ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This fixes a case that the commit titled:
Btrfs: fix infinite path build loops in incremental send
didn't cover. If the parent-child relationship between 2 directories
is inverted, both get renamed, and the former parent has a file that
got renamed too (but remains a child of that directory), the incremental
send operation would use the file's old path after sending an unlink
operation for that old path, causing receive to fail on future operations
like changing owner, permissions or utimes of the corresponding inode.
This is not a regression from the commit mentioned before, as without
that commit we would fall into the issues that commit fixed, so it's
just one case that wasn't covered before.
Simple steps to reproduce this issue are:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ touch /mnt/btrfs/a/b/c/d/file
$ mkdir -p /mnt/btrfs/a/b/x
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2
$ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Function wait_for_parent_move() returns negative value if an error
happened, 0 if we don't need to wait for the parent's move, and
1 if the wait is needed.
Before this change an error return value was being treated like the
return value 1, which was not correct.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
For non compressed extents, iterate_extent_inodes() gives us offsets
that take into account the data offset from the file extent items, while
for compressed extents it doesn't. Therefore we have to adjust them before
placing them in a send clone instruction. Not doing this adjustment leads to
the receiving end requesting for a wrong a file range to the clone ioctl,
which results in different file content from the one in the original send
root.
Issue reproducible with the following excerpt from the test I made for
xfstests:
_scratch_mkfs
_scratch_mount "-o compress-force=lzo"
$XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
$XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
$XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
# will be used for incremental send to be able to issue clone operations
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
$FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
$FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
-x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
$FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
-x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
$BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
-c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
_scratch_unmount
_scratch_mkfs
_scratch_mount
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
$FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
$FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
$FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't
reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
which meant that a key part of Filipe's original patch was not being built in.
This appears to be a mess up with merging Filipe's patch as it does not exist in
his original patch. Fix this by changing how we make sure del_waiting_dir_move
asserts that it did not error and take the function out of the ifdef check.
This makes btrfs/030 pass with the assert on or off. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in",
LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not
possible to build a kernel with btrfs enabled (either as module or built-in)
if libcrc32c is not enabled as well. So just replace all uses of libcrc32c
with the equivalent function in btrfs hash.h - btrfs_crc32c.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size. The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.
Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
There was a case where file hole detection was incorrect and it would
cause an incremental send to override a section of a file with zeroes.
This happened in the case where between the last leaf we processed which
contained a file extent item for our current inode and the leaf we're
currently are at (and has a file extent item for our current inode) there
are only leafs containing exclusively file extent items for our current
inode, and none of them was updated since the previous send operation.
The file hole detection code would incorrectly consider the file range
covered by these leafs as a hole.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Instead of looking for a file extent item, process it, release the path
and do a btree search for the next file extent item, just process all
file extent items in a leaf without intermediate btree searches. This way
we save cpu and we're not blocking other tasks or affecting concurrency on
the btree, because send's paths use the commit root and skip btree node/leaf
locking.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The buffer size argument passed to snprintf must account for the
trailing null byte added by snprintf, and it returns a value >= then
sizeof(buffer) when the string can't fit in the buffer.
Since our buffer has a size of 64 characters, and the maximum orphan
name we can generate is 63 characters wide, we must pass 64 as the
buffer size to snprintf, and not 63.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
To search tree root without transaction protection, we should neither search commit
root nor skip locking here, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.
Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.
The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=`mktemp -d`
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $tmp
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
rm -f $seqres.full
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
_filter_scratch
$XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
_filter_scratch
$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
-f $tmp/2.snap 2>&1 | _filter_scratch
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
status=0
exit
The tests expected output is:
QA output created by 025
FSSync 'SCRATCH_MNT'
FSSync 'SCRATCH_MNT'
wrote 2978/2978 bytes at offset 1482752
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
At subvol SCRATCH_MNT/mysnap1
At subvol SCRATCH_MNT/mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
At subvol mysnap1
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
At snapshot mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We will finish orphan cleanups during snapshot, so we don't
have to commit transaction here.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.
1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.
2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.
Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.
Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:
1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt
# btrfs sub snapshot -r /mnt /mnt/snap1
# btrfs sub snapshot -r /mnt /mnt/snap2
# btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
# dmesg
The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since daivd did the work that force us to use readonly snapshot,
we can safely remove transaction protection from btrfs send.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.
Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.
We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.
Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Unused since ed2590953b
"Btrfs: stop using vfs_read in send".
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove ifdefed code:
- tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
- tlv_put_timespec - the btrfs_timespec fields are used
- fs_path_remove obsoleted long ago
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The closing parenthesis is in the wrong place. We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
I hit this problem with my no holes patch and it made me realize what the
problem was for bz 60834. If the first item in the leaf is an inline extent and
we try to read anything starting from disk_bytenr onward we will read off the
end of the leaf. So we need to check to see what it's type is, and if it's not
REG we can just break out. This should fix this problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Apparently we don't actually close the files until we return to userspace, so
stop using vfs_read in send. This is actually better for us since we can avoid
all the extra logic of holding the file we're sending open and making sure to
clean it up. This will fix people who have been hitting too many files open
errors when trying to send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused eb parameter from btrfs_item_nr
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs updates from Chris Mason:
"This is against 3.11-rc7, but was pulled and tested against your tree
as of yesterday. We do have two small incrementals queued up, but I
wanted to get this bunch out the door before I hop on an airplane.
This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.
We've included Stefan Behren's work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place. It closes a long standing performance issue that was built
in to the disk format.
Mark Fasheh's offline dedup work is also here. In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO. This is a building block where utilities are
able to ask the FS to dedup a series of extents. The kernel takes
care of verifying the data involved really is the same. Today this
involves reading both extents, but we'll continue to evolve the
patches"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
Btrfs: optimize key searches in btrfs_search_slot
Btrfs: don't use an async starter for most of our workers
Btrfs: only update disk_i_size as we remove extents
Btrfs: fix deadlock in uuid scan kthread
Btrfs: stop refusing the relocation of chunk 0
Btrfs: fix memory leak of uuid_root in free_fs_info
btrfs: reuse kbasename helper
btrfs: return btrfs error code for dev excl ops err
Btrfs: allow partial ordered extent completion
Btrfs: convert all bug_ons in free-space-cache.c
Btrfs: add support for asserts
Btrfs: adjust the fs_devices->missing count on unmount
Btrf: cleanup: don't check for root_refs == 0 twice
Btrfs: fix for patch "cleanup: don't check the same thing twice"
Btrfs: get rid of one BUG() in write_all_supers()
Btrfs: allocate prelim_ref with a slab allocater
Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
Btrfs: fix race between removing a dev and writing sbs
Btrfs: remove ourselves from the cluster list under lock
...
To get name of the file from a pathname let's use kbasename() helper. It allows
to simplify code a bit.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Send was just sending everything it found, even if the extent was a hole. This
is unpleasant for users, so just skip holes when we are sending. This will also
skip sending prealloc extents since the send spec doesn't have a prealloc
command. Eventually we will add a prealloc command and rev the send version so
we can send down the prealloc info. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.
All these changes should be obviously safe.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only. The way
we check for differences are if the leaves aren't the same and if the keys are
not the same within the leaves. So if neither leaf is the same (ie the leaf has
been cow'ed from the parent snapshot) we walk each item in the send root and
check it against the parent root. If the items match exactly then we don't do
anything. This doesn't quite work for inode refs, since they will just have the
name and the parent objectid. If you move the file from a directory and then
remove that directory and re-create a directory with the same inode number as
the old directory and then move that file back into that directory we will
assume that nothing changed and you will get errors when you try to receive.
In order to fix this we need to do extra checking to see if the inode ref really
is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match. Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing. The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume. If they match then they are the same directory
and we are good to go. If they don't we have to add them to the changed refs
list.
This means we have to track the generation of the ref we're trying to lookup
when we iterate all the refs for a particular inode. So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.
There was also the issue of using a ulist to keep track of the directories we
needed to check. Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value. So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.
Before we would fail all of the tests in the far-progs that related to moving
directories (test group 32). With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We have logic to see if we've already created a parent directory by check to see
if an inode inside of that directory has a lower inode number than the one we
are currently processing. The logic is that if there is a lower inode number
then we would have had to made sure the directory was created at that previous
point. The problem is that subvols inode numbers count from the lowest objectid
in the root tree, which may be less than our current progress. So just skip if
our dir item key is a root item. This fixes the original test and the xfstest
version I made that added an extra subvol create. Thanks,
Reported-by: Emil Karlson <jekarlson@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When doing a send with a parent subvol we will check to see if the file we are
acting on is being overwritten and move it if we think it may be needed further
down the line during the send. We check this by checking its directory and
making sure it existed in the parent and making sure the file existed in the
parent. The problem with this check is that if we create a directory and a file
in that directory, and then snapshot, and then remove and re-create that same
directory and file with different inode numbers and then try to snapshot and
send with the original parent we will try and save the original file inside of
that directory. This is a problem because during the receive we move the
directory out of the way because it is a completely new inode, which makes us
unable to find the old file inside of the directory when we try to move that out
of the way for the overwrite. We fix this by checking the parent directory of
the inode we think we are overwriting. If the parent directory generation in
the send root != the parent directory generation in the parent root then we know
it is a completely new directory and we need not bother with moving the file out
of the way because it would have been completely destroyed. This fixes bz
60673. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This fixes bugzilla 57491. If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems. When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup. So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it. The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup. With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to check for NULL in send.c and disk-io.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
sctx is removed from the argument of the function that
doesn't use sctx.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fget() returns NULL if error. So, we should check NULL or not.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Two new flags are added to allow omitting the stream header and the
end command for btrfs send streams. This is used in cases where you
send multiple snapshots back-to-back in one stream.
This used to be encoded like this (with 2 snapshots in this example):
<stream header> + <sequence of commands> + <end cmd> +
<stream header> + <sequence of commands> + <end cmd> + EOF
The new format (if the two new flags are used) is this one:
<stream header> + <sequence of commands> +
<sequence of commands> + <end cmd>
Note that the currently existing receivers treat <end cmd> only as
an indication that a new <stream header> is following. This means,
you can just skip the sequence <end cmd> <stream header> without
loosing compatibility. As long as an EOF is following, the currently
existing receivers handle the new format (if the two new flags are
used) exactly as the old one.
So what is the benefit of this change? The goal is to be able to use
a single stream (one TCP connection) to multiplex a request/response
handshake plus Btrfs send streams, all in the same stream. In this
case you cannot evaluate an EOF condition as an end of the Btrfs send
stream. You need something else, and the <end cmd> is just perfect
for this purpose.
The summary is:
The format change is driven by the need to send several Btrfs send
streams over a single TCP connections, with the ability for a repeated
request/response handshake in the middle. And this format change does
not break any existing tool, it is completely compatible.
You could compare the old behaviour of the Btrfs send stream to the
one of ftp where you need a seperate request/response channel and
newly opened data transfer channels for each file, while the new
behaviour is more like http using a single stream for everything.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.
I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit
When you take a snapshot, punch a hole where there has been data, then take
another snapshot and try to send an incremental stream, btrfs send would
give you EIO. That is because is_extent_unchanged had no support for holes
being punched. With this patch, instead of returning EIO we just return
0 (== the extent is not unchanged) and we're good.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Cc: Alexander Block <ablock84@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs update from Chris Mason:
"The biggest feature in the pull is the new (and still experimental)
raid56 code that David Woodhouse started long ago. I'm still working
on the parity logging setup that will avoid inconsistent parity after
a crash, so this is only for testing right now. But, I'd really like
to get it out to a broader audience to hammer out any performance
issues or other problems.
scrub does not yet correct errors on raid5/6 either.
Josef has another pass at fsync performance. The big change here is
to combine waiting for metadata with waiting for data, which is a big
latency win. It is also step one toward using atomics from the
hardware during a commit.
Mark Fasheh has a new way to use btrfs send/receive to send only the
metadata changes. SUSE is using this to make snapper more efficient
at finding changes between snapshosts.
Snapshot-aware defrag is also included.
Otherwise we have a large number of fixes and cleanups. Eric Sandeen
wins the award for removing the most lines, and I'm hoping we steal
this idea from XFS over and over again."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
btrfs: fixup/remove module.h usage as required
Btrfs: delete inline extents when we find them during logging
btrfs: try harder to allocate raid56 stripe cache
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
Btrfs: fix missing deleted items in btrfs_clean_quota_tree
btrfs: use only inline_pages from extent buffer
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
Btrfs: fix wrong reserved space in qgroup during snap/subv creation
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
btrfs: remove a printk from scan_one_device
Btrfs: fix NULL pointer after aborting a transaction
Btrfs: fix memory leak of log roots
Btrfs: copy everything if we've created an inline extent
btrfs: cleanup for open-coded alignment
Btrfs: do not change inode flags in rename
Btrfs: use reserved space for creating a snapshot
clear chunk_alloc flag on retryable failure
...
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send
ioctl code. When this flag is set, the btrfs send code will never write file
data into the stream (thus also avoiding expensive reads of that data in the
first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of
BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in
question.
This patch does not affect the operation of BTRFS_SEND_C_CLONE commands -
they will continue to be sent when a search finds an appropriate extent to
clone from.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.
Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
if (S_ISREG(sctx->cur_inode_mode)) {
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.
Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk. If we're compressed, this isn't
true, and so it was off into extents owned by other files.
It was also improperly handling inline extents. This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent. It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can't do the deleted/reused logic for top/root inodes as it would
create a stream that tries to delete and recreate the root dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We have to ignore inode/space cache objects in send/receive.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We need to pass the root that we determined earlier to iterate_inode_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
The previous check was working fine, but this check should be
easier to read. Also, we could theoritically have some exotic
bugs with the previous checks.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
A leftover from older code and unused now.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Updating send_progress in process_recorded_refs was not correct.
It got updated too early in the cur_inode_new_gen case.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We can't easily use the index of the radix tree for inums as the
radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels,
we now use the lower 32bit of the inum as index and an additional
list to store multiple entries per radix tree entry.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When everything is done, name_cache_free is called which however
forgot to call kfree on the cache entries.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
If we break, we may miss the clone from send_root which we prefer
over all other clones.
Commit is a result of Arne's review.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Don't have a seperate return path for the mentioned case. Now
we do the same "take lowest inode/offset" logic for all found clones.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Make sure to never get in trouble due to the backref_ctx
which was on the stack before.
Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We only added the parent for the new position of a moved dir.
We also need to add the old parent of the moved dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
fs_path_remove is not used at the moment due to a previous patch.
Remove it for now (with #if 0) to avoid compile warnings.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
We missed that check which resultet in all refs with the same name
being reported as first_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
When the current inodes inum is smaller then the inum of the
parent directory strange things were happending due to wrong
path resolution and other bugs. Fix this with a new approach
for the problem.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
On powerpc, we don't get the implicit vmalloc.h include, and as a result
the build fails noisily:
fs/btrfs/send.c: In function 'fs_path_free':
fs/btrfs/send.c:185:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
fs/btrfs/send.c: In function 'fs_path_ensure_buf':
fs/btrfs/send.c:215:4: error: implicit declaration of function 'vmalloc' [-Werror=implicit-function-declaration]
fs/btrfs/send.c:215:12: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:225:12: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:233:13: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c: In function 'iterate_dir_item':
fs/btrfs/send.c:900:10: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:909:11: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c: In function 'btrfs_ioctl_send':
fs/btrfs/send.c:4463:17: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4469:17: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4475:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
fs/btrfs/send.c:4475:20: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4483:21: warning: assignment makes pointer from integer without a cast [enabled by default]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull large btrfs update from Chris Mason:
"This pull request is very large, and the two main features in here
have been under testing/devel for quite a while.
We have subvolume quotas from the strato developers. This enables
full tracking of how many blocks are allocated to each subvolume (and
all snapshots) and you can set limits on a per-subvolume basis. You
can also create quota groups and toss multiple subvolumes into a big
group. It's everything you need to be a web hosting company and give
each user their own subvolume.
The userland side of the quotas is being refreshed, they'll send out
details on where to grab it soon.
Next is the kernel side of btrfs send/receive from Alexander Block.
This leverages the same infrastructure as the quota code to figure out
relationships between blocks and their owners. It can then compute
the difference between two snapshots and sends the diffs in a neutral
format into userland.
The basic model:
create a snapshot
send that snapshot as the initial backup
make changes
create a second snapshot
send the incremental as a backup
delete the first snapshot
(use the second snapshot for the next incremental)
The receive portion is all in userland, and in the 'next' branch of my
btrfs-progs repo.
There's still some work to do in terms of optimizing the send side
from kernel to userland. The really important part is figuring out
how two snapshots are different, and this is where we are
concentrating right now. The initial send of a dataset is a little
slower than tar, but the incremental sends are dramatically faster
than what rsync can do.
On top of all of that, we have a nice queue of fixes, cleanups and
optimizations."
Fix up trivial modify/del conflict in fs/btrfs/ioctl.c
Also fix up semantic conflict in fs/btrfs/send.c: the interface to
dentry_open() changed in commit 765927b2d5 ("switch dentry_open() to
struct path, make it grab references itself"), and since it now grabs
whatever references it needs, we should no longer do the mntget() on the
mnt (and we need to dput() the dentry reference we took).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
Btrfs: uninit variable fixes in send/receive
Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
Btrfs: add btrfs_compare_trees function
Btrfs: introduce subvol uuids and times
Btrfs: make iref_to_path non static
Btrfs: add a barrier before a waitqueue_active check
Btrfs: call the ordered free operation without any locks held
Btrfs: Check INCOMPAT flags on remount and add helper function
Btrfs: add helper for tree enumeration
btrfs: allow cross-subvolume file clone
Btrfs: improve multi-thread buffer read
Btrfs: make btrfs's allocation smoothly with preallocation
Btrfs: lock the transition from dirty to writeback for an eb
Btrfs: fix potential race in extent buffer freeing
Btrfs: don't return true in releasepage unless we actually freed the eb
Btrfs: suppress printk() if all device I/O stats are zero
Btrfs: remove unwanted printk() for btrfs device I/O stats
Btrfs: rewrite BTRFS_SETGET_FUNCS
Btrfs: zero unused bytes in inode item
Btrfs: kill free_space pointer from inode structure
...
Conflicts:
fs/btrfs/ioctl.c
This patch introduces the BTRFS_IOC_SEND ioctl that is
required for send. It allows btrfs-progs to implement
full and incremental sends. Patches for btrfs-progs will
follow.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>