I've noticed an extra line after "use no compression", but search
revealed much more in messages of more critical levels and rare errors.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
use the newer and more pleasant kstrtoull() to replace simple_strtoull(),
because simple_strtoull() is marked for obsoletion.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
Seeding device support allows us to create a new filesystem
based on existed filesystem.
However newly created filesystem's @total_devices should include seed
devices. This patch fix the following problem:
# mkfs.btrfs -f /dev/sdb
# btrfstune -S 1 /dev/sdb
# mount /dev/sdb /mnt
# btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1
# umount /mnt
# mount /dev/sdc /mnt --->fs_devices->total_devices = 2
This is because we record right @total_devices in superblock, but
@fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout().
Fix this problem by not resetting @fs_devices->total_devices.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Even CONFIG_BTRFS_FS_POSIX_ACL is not defined, the acl still could
been enabled using a mount option, and now fs/btrfs/acl.o is not
built, so the mount options will appear to be supported but will
be silently ignored.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently qgroups account for space by intercepting delayed ref updates to fs
trees. It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly. The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together. Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc. This patch
accomplishes this by only adding qgroup operations for real ref changes. We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time. This patch encompasses a bunch of
architectural changes
1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.
2) tree mod seq: we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.
3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence. This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point. This allows us to merge delayed refs during runtime.
With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Same as normal devices, seed devices should be initialized with
fs_info->dev_root as well, otherwise we'll get a NULL pointer crash.
Cc: Chris Murphy <lists@colorremedies.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
To ease finding bugs during development related to modifying btree leaves
in such a way that it makes its items not sorted by key anymore. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When the csum tree is empty, our leaf (path->nodes[0]) has a number
of items equal to 0 and since btrfs_header_nritems() returns an
unsigned integer (and so is our local nritems variable) the following
comparison always evaluates to false:
if (path->slots[0] >= nritems - 1) {
As the casting rules lead to:
if ((u32)0 >= (u32)4294967295) {
This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf
some lines below:
btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
found_key.type != BTRFS_EXTENT_CSUM_KEY) {
found_next = 1;
goto insert;
}
So just don't access such non-existent slot and don't set found_next to 1
when the tree is empty. It's very unlikely we'll get a random key with the
objectid and type values above, which is where we could go into trouble.
If nritems is 0, just set found_next to 1 anyway as it will make us insert
a csum item covering our whole extent (or the whole leaf) when the tree is
empty.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
In close_ctree(), after we have stopped all workers,there maybe still
some read requests(for example readahead) to submit and this *maybe* trigger
an oops that user reported before:
kernel BUG at fs/btrfs/async-thread.c:619!
By hacking codes, i can reproduce this problem with one cpu available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.
Thanks to Miao for pointing out this problem.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
In btrfs_create_tree(), if btrfs_insert_root() fails, we should
free root->commit_root.
Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
posix_acl_xattr_set() already does the check, and it's the only
way to feed in an ACL from userspace.
So the check here is useless, remove it.
Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
This fix will ensure all SB copies on the disk is zeroed
when the disk is intentionally removed. This helps to
better manage disks in the user land.
This version of patch also merges the Zach patch as below.
btrfs: don't double brelse on device rm
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is a continuation of the previous changes titled:
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
There's a few more cases where a directory rename/move must be delayed which was
previously overlooked. If our immediate ancestor has a lower inode number than
ours and it doesn't have a delayed rename/move operation associated to it, it
doesn't mean there isn't any non-direct ancestor of our current inode that needs
to be renamed/moved before our current inode (i.e. with a higher inode number
than ours).
So we can't stop the search if our immediate ancestor has a lower inode number than
ours, we need to navigate the directory hierarchy upwards until we hit the root or:
1) find an ancestor with an higher inode number that was renamed/moved in the send
root too (or already has a pending rename/move registered);
2) find an ancestor that is a new directory (higher inode number than ours and
exists only in the send root).
Reproducer for case 1)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir -p /mnt/a/c/d
$ mkdir /mnt/a/b/e
$ mkdir /mnt/a/c/d/f
$ mv /mnt/a/b /mnt/a/c/d/2b
$ mkdir /mnt/a/x
$ mkdir /mnt/a/y
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x /mnt/a/y
$ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
$ mv /mnt/a/c/d /mnt/a/h/2d
$ mv /mnt/a/c /mnt/a/h/2d/2b/2c
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
Simple reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/a/c
$ mv /mnt/a/b /mnt/a/c/b2
$ mkdir /mnt/a/e
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/c/b2 /mnt/a/e/b3
$ mkdir /mnt/a/e/b3/f
$ mkdir /mnt/a/h
$ mv /mnt/a/c /mnt/a/e/b3/f/c2
$ mv /mnt/a/e /mnt/a/h/e2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
Another simple reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/a/c
$ mkdir /mnt/a/b/d
$ mkdir /mnt/a/c/e
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/b/d/f
$ mkdir /mnt/a/b/g
$ mv /mnt/a/c/e /mnt/a/b/g/e2
$ mv /mnt/a/c /mnt/a/b/d/f/c2
$ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
More complex reproducer for case 2)
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b
$ mkdir -p /mnt/a/c/d
$ mkdir /mnt/a/b/e
$ mkdir /mnt/a/c/d/f
$ mv /mnt/a/b /mnt/a/c/d/2b
$ mkdir /mnt/a/x
$ mkdir /mnt/a/y
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x /mnt/a/y
$ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
$ mv /mnt/a/c/d /mnt/a/h/2d
$ mv /mnt/a/c /mnt/a/h/2d/2b/2c
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
For both cases the incremental send would enter an infinite loop when building
path strings.
While solving these cases, this change also re-implements the code to detect
when directory moves/renames should be delayed. Instead of dealing with several
specific cases separately, it's now more generic handling all cases with a simple
detection algorithm and if when applying a delayed move/rename there's a path loop
detected, it further delays the move/rename registering a new ancestor inode as
the dependency inode (so our rename happens after that ancestor is renamed).
Tests for these cases is being added to xfstests too.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we have directories with a pending move/rename operation, we must take into
account any orphan directories that got created before executing the pending
move/rename. Those orphan directories are directories with an inode number higher
then the current send progress and that don't exist in the parent snapshot, they
are created before current progress reaches their inode number, with a generated
name of the form oN-M-I and at the root of the filesystem tree, and later when
progress matches their inode number, moved/renamed to their final location.
Reproducer:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/e
$ mv /mnt/a/b/c /mnt/a/b/e/CC
$ mkdir /mnt/a/b/e/CC/d/f
$ mkdir /mnt/a/g
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/g/h
$ mv /mnt/a/b/e /mnt/a/g/h/EE
$ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The second receive command failed with the following error:
ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Regardless of whether the caller is interested or not in knowing the inode's
generation (dir_gen != NULL), get_first_ref always does a btree lookup to get
the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which
is in some cases).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
For RAID0,5,6,10,
For system chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE
For data/meta chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds a leaf.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For system chunk array,
We copy a "disk_key" and an chunk item each time,
so there should be enough space to hold both of them,
not only the chunk item.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Current btrfs_orphan_cleanup will also cleanup roots which is already in
fs_info->dead_roots without protection.
This will have conditional race with fs_info->cleaner_kthread.
This patch will use refs in root->root_item to detect roots in
dead_roots and avoid conflicts.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.
So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail to load a free space cache, we can rebuild it from the extent tree,
so it is not a serious error, we should not output a error message that
would make the users uncomfortable. This patch uses warning message instead
of it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs will send uevent to udev inform the device change,
but ctime/mtime for the block device inode is not udpated, which cause
libblkid used by btrfs-progs unable to detect device change and use old
cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an
error message.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The patch "Btrfs: fix protection between send and root deletion"
(18f687d538) does not actually prevent to delete the snapshot
and just takes care during background cleaning, but this seems rather
user unfriendly, this patch implements the idea presented in
http://www.spinics.net/lists/linux-btrfs/msg30813.html
- add an internal root_item flag to denote a dead root
- check if the send_in_progress is set and refuse to delete, otherwise
set the flag and proceed
- check the flag in send similar to the btrfs_root_readonly checks, for
all involved roots
The root lookup in send via btrfs_read_fs_root_no_name will check if the
root is really dead or not. If it is, ENOENT, aborted send. If it's
alive, it's protected by send_in_progress, send can continue.
CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This implements the tmpfile callback of struct inode_operations, introduced
in the linux kernel 3.11, and implemented already by some filesystems. This
callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
system call.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
This ioctl provides basic info about the filesystem that can be obtained
in other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This ioctl provides basic info about the devices that can be obtained in
other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Similar to the FS_INFO updates, export the basic filesystem info through
sysfs: node size, sector size and clone alignment.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Provide the basic information about filesystem through the ioctl:
* b-tree node size (same as leaf size)
* sector size
* expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments
Backward compatibility: if the values are 0, kernel does not provide
this information, the applications should ignore them.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This started as debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.
In my case the usage filter was not finegrained enough and led to
converting too many chunks at once. Another example use is in connection
with drange+devid or vrange filters that allow to work with a specific
chunk or even with a chunk on a given device.
The limit filter applies last, the value of 0 means no limiting.
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
While running a stress test with multiple threads writing to the same btrfs
file system, I ended up with a situation where a leaf was corrupted in that
it had 2 file extent item keys that had the same exact key. I was able to
detect this quickly thanks to the following patch which triggers an assertion
as soon as a leaf is marked dirty if there are duplicated keys or out of order
keys:
Btrfs: check if items are ordered when a leaf is marked dirty
(https://patchwork.kernel.org/patch/3955431/)
Basically while running the test, I got the following in dmesg:
[28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
(...)
[28877.415917] Call Trace:
[28877.415922] [<ffffffff816f1189>] dump_stack+0x4e/0x68
[28877.415926] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
[28877.415929] [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
[28877.415944] [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
[28877.415949] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
[28877.415962] [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
[28877.415972] [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
[28877.415984] [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
(...)
[29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
[29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
(...)
[29854.132637] item 23 key (3486 108 667648) itemoff 2694 itemsize 53
[29854.132638] extent data disk bytenr 14574411776 nr 286720
[29854.132639] extent data offset 0 nr 286720 ram 286720
[29854.132640] item 24 key (3486 108 954368) itemoff 2641 itemsize 53
[29854.132641] extent data disk bytenr 0 nr 0
[29854.132643] extent data offset 0 nr 0 ram 0
[29854.132644] item 25 key (3486 108 954368) itemoff 2588 itemsize 53
[29854.132645] extent data disk bytenr 8699670528 nr 77824
[29854.132646] extent data offset 0 nr 77824 ram 77824
[29854.132647] item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
[29854.132648] extent data disk bytenr 8699670528 nr 77824
[29854.132649] extent data offset 0 nr 77824 ram 77824
(...)
[29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
(...)
[29854.132771] Call Trace:
[29854.132779] [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
[29854.132791] [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
[29854.132794] [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
[29854.132797] [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
[29854.132800] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
[29854.132810] [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
[29854.132820] [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
[29854.132830] [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
(...)
So this is caused by getting an -ENOSPC error while punching a file hole, more
specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
or more file extent items due to lack of enough free space. When this happens,
in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
block reservation object to root->fs_info->trans_block_rsv, end our transaction and
start a new transaction basically - and, we keep increasing our current offset
(cur_offset) as long as it's smaller than the end of the target range (lockend) -
this makes use leave the loop with cur_offset == drop_end which in turn makes us
call fill_holes() for inserting a file extent item that represents a 0 bytes range
hole (and this insertion succeeds, as in the meanwhile more space became available).
This 0 bytes file hole extent item is a problem because any subsequent caller of
__btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
start file offset that is equal to the offset of the hole, will not remove this
extent item due to the following conditional in the while loop of
__btrfs_drop_extents:
if (extent_end <= search_start) {
path->slots[0]++;
goto next_slot;
}
This later makes the call to setup_items_for_insert() (at the very end of
__btrfs_drop_extents), insert a new file extent item with the same offset as
the 0 bytes file hole extent item that follows it. Needless is to say that this
causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
where we perform leaf sanity checks or in subsequent operations that manipulate
file extent items, as in the fallocate call as shown by the dmesg trace above.
Without my other patch to perform the leaf sanity checks once a leaf is marked
as dirty (if the integrity checker is enabled), it would have been much harder
to debug this issue.
This change might fix a few similar issues reported by users in the mailing
list regarding assertion failures in btrfs_set_item_key_safe calls performed
by __btrfs_drop_extents, such as the following report:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938
Asking fill_holes() to create a 0 bytes wide file hole item also produced the
first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
that has an end smaller (by -1) than its start.
On 3.14 kernels this issue manifests itself through leaf corruption, as we get
duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
instead we have callers of __btrfs_drop_extents(), namely the functions
inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
btrfs_insert_empty_item() to insert the new file extent item, which would fail with
error -EEXIST, instead of inserting a duplicated key - which is still a serious
issue as it would make all similar file extent item replace operations keep
failing if they target the same file range.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
'bio_index' is just a index, it's really not necessary to do increment
one by one.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
In a previous change, commit 12870f1c9b,
I accidentally moved the roundup of inode->i_size to outside of the
critical section delimited by the inode mutex, which is not atomic and
not correct since the size can be changed by other task before we acquire
the mutex. Therefore fix it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
iput() already checks for the inode being NULL, thus it's unnecessary to
check before calling.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Chris Mason <clm@fb.com>
uncompress_inline() is dropping the error from btrfs_decompress() after
testing it and zeroing the page that was supposed to hold decompressed
data. This can silently turn compressed inline data in to zeros if
decompression fails due to corrupt compressed data or memory allocation
failure.
I verified this by manually forcing the error from btrfs_decompress()
for a silly named copy of od:
if (!strcmp(current->comm, "failod"))
ret = -ENOMEM;
# od -x /mnt/btrfs/dir/80 | head -1
0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
# echo 3 > /proc/sys/vm/drop_caches
# cp $(which od) /tmp/failod
# /tmp/failod -x /mnt/btrfs/dir/80 | head -1
0000000 0000 0000 0000 0000 0000 0000 0000 0000
The fix is to pass the error to its caller. Which still has a BUG_ON().
So we fix that too.
There seems to be no reason for the zeroing of the page on the error
from btrfs_decompress() but not from the allocation error a few lines
above. So the page zeroing is removed.
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The btrfs compression wrappers translated errors from workspace
allocation to either -ENOMEM or -1. The compression type workspace
allocators are already returning a ERR_PTR(-ENOMEM). Just return that
and get rid of the magical -1.
This helps a future patch return errors from the compression wrappers.
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The compression layer seems to have been built to return -1 and have
callers make up errors that make sense. This isn't great because there
are different errors that originate down in the compression layer.
Let's return real negative errnos from the compression layer so that
callers can pass on the error without having to guess what happened.
ENOMEM for allocation failure, E2BIG when compression exceeds the
uncompressed input, and EIO for everything else.
This helps a future path return errors from btrfs_decompress().
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
This issue was not causing any harm but IMO (and in the opinion of the
static code checker) it is better to propagate this error status upwards.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:
[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
[ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)
It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.
This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:
mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
fsstress -p 6 -d /mnt -n 100000 -x \
"btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
-f allocsp=0 \
-f bulkstat=0 \
-f bulkstat1=0 \
-f chown=0 \
-f creat=1 \
-f dread=0 \
-f dwrite=0 \
-f fallocate=1 \
-f fdatasync=0 \
-f fiemap=0 \
-f freesp=0 \
-f fsync=0 \
-f getattr=0 \
-f getdents=0 \
-f link=0 \
-f mkdir=0 \
-f mknod=0 \
-f punch=1 \
-f read=0 \
-f readlink=0 \
-f rename=0 \
-f resvsp=0 \
-f rmdir=0 \
-f setxattr=0 \
-f stat=0 \
-f symlink=0 \
-f sync=0 \
-f truncate=1 \
-f unlink=0 \
-f unresvsp=0 \
-f write=4
So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
If a path has more than 230 characters, we allocate a new buffer to
use for the path, but we were forgotting to copy the contents of the
previous buffer into the new one, which has random content from the
kmalloc call.
Test:
mkfs.btrfs -f /dev/sdd
mount /dev/sdd /mnt
TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#"
mkdir -p $TEST_PATH
echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
btrfs send /mnt/mysnap1 -f /tmp/1.snap
A test for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Cc: Marc Merlin <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Chris Mason <clm@fb.com>
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull two btrfs fixes from Chris Mason:
"This has two fixes that we've been testing for 3.16, but since both
are safe and fix real bugs, it makes sense to send for 3.15 instead"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: send, fix incorrect ref access when using extrefs
Btrfs: fix EIO on reading file after ioctl clone works on it
For inline data extent, we need to make its length aligned, otherwise,
we can get a phantom extent map which confuses readpages() to return -EIO.
This can be detected by xfstests/btrfs/035.
Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.
Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: limit the path size in send to PATH_MAX
Btrfs: correctly set profile flags on seqlock retry
Btrfs: use correct key when repeating search for extent item
Btrfs: fix inode caching vs tree log
Btrfs: fix possible memory leaks in open_ctree()
Btrfs: avoid triggering bug_on() when we fail to start inode caching task
Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
btrfs: replace error code from btrfs_drop_extents
btrfs: Change the hole range to a more accurate value.
btrfs: fix use-after-free in mount_subvol()
fs_path_ensure_buf is used to make sure our path buffers for
send are big enough for the path names as we construct them.
The buffer size is limited to 32K by the length field in
the struct.
But bugs in the path construction can end up trying to build
a huge buffer, and we'll do invalid memmmoves when the
buffer length field wraps.
This patch is step one, preventing the overflows.
Signed-off-by: Chris Mason <clm@fb.com>
If we had to retry on the profiles seqlock (due to a concurrent write), we
would set bits on the input flags that corresponded both to the current
profile and to previous values of the profile.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
If skinny metadata is enabled and our first tree search fails to find a
skinny extent item, we may repeat a tree search for a "fat" extent item
(if the previous item in the leaf is not the "fat" extent we're looking
for). However we were not setting the new key's objectid to the right
value, as we previously used the same key variable to peek at the previous
item in the leaf, which has a different objectid. So just set the right
objectid to avoid modifying/deleting a wrong item if we repeat the tree
search.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, with inode cache enabled, we will reuse its inode id immediately
after unlinking file, we may hit something like following:
|->iput inode
|->return inode id into inode cache
|->create dir,fsync
|->power off
An easy way to reproduce this problem is:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt -o inode_cache,commit=100
dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync
inode_id=`ls -i /mnt/data | awk '{print $1}'`
rm -f /mnt/data
i=1
while [ 1 ]
do
mkdir /mnt/dir_$i
test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'`
if [ $test1 -eq $inode_id ]
then
dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync
echo b > /proc/sysrq-trigger
fi
sleep 1
i=$(($i+1))
done
mount /dev/sdb /mnt
umount /dev/sdb
btrfs check /dev/sdb
We fix this problem by adding unlinked inode's id into pinned tree,
and we can not reuse them until committing transaction.
Cc: stable@vger.kernel.org
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Fix possible memory leaks in the following error handling paths:
read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
There's a case which clone does not handle and used to BUG_ON instead,
(testcase xfstests/btrfs/035), now returns EINVAL. This error code is
confusing to the ioctl caller, as it normally signifies errorneous
arguments.
Change it to ENOPNOTSUPP which allows a fall back to copy instead of
clone. This does not affect the common reflink operation.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 3ac0d7b96a fixed the btrfs expanding
write problem but the hole punched is sometimes too large for some
iovec, which has unmapped data ranges.
This patch will change to hole range to a more accurate value using the
counts checked by the write check routines.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pointer 'newargs' is used after the memory that it points to has already
been freed.
Picked up by Coverity - CID 1201425.
Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.
Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type
Since @rot is an int type, we should not use do_div(), fix it.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Given the following /etc/fstab entries:
/dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
/dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0
you can't issue:
$ mount /mnt/foo
$ mount /mnt/bar
You would have to do:
$ mount /mnt/foo
$ mount -o remount,rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
$ mount /mnt/bar
or
$ mount /mnt/bar
$ mount --rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
With this patch you can do
$ mount /mnt/foo
$ mount /mnt/bar
$ cat /proc/self/mountinfo
49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache
Signed-off-by: Chris Mason <clm@fb.com>
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e806) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.
The unpatched userspace tools will show the blockgroup as 'unknown'.
CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Reproducer:
mount /dev/ubda /mnt
mount -oremount,thread_pool=42 /mnt
Gives a crash:
? btrfs_workqueue_set_max+0x0/0x70
btrfs_resize_thread_pool+0xe3/0xf0
? sync_filesystem+0x0/0xc0
? btrfs_resize_thread_pool+0x0/0xf0
btrfs_remount+0x1d2/0x570
? kern_path+0x0/0x80
do_remount_sb+0xd9/0x1c0
do_mount+0x26a/0xbf0
? kfree+0x0/0x1b0
SyS_mount+0xc4/0x110
It's a call
btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
fs_info->scrub_wr_completion_workers = NULL;
as scrub wqs get created only on user's demand.
Patch skips not-created-yet workqueues.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it. Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it. This adds it to the
caller for inline extents, which is where we really need it.
Signed-off-by: Chris Mason <clm@fb.com>
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
# mount /dev/sda8 /mnt
# btrfs scrub start -BR /mnt
# echo $? <--unverified errors make return value be 3
This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().
This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.
This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.
An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Originally following cmds will work:
# btrfs fi resize -10A <mnt>
# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The error handling was copy and pasted from memdup_user(). It should be
checking for NULL obviously.
Fixes: abccd00f8a ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running fsstress and snapshots concurrently, we will hit something
like followings:
Thread 1 Thread 2
|->fallocate
|->write pages
|->join transaction
|->add ordered extent
|->end transaction
|->flushing data
|->creating pending snapshots
|->write data into src root's
fallocated space
After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.
Fix this problem by syncing snapshots with nocow writting:
1.for nocow writting,if there are pending snapshots, we will
fall into COW way.
2.if there are pending nocow writes, snapshots for this root
will be blocked until nocow writting finish.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When testing fsstress with snapshot making background, some snapshot
following problem.
Snapshot 270:
inode 323: size 0
Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0 122880 172032 349145
Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0 122880 172032 349145
The fsstress operation on inode 323 is the following:
write: offset 126832 len 43124
truncate: size 349145
Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.
To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Print the message only when the device is seen for the first time.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&found->groups_sem){+++++.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&found->groups_sem);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/169:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
#1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>