btrfs: ensure pages are unlocked on cow_file_range() failure
There is a hung_task report on zoned btrfs like below.
https://github.com/naota/linux/issues/59
[726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[726.329839] Not tainted 5.16.0-rc1+ #1
[726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000
[726.331608] Call Trace:
[726.331611] <TASK>
[726.331614] __schedule+0x2e5/0x9d0
[726.331622] schedule+0x58/0xd0
[726.331626] io_schedule+0x3f/0x70
[726.331629] __folio_lock+0x125/0x200
[726.331634] ? find_get_entries+0x1bc/0x240
[726.331638] ? filemap_invalidate_unlock_two+0x40/0x40
[726.331642] truncate_inode_pages_range+0x5b2/0x770
[726.331649] truncate_inode_pages_final+0x44/0x50
[726.331653] btrfs_evict_inode+0x67/0x480
[726.331658] evict+0xd0/0x180
[726.331661] iput+0x13f/0x200
[726.331664] do_unlinkat+0x1c0/0x2b0
[726.331668] __x64_sys_unlink+0x23/0x30
[726.331670] do_syscall_64+0x3b/0xc0
[726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae
[726.331677] RIP: 0033:0x7fb9490a171b
[726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[726.331693] </TASK>
While we debug the issue, we found running fstests generic/551 on 5GB
non-zoned null_blk device in the emulated zoned mode also had a
similar hung issue.
Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.
The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the give region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.
Considering the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked on error case.
In summary, cow_file_range now behaves like this:
- page_started == 1 (return value)
- All the pages are unlocked. IO is started.
- unlock == 1
- All the pages except @locked_page are unlocked in any case
- unlock == 0
- On success, all the pages are locked for writing out them
- On failure, all the pages except @locked_page are unlocked
Fixes: 42c0110009
("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit is contained in:
parent
140a8ff765
commit
9ce7466f37
|
@ -1128,6 +1128,28 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
|
|||
* *page_started is set to one if we unlock locked_page and do everything
|
||||
* required to start IO on it. It may be clean and already done with
|
||||
* IO when we return.
|
||||
*
|
||||
* When unlock == 1, we unlock the pages in successfully allocated regions.
|
||||
* When unlock == 0, we leave them locked for writing them out.
|
||||
*
|
||||
* However, we unlock all the pages except @locked_page in case of failure.
|
||||
*
|
||||
* In summary, page locking state will be as follow:
|
||||
*
|
||||
* - page_started == 1 (return value)
|
||||
* - All the pages are unlocked. IO is started.
|
||||
* - Note that this can happen only on success
|
||||
* - unlock == 1
|
||||
* - All the pages except @locked_page are unlocked in any case
|
||||
* - unlock == 0
|
||||
* - On success, all the pages are locked for writing out them
|
||||
* - On failure, all the pages except @locked_page are unlocked
|
||||
*
|
||||
* When a failure happens in the second or later iteration of the
|
||||
* while-loop, the ordered extents created in previous iterations are kept
|
||||
* intact. So, the caller must clean them up by calling
|
||||
* btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
|
||||
* example.
|
||||
*/
|
||||
static noinline int cow_file_range(struct btrfs_inode *inode,
|
||||
struct page *locked_page,
|
||||
|
@ -1137,6 +1159,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
|
|||
struct btrfs_root *root = inode->root;
|
||||
struct btrfs_fs_info *fs_info = root->fs_info;
|
||||
u64 alloc_hint = 0;
|
||||
u64 orig_start = start;
|
||||
u64 num_bytes;
|
||||
unsigned long ram_size;
|
||||
u64 cur_alloc_size = 0;
|
||||
|
@ -1324,18 +1347,44 @@ out_reserve:
|
|||
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
|
||||
btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
|
||||
out_unlock:
|
||||
/*
|
||||
* Now, we have three regions to clean up:
|
||||
*
|
||||
* |-------(1)----|---(2)---|-------------(3)----------|
|
||||
* `- orig_start `- start `- start + cur_alloc_size `- end
|
||||
*
|
||||
* We process each region below.
|
||||
*/
|
||||
|
||||
clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
|
||||
EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
|
||||
page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
|
||||
|
||||
/*
|
||||
* If we reserved an extent for our delalloc range (or a subrange) and
|
||||
* failed to create the respective ordered extent, then it means that
|
||||
* when we reserved the extent we decremented the extent's size from
|
||||
* the data space_info's bytes_may_use counter and incremented the
|
||||
* space_info's bytes_reserved counter by the same amount. We must make
|
||||
* sure extent_clear_unlock_delalloc() does not try to decrement again
|
||||
* the data space_info's bytes_may_use counter, therefore we do not pass
|
||||
* it the flag EXTENT_CLEAR_DATA_RESV.
|
||||
* For the range (1). We have already instantiated the ordered extents
|
||||
* for this region. They are cleaned up by
|
||||
* btrfs_cleanup_ordered_extents() in e.g,
|
||||
* btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
|
||||
* already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
|
||||
* EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
|
||||
* function.
|
||||
*
|
||||
* However, in case of unlock == 0, we still need to unlock the pages
|
||||
* (except @locked_page) to ensure all the pages are unlocked.
|
||||
*/
|
||||
if (!unlock && orig_start < start)
|
||||
extent_clear_unlock_delalloc(inode, orig_start, start - 1,
|
||||
locked_page, 0, page_ops);
|
||||
|
||||
/*
|
||||
* For the range (2). If we reserved an extent for our delalloc range
|
||||
* (or a subrange) and failed to create the respective ordered extent,
|
||||
* then it means that when we reserved the extent we decremented the
|
||||
* extent's size from the data space_info's bytes_may_use counter and
|
||||
* incremented the space_info's bytes_reserved counter by the same
|
||||
* amount. We must make sure extent_clear_unlock_delalloc() does not try
|
||||
* to decrement again the data space_info's bytes_may_use counter,
|
||||
* therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
|
||||
*/
|
||||
if (extent_reserved) {
|
||||
extent_clear_unlock_delalloc(inode, start,
|
||||
|
@ -1347,6 +1396,13 @@ out_unlock:
|
|||
if (start >= end)
|
||||
goto out;
|
||||
}
|
||||
|
||||
/*
|
||||
* For the range (3). We never touched the region. In addition to the
|
||||
* clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
|
||||
* space_info's bytes_may_use counter, reserved in
|
||||
* btrfs_check_data_free_space().
|
||||
*/
|
||||
extent_clear_unlock_delalloc(inode, start, end, locked_page,
|
||||
clear_bits | EXTENT_CLEAR_DATA_RESV,
|
||||
page_ops);
|
||||
|
|
Loading…
Reference in New Issue