// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2007 Oracle. All rights reserved.
 * Copyright (C) 2014 Fujitsu. All rights reserved.
 */

#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/freezer.h>
#include "async-thread.h"
#include "ctree.h"

enum {
	WORK_DONE_BIT,
	WORK_ORDER_DONE_BIT,
};

#define NO_THRESHOLD (-1)
#define DFT_THRESHOLD (32)

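/*
 * Illustrative sketch, not part of the kernel source: a single-threaded
 * userspace model of how the two flag bits above can keep completion order
 * intact. Items may finish in any order, but order-sensitive processing only
 * runs from the head of the queue. All sketch_* names and MAX_ITEMS are
 * hypothetical; locking, callbacks and freeing are deliberately omitted.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 8

/* Hypothetical stand-in for a work item: only an id and the two flags. */
struct sketch_work {
	int id;
	bool done;		/* mirrors WORK_DONE_BIT */
	bool order_done;	/* mirrors WORK_ORDER_DONE_BIT */
};

/* Hypothetical FIFO standing in for the ordered list. */
struct sketch_queue {
	struct sketch_work *items[MAX_ITEMS];
	size_t head;
	size_t tail;
	int completed[MAX_ITEMS];	/* ids in the order ordered work ran */
	size_t ncompleted;
};

/* Append an item to the tail of the queue with both flags cleared. */
static void sketch_enqueue(struct sketch_queue *q, struct sketch_work *w)
{
	assert(q->tail < MAX_ITEMS);
	w->done = false;
	w->order_done = false;
	q->items[q->tail++] = w;
}

/* Mark one item as finished, then drain every finished item at the head. */
static void sketch_finish(struct sketch_queue *q, struct sketch_work *w)
{
	w->done = true;
	while (q->head < q->tail) {
		struct sketch_work *front = q->items[q->head];

		if (!front->done)
			break;	/* head not finished: later items must wait */
		front->order_done = true;
		q->completed[q->ncompleted++] = front->id;
		q->head++;
	}
}
```

/*
 * In this model, enqueueing items 1, 2, 3 and finishing them in the order
 * 3, 1, 2 still records the ordered completions as 1, 2, 3.
 */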
struct btrfs_workqueue {
	struct workqueue_struct *normal_wq;

	/* File system this workqueue services */
	struct btrfs_fs_info *fs_info;

	/* List head pointing to ordered work list */
	struct list_head ordered_list;

	/* Spinlock for ordered_list */
	spinlock_t list_lock;

	/* Thresholding related variables */
	atomic_t pending;

	/* Upper limit of concurrent workers */
	int limit_active;

	/* Current number of concurrent workers */
	int current_active;

	/* Threshold to change current_active */
	int thresh;
	unsigned int count;
	spinlock_t thres_lock;
};

struct btrfs_fs_info * __pure btrfs_workqueue_owner(const struct btrfs_workqueue *wq)
{
	return wq->fs_info;
}

struct btrfs_fs_info * __pure btrfs_work_owner(const struct btrfs_work *work)
{
	return work->wq->fs_info;
}

bool btrfs_workqueue_normal_congested(const struct btrfs_workqueue *wq)
{
	/*
	 * We could compare wq->pending with num_online_cpus()
	 * to support "thresh == NO_THRESHOLD" case, but it requires
	 * moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
	 * postpone it until someone needs the support of that case.
	 */
	if (wq->thresh == NO_THRESHOLD)
		return false;

	return atomic_read(&wq->pending) > wq->thresh * 2;
}

struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
					      const char *name, unsigned int flags,
					      int limit_active, int thresh)
{
	struct btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_KERNEL);

	if (!ret)
		return NULL;

	ret->fs_info = fs_info;
	ret->limit_active = limit_active;
	atomic_set(&ret->pending, 0);
	if (thresh == 0)
		thresh = DFT_THRESHOLD;
	/* For low threshold, disabling threshold is a better choice */
	if (thresh < DFT_THRESHOLD) {
		ret->current_active = limit_active;
		ret->thresh = NO_THRESHOLD;
	} else {
		/*
		 * For threshold-able wq, let its concurrency grow on demand.
		 * Use minimal max_active at alloc time to reduce resource
		 * usage.
		 */
		ret->current_active = 1;
		ret->thresh = thresh;
	}

	ret->normal_wq = alloc_workqueue("btrfs-%s", flags, ret->current_active,
					 name);
	if (!ret->normal_wq) {
		kfree(ret);
		return NULL;
	}

	INIT_LIST_HEAD(&ret->ordered_list);
	spin_lock_init(&ret->list_lock);
	spin_lock_init(&ret->thres_lock);
	trace_btrfs_workqueue_alloc(ret, name);
	return ret;
}

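/*
 * Illustrative sketch, not part of the kernel source: a userspace model of
 * how btrfs_alloc_workqueue() normalizes its thresh argument and how the
 * congestion check then interprets the result. The sketch_* names and the
 * SKETCH_* constants are hypothetical mirrors of the definitions in this
 * file, and the assert on limit_active is an assumption of the sketch only.
 */

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NO_THRESHOLD	(-1)
#define SKETCH_DFT_THRESHOLD	32

/* Hypothetical result type: the two fields the allocation path decides. */
struct sketch_limits {
	int thresh;
	int current_active;
};

/* Mirror of the threshold selection done at allocation time. */
static struct sketch_limits sketch_pick_limits(int limit_active, int thresh)
{
	struct sketch_limits l;

	assert(limit_active >= 1);	/* assumption made by this sketch */
	if (thresh == 0)
		thresh = SKETCH_DFT_THRESHOLD;
	if (thresh < SKETCH_DFT_THRESHOLD) {
		/* Low threshold: disable thresholding, start at full width. */
		l.current_active = limit_active;
		l.thresh = SKETCH_NO_THRESHOLD;
	} else {
		/* Threshold-able queue: start with one worker and grow later. */
		l.current_active = 1;
		l.thresh = thresh;
	}
	return l;
}

/* Mirror of the congestion check: more than twice the threshold pending. */
static bool sketch_congested(struct sketch_limits l, int pending)
{
	if (l.thresh == SKETCH_NO_THRESHOLD)
		return false;
	return pending > l.thresh * 2;
}
```

/*
 * In this model a thresh of 0 selects the default of 32, any value below 32
 * disables thresholding, and a queue with thresholding disabled is never
 * reported as congested.
 */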
/*
 * Hook for threshold which will be called in btrfs_queue_work.
 * This hook WILL be called in IRQ handler context,
 * so workqueue_set_max_active MUST NOT be called in this hook
 */
static inline void thresh_queue_hook(struct btrfs_workqueue *wq)
{
	if (wq->thresh == NO_THRESHOLD)
		return;
	atomic_inc(&wq->pending);
}

/*
 * Hook for threshold which will be called before executing the work.
 * This hook is called in kthread context,
 * so workqueue_set_max_active is called here.
 */
static inline void thresh_exec_hook(struct btrfs_workqueue *wq)
{
	int new_current_active;
	long pending;
	int need_change = 0;

	if (wq->thresh == NO_THRESHOLD)
		return;

	atomic_dec(&wq->pending);
	spin_lock(&wq->thres_lock);
	/*
	 * Use wq->count to limit the calling frequency of
	 * workqueue_set_max_active.
	 */
	wq->count++;
	wq->count %= (wq->thresh / 4);
	if (!wq->count)
		goto out;
	new_current_active = wq->current_active;

	/*
	 * pending may be changed later, but it's OK since we really
	 * don't need it to be that accurate to calculate
	 * new_current_active.
	 */
	pending = atomic_read(&wq->pending);
	if (pending > wq->thresh)
		new_current_active++;
	if (pending < wq->thresh / 2)
		new_current_active--;
	new_current_active = clamp_val(new_current_active, 1, wq->limit_active);
	if (new_current_active != wq->current_active) {
		need_change = 1;
		wq->current_active = new_current_active;
	}
out:
	spin_unlock(&wq->thres_lock);

	if (need_change) {
		workqueue_set_max_active(wq->normal_wq, wq->current_active);
	}
}

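/*
 * Illustrative sketch, not part of the kernel source: a lock-free,
 * single-threaded userspace model of the adjustment step performed by
 * thresh_exec_hook(). The sketch_* names are hypothetical, pending is passed
 * in instead of being read from an atomic counter, and the return value
 * stands in for need_change. It follows the control flow as written above,
 * including the skip on the call where the counter wraps to zero.
 */

```c
#include <assert.h>

/* Hypothetical stand-in for the thresholding fields of the workqueue. */
struct sketch_wq {
	int thresh;		/* assumed >= 4 so that thresh / 4 is non-zero */
	int limit_active;
	int current_active;
	unsigned int count;
};

static int sketch_clamp(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Returns 1 when current_active changed, 0 otherwise. */
static int sketch_exec_step(struct sketch_wq *wq, long pending)
{
	int next;

	assert(wq->thresh >= 4);
	wq->count++;
	wq->count %= (unsigned int)(wq->thresh / 4);
	if (!wq->count)
		return 0;	/* counter wrapped: no adjustment on this call */

	next = wq->current_active;
	if (pending > wq->thresh)
		next++;		/* backlog above the threshold: grow */
	if (pending < wq->thresh / 2)
		next--;		/* backlog below half the threshold: shrink */
	next = sketch_clamp(next, 1, wq->limit_active);
	if (next == wq->current_active)
		return 0;
	wq->current_active = next;
	return 1;
}
```

/*
 * With thresh = 32 and limit_active = 3, repeated calls with a large backlog
 * raise current_active from 1 to 3 and then hold it there; calls with an
 * empty backlog lower it again, never below 1.
 */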
static void run_ordered_work(struct btrfs_workqueue *wq,
			     struct btrfs_work *self)
{
	struct list_head *list = &wq->ordered_list;
	struct btrfs_work *work;
	spinlock_t *lock = &wq->list_lock;
	unsigned long flags;
	bool free_self = false;

	while (1) {
		spin_lock_irqsave(lock, flags);
		if (list_empty(list))
			break;
		work = list_entry(list->next, struct btrfs_work,
				  ordered_list);
		if (!test_bit(WORK_DONE_BIT, &work->flags))
			break;
		/*
		 * Orders all subsequent loads after reading WORK_DONE_BIT,
		 * paired with the smp_mb__before_atomic in btrfs_work_helper.
		 * This guarantees that the ordered function will see all
		 * updates from the ordinary work function.
		 */
		smp_rmb();

		/*
		 * We are going to call the ordered done function, but
		 * we leave the work item on the list as a barrier so
		 * that later work items that are done don't have their
		 * functions called before this one returns.
		 */
		if (test_and_set_bit(WORK_ORDER_DONE_BIT, &work->flags))
			break;
		trace_btrfs_ordered_sched(work);
		spin_unlock_irqrestore(lock, flags);
		work->ordered_func(work);

		/* Now take the lock again and drop our item from the list */
		spin_lock_irqsave(lock, flags);
		list_del(&work->ordered_list);
		spin_unlock_irqrestore(lock, flags);

		if (work == self) {
			/*
			 * This is the work item that the worker is currently
			 * executing.
			 *
			 * The kernel workqueue code guarantees non-reentrancy
			 * of work items. I.e., if a work item with the same
			 * address and work function is queued twice, the second
			 * execution is blocked until the first one finishes. A
			 * work item may be freed and recycled with the same
			 * work function; the workqueue code assumes that the
			 * original work item cannot depend on the recycled work
			 * item in that case (see find_worker_executing_work()).
			 *
			 * Note that different types of Btrfs work can depend on
			 * each other, and one type of work on one Btrfs
			 * filesystem may even depend on the same type of work
			 * on another Btrfs filesystem via, e.g., a loop device.
			 * Therefore, we must not allow the current work item to
			 * be recycled until we are really done, otherwise we
			 * break the above assumption and can deadlock.
			 */
			free_self = true;
		} else {
			/*
			 * We don't want to call the ordered free functions with
			 * the lock held.
			 */
			work->ordered_free(work);
			/* NB: work must not be dereferenced past this point. */
			trace_btrfs_all_work_done(wq->fs_info, work);
		}
	}
spin_unlock_irqrestore(lock, flags);
|
btrfs: don't prematurely free work in run_ordered_work()
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:
1. The top (loop device) filesystem queues an async_cow work item from
cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
(run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
(work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
thread A, so it schedules X to be executed _after_ worker thread A
finishes (see the find_worker_executing_work() call in
process_one_work()).
Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.
This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
P.S.:
One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
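The fix described above (deferring the free of the caller's own work item until the ordered loop has finished) can be sketched in userspace C. This is a simplified, single-threaded illustration only: `struct item`, `run_ordered()`, `free_log` and `demo_deferred_free()` are hypothetical names, and locking, tracing and the real ordered callbacks are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_work; not the kernel definition. */
struct item {
    struct item *next;  /* ordered list, head is the oldest item */
    bool done;          /* plays the role of WORK_DONE_BIT */
    int id;
};

/* Records the order in which items are freed, for inspection only. */
static int free_log[8];
static size_t free_count;

static void item_free(struct item *it)
{
    if (free_count < 8)
        free_log[free_count++] = it->id;
    free(it);
}

static struct item *item_new(int id, bool done)
{
    struct item *it = malloc(sizeof(*it));

    if (!it)
        abort();
    it->next = NULL;
    it->done = done;
    it->id = id;
    return it;
}

/*
 * Process finished items from the head, in order. Every item except
 * `self` is freed as soon as its ordered step is complete; `self` is
 * only freed after the loop, so its memory cannot be recycled while
 * this caller is still working through later items.
 * Returns the new list head.
 */
static struct item *run_ordered(struct item *head, struct item *self)
{
    bool free_self = false;

    while (head && head->done) {
        struct item *cur = head;

        head = cur->next;           /* unlink from the ordered list */
        /* the ordered step for `cur` would run here */
        if (cur == self)
            free_self = true;       /* defer the free */
        else
            item_free(cur);
    }
    if (free_self)
        item_free(self);            /* self must not be used after this */
    return head;
}

/* Items 1 and 2 are done, item 3 is not; item 1 is the caller's own. */
static size_t demo_deferred_free(void)
{
    struct item *a = item_new(1, true);
    struct item *b = item_new(2, true);
    struct item *c = item_new(3, false);
    struct item *rest;
    bool ok;

    free_count = 0;
    a->next = b;
    b->next = c;
    rest = run_ordered(a, a);
    ok = (rest == c);
    free(c);
    return ok ? free_count : 0;
}
```

In this sketch item 2 is freed before item 1 even though item 1 was processed first, which mirrors the intent of the `free_self` flag in the kernel function.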
2019-09-17 02:30:53 +08:00
|
|
|
|
|
|
|
if (free_self) {
|
|
|
|
self->ordered_free(self);
|
2019-09-17 02:30:58 +08:00
|
|
|
/* NB: self must not be dereferenced past this point. */
|
|
|
|
trace_btrfs_all_work_done(wq->fs_info, self);
|
2019-09-17 02:30:53 +08:00
|
|
|
}
|
2014-02-28 10:46:03 +08:00
|
|
|
}
|
|
|
|
|
2019-09-17 02:30:57 +08:00
|
|
|
static void btrfs_work_helper(struct work_struct *normal_work)
|
2014-02-28 10:46:03 +08:00
|
|
|
{
|
2019-09-17 02:30:57 +08:00
|
|
|
struct btrfs_work *work = container_of(normal_work, struct btrfs_work,
|
|
|
|
normal_work);
|
2022-04-18 12:43:09 +08:00
|
|
|
struct btrfs_workqueue *wq = work->wq;
|
2014-02-28 10:46:03 +08:00
|
|
|
int need_order = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We should not touch things inside work in the following cases:
|
|
|
|
* 1) after work->func() if it has no ordered_free
|
|
|
|
* Since the struct is freed in work->func().
|
|
|
|
* 2) after setting WORK_DONE_BIT
|
|
|
|
* The work may be freed in other threads almost instantly.
|
|
|
|
* So we save the needed things here.
|
|
|
|
*/
|
|
|
|
if (work->ordered_func)
|
|
|
|
need_order = 1;
|
|
|
|
|
2014-03-06 12:19:50 +08:00
|
|
|
trace_btrfs_work_sched(work);
|
2014-02-28 10:46:05 +08:00
|
|
|
thresh_exec_hook(wq);
|
2014-02-28 10:46:03 +08:00
|
|
|
work->func(work);
|
|
|
|
if (need_order) {
|
2021-11-02 20:49:16 +08:00
|
|
|
/*
|
|
|
|
* Ensures all memory accesses done in the work function are
|
|
|
|
* ordered before setting the WORK_DONE_BIT, ensuring that the thread
|
|
|
|
* which is going to execute the ordered work sees them.
|
|
|
|
* Pairs with the smp_rmb in run_ordered_work.
|
|
|
|
*/
|
|
|
|
smp_mb__before_atomic();
|
2014-02-28 10:46:03 +08:00
|
|
|
set_bit(WORK_DONE_BIT, &work->flags);
|
2019-09-17 02:30:53 +08:00
|
|
|
run_ordered_work(wq, work);
|
2019-09-17 02:30:58 +08:00
|
|
|
} else {
|
|
|
|
/* NB: work must not be dereferenced past this point. */
|
|
|
|
trace_btrfs_all_work_done(wq->fs_info, work);
|
2014-02-28 10:46:03 +08:00
|
|
|
}
|
|
|
|
}
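The ordering comment in btrfs_work_helper() (writes made by the work function must be visible before WORK_DONE_BIT is observed) has a rough userspace analog in C11 release/acquire atomics. The sketch below is an analogy under stated assumptions, not the kernel mechanism: the kernel pairs smp_mb__before_atomic() (a full barrier) with smp_rmb(), which is stronger than the release/acquire pair used here, and `struct job`, `job_finish()`, `job_try_read()` and `demo_publish()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item carrying a result. */
struct job {
    int result;         /* plain data written by the work function */
    atomic_bool done;   /* plays the role of WORK_DONE_BIT */
};

/* Producer side: write the data, then publish the flag with release. */
static void job_finish(struct job *j, int value)
{
    j->result = value;
    atomic_store_explicit(&j->done, true, memory_order_release);
}

/*
 * Consumer side: only read the data after an acquire load has seen the
 * flag set. Returns false if the job is not finished yet.
 */
static bool job_try_read(struct job *j, int *out)
{
    if (!atomic_load_explicit(&j->done, memory_order_acquire))
        return false;
    *out = j->result;
    return true;
}

/* Single-threaded walk through both sides; returns the value read. */
static int demo_publish(void)
{
    struct job j = { .result = 0 };
    int v = -1;

    atomic_init(&j.done, false);
    if (job_try_read(&j, &v))
        return -1;              /* must not be readable before finish */
    job_finish(&j, 42);
    if (!job_try_read(&j, &v))
        return -1;
    return v;
}
```

The demo runs both sides on one thread, so it only exercises the API shape; the ordering guarantee itself matters when job_finish() and job_try_read() run on different threads.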
|
|
|
|
|
2019-09-17 02:30:57 +08:00
|
|
|
void btrfs_init_work(struct btrfs_work *work, btrfs_func_t func,
|
|
|
|
btrfs_func_t ordered_func, btrfs_func_t ordered_free)
|
2014-02-28 10:46:03 +08:00
|
|
|
{
|
|
|
|
work->func = func;
|
|
|
|
work->ordered_func = ordered_func;
|
|
|
|
work->ordered_free = ordered_free;
|
2019-09-17 02:30:57 +08:00
|
|
|
INIT_WORK(&work->normal_work, btrfs_work_helper);
|
2014-02-28 10:46:03 +08:00
|
|
|
INIT_LIST_HEAD(&work->ordered_list);
|
|
|
|
work->flags = 0;
|
|
|
|
}
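btrfs_init_work() registers btrfs_work_helper() on the embedded normal_work member, and the helper recovers the enclosing struct btrfs_work with container_of(). A minimal userspace sketch of that embedding pattern follows; `struct inner`, `struct outer`, `helper()` and `demo_container_of()` are hypothetical stand-ins, and the macro is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of; the kernel macro adds type checks. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct inner {              /* stand-in for struct work_struct */
    int unused;
};

struct outer {              /* stand-in for struct btrfs_work */
    int tag;
    struct inner embedded;  /* plays the role of normal_work */
};

/* Callback that only receives the embedded member, like a work function. */
static int helper(struct inner *in)
{
    struct outer *out = container_of(in, struct outer, embedded);

    return out->tag;
}

static int demo_container_of(void)
{
    struct outer o = { .tag = 7 };

    return helper(&o.embedded);
}
```

The generic workqueue layer only needs to know about the embedded member, while the callback still reaches the filesystem-specific fields around it.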
|
|
|
|
|
2022-04-18 12:43:09 +08:00
|
|
|
void btrfs_queue_work(struct btrfs_workqueue *wq, struct btrfs_work *work)
|
2014-02-28 10:46:03 +08:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
work->wq = wq;
|
2014-02-28 10:46:05 +08:00
|
|
|
thresh_queue_hook(wq);
|
2014-02-28 10:46:03 +08:00
|
|
|
if (work->ordered_func) {
|
|
|
|
spin_lock_irqsave(&wq->list_lock, flags);
|
|
|
|
list_add_tail(&work->ordered_list, &wq->ordered_list);
|
|
|
|
spin_unlock_irqrestore(&wq->list_lock, flags);
|
|
|
|
}
|
2014-03-06 12:19:50 +08:00
|
|
|
trace_btrfs_work_queued(work);
|
2016-01-22 09:28:38 +08:00
|
|
|
queue_work(wq->normal_wq, &work->normal_work);
|
2014-02-28 10:46:03 +08:00
|
|
|
}
|
|
|
|
|
2014-02-28 10:46:19 +08:00
|
|
|
void btrfs_destroy_workqueue(struct btrfs_workqueue *wq)
|
2014-02-28 10:46:04 +08:00
|
|
|
{
|
|
|
|
if (!wq)
|
|
|
|
return;
|
2022-04-18 12:43:09 +08:00
|
|
|
destroy_workqueue(wq->normal_wq);
|
|
|
|
trace_btrfs_workqueue_destroy(wq);
|
2014-03-11 22:31:44 +08:00
|
|
|
kfree(wq);
|
2014-02-28 10:46:04 +08:00
|
|
|
}
|
|
|
|
|
2015-08-20 09:30:39 +08:00
|
|
|
void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int limit_active)
|
2014-02-28 10:46:03 +08:00
|
|
|
{
|
2022-04-18 12:43:09 +08:00
|
|
|
if (wq)
|
|
|
|
wq->limit_active = limit_active;
|
2014-02-28 10:46:03 +08:00
|
|
|
}
|
Btrfs: fix crash during unmount due to race with delayed inode workers
During unmount we can have a job from the delayed inode items work queue
still running, which can lead to at least two bad things:
1) A crash, because the worker can try to create a transaction just
after the fs roots were freed;
2) A transaction leak, because the worker can create a transaction
before the fs roots are freed and just after we committed the last
transaction and after we stopped the transaction kthread.
A stack trace example of the crash:
[79011.691214] kernel BUG at lib/radix-tree.c:982!
[79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2
(...)
[79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
[79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
(...)
[79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
[79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
[79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
[79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
[79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
[79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
[79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
[79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[79011.715448] Call Trace:
[79011.715925] record_root_in_trans+0x72/0xf0 [btrfs]
[79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
[79011.717925] start_transaction+0xdd/0x5c0 [btrfs]
[79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
[79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs]
[79011.720773] process_one_work+0x26d/0x6a0
[79011.721497] worker_thread+0x4f/0x3e0
[79011.722153] ? process_one_work+0x6a0/0x6a0
[79011.722901] kthread+0x103/0x140
[79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70
[79011.724379] ret_from_fork+0x3a/0x50
(...)
The following sequence of steps leads to the crash during unmount of the
filesystem. Steps are listed in time order, grouped by the CPU that runs
them:
CPU 1:
  btrfs_punch_hole()
    btrfs_btree_balance_dirty()
      btrfs_balance_delayed_items()
        --> sees fs_info->delayed_root->items with value 200, which is
            greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than
            BTRFS_DELAYED_WRITEBACK (512)
        btrfs_wq_run_delayed_node()
          --> queues a job for fs_info->delayed_workers to run
              btrfs_async_run_delayed_root()
CPU 3:
  btrfs_async_run_delayed_root()
    --> job queued by CPU 1
    --> starts picking and running delayed nodes from the prepare_list
        list
CPU 2:
  close_ctree()
    btrfs_delete_unused_bgs()
    btrfs_commit_super()
      btrfs_join_transaction()
        --> gets transaction N
      btrfs_commit_transaction(N)
        --> sets transaction state to TRANS_STATE_COMMIT_START
CPU 3:
  btrfs_first_prepared_delayed_node()
    --> picks delayed node X through the prepare_list list
CPU 2:
        btrfs_run_delayed_items()
          btrfs_first_delayed_node()
            --> also picks delayed node X but through the node_list list
          __btrfs_commit_inode_delayed_items()
            --> runs all delayed items from this node and drops the
                node's item count to 0 through call to
                btrfs_release_delayed_inode()
          --> finishes running any remaining delayed nodes
        --> finishes transaction commit
    --> stops cleaner and transaction threads
    btrfs_free_fs_roots()
      --> frees all roots and removes them from the radix tree
          fs_info->fs_roots_radix
CPU 3:
  btrfs_join_transaction()
    start_transaction()
      btrfs_record_root_in_trans()
        record_root_in_trans()
          radix_tree_tag_set()
            --> crashes because the root is not in the radix tree
                anymore
If the worker is able to call btrfs_join_transaction() before the unmount
task frees the fs roots, we end up leaking a transaction and all its
resources, since after the call to btrfs_commit_super() and stopping the
transaction kthread, we don't expect to have any transaction open anymore.
When this situation happens the worker has a delayed node that has no
more items to run, since the task calling btrfs_run_delayed_items(),
which is doing a transaction commit, picks the same node and runs all
its items first.
We cannot wait for the worker to complete when running delayed items
through btrfs_run_delayed_items(), because we call that function in
several phases of a transaction commit, and that could cause a deadlock
because the worker calls btrfs_join_transaction() and the task doing the
transaction commit may have already set the transaction state to
TRANS_STATE_COMMIT_DOING.
Also it's not possible to get into a situation where only some of the
items of a delayed node are added to the fs/subvolume tree in the current
transaction and the remaining ones in the next transaction, because when
running the items of a delayed inode we lock its mutex, effectively
waiting for the worker if the worker is running the items of the delayed
node already.
Since this can only cause issues when unmounting a filesystem, fix it in
a simple way by waiting for any jobs on the delayed workers queue before
calling btrfs_commit_super() at close_ctree(). This works because at this
point no one can call btrfs_btree_balance_dirty() or
btrfs_balance_delayed_items(), and if we end up waiting for any worker to
complete, btrfs_commit_super() will commit the transaction created by the
worker.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
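The ordering argument above (drain the delayed workers queue before the state those workers depend on is torn down) can be illustrated with a small sequential model. This is a toy sketch, not the kernel code path: there is no real concurrency, and `struct fs_state`, `run_job()`, `flush_jobs()`, `teardown()` and `demo_teardown()` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state an unmount has to tear down. */
struct fs_state {
    bool roots_valid;   /* stands in for the fs roots still existing */
    int pending;        /* jobs queued but not yet run */
    int late_jobs;      /* jobs that ran after the roots were freed */
};

/* A job that needs the roots; running it too late is the bug. */
static void run_job(struct fs_state *fs)
{
    if (!fs->roots_valid)
        fs->late_jobs++;    /* in the kernel this is where the crash occurs */
}

/* Run every pending job, in the spirit of btrfs_flush_workqueue(). */
static void flush_jobs(struct fs_state *fs)
{
    while (fs->pending > 0) {
        fs->pending--;
        run_job(fs);
    }
}

/*
 * Tear down the state, optionally flushing first. Jobs still pending
 * after the roots are freed are run afterwards to model a worker that
 * was still in flight. Returns the number of jobs that ran too late.
 */
static int teardown(struct fs_state *fs, bool flush_first)
{
    if (flush_first)
        flush_jobs(fs);
    fs->roots_valid = false;
    flush_jobs(fs);
    return fs->late_jobs;
}

static int demo_teardown(bool flush_first)
{
    struct fs_state fs = { .roots_valid = true, .pending = 2, .late_jobs = 0 };

    return teardown(&fs, flush_first);
}
```

With the flush in place no job observes the freed state; without it, both queued jobs run too late, which corresponds to the race the commit closes.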
2020-02-28 21:04:36 +08:00
|
|
|
|
|
|
|
void btrfs_flush_workqueue(struct btrfs_workqueue *wq)
|
|
|
|
{
|
2022-04-18 12:43:09 +08:00
|
|
|
flush_workqueue(wq->normal_wq);
|
2020-02-28 21:04:36 +08:00
|
|
|
}
|