OpenCloudOS-Kernel/block
Omar Sandoval 4c5b123ab2 blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
commit e972b08b91ef48488bae9789f03cfedb148667fb upstream.

We're seeing crashes from rq_qos_wake_function that look like this:

  BUG: unable to handle page fault for address: ffffafe180a40084
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
  Oops: Oops: 0002 [#1] PREEMPT SMP PTI
  CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
  RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
  Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
  RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
  RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
  RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
  R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
  R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <IRQ>
   try_to_wake_up+0x5a/0x6a0
   rq_qos_wake_function+0x71/0x80
   __wake_up_common+0x75/0xa0
   __wake_up+0x36/0x60
   scale_up.part.0+0x50/0x110
   wb_timer_fn+0x227/0x450
   ...

So rq_qos_wake_function() calls wake_up_process(data->task), which calls
try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).

p comes from data->task, and data comes from the waitqueue entry, which
is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
dump with drgn, I found that the waiter had already woken up and moved
on to a completely unrelated code path, clobbering what was previously
data->task. Meanwhile, the waker was passing the clobbered garbage in
data->task to wake_up_process(), leading to the crash.

What's happening is that in between rq_qos_wake_function() deleting the
waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
that it already got a token and returning. The race looks like this:

rq_qos_wait()                           rq_qos_wake_function()
==============================================================
prepare_to_wait_exclusive()
                                        data->got_token = true;
                                        list_del_init(&curr->entry);
if (data.got_token)
        break;
finish_wait(&rqw->wait, &data.wq);
  ^- returns immediately because
     list_empty_careful(&wq_entry->entry)
     is true
... return, go do something else ...
                                        wake_up_process(data->task)
                                          (NO LONGER VALID!)-^

Normally, finish_wait() is supposed to synchronize against the waker.
But, as noted above, it is returning immediately because the waitqueue
entry has already been removed from the waitqueue.

The bug is that rq_qos_wake_function() is accessing the waitqueue entry
AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
and THEN deletes the waitqueue entry, which is the proper order.

Fix it by swapping the order. We also need to use
list_del_init_careful() to match the list_empty_careful() in
finish_wait().

Fixes: 38cfb5a45e ("blk-wbt: improve waking of tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-22 15:46:27 +02:00
..
partitions block: fix potential invalid pointer dereference in blk_add_partition 2024-10-04 16:29:01 +02:00
Kconfig block: sed-opal: keyring support for SED keys 2023-08-22 11:10:26 -06:00
Kconfig.iosched block: Default to use cgroup support for BFQ 2023-01-30 09:42:42 -07:00
Makefile block: move the code to do early boot lookup of block devices to block/ 2023-06-05 10:57:40 -06:00
badblocks.c block/badblocks: Remove redundant assignments 2022-04-23 07:15:26 -06:00
bdev.c block: Provide bdev_open_* functions 2024-03-26 18:19:40 -04:00
bfq-cgroup.c blkcg: Restructure blkg_conf_prep() and friends 2023-04-13 06:46:49 -06:00
bfq-iosched.c block, bfq: fix procress reference leakage for bfqq in merge chain 2024-10-04 16:29:00 +02:00
bfq-iosched.h block, bfq: remove BFQ_WEIGHT_LEGACY_DFL 2023-04-06 16:17:32 -06:00
bfq-wf2q.c block, bfq: inject I/O to underutilized actuators 2023-01-29 15:18:33 -07:00
bio-integrity.c block: initialize integrity buffer to zero before writing it to media 2024-08-03 08:53:20 +02:00
bio.c block: Fix page refcounts for unaligned buffers in __bio_release_pages() 2024-04-03 15:28:27 +02:00
blk-cgroup-fc-appid.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
blk-cgroup-rwstat.c Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" 2023-02-14 14:24:09 -07:00
blk-cgroup-rwstat.h block: Use the new blk_opf_t type 2022-07-14 12:14:30 -06:00
blk-cgroup.c blk-cgroup: Properly propagate the iostat update up the hierarchy 2024-06-12 11:12:46 +02:00
blk-cgroup.h block: fix q->blkg_list corruption during disk rebind 2024-04-17 11:19:28 +02:00
blk-core.c block: Fix where bio IO priority gets set 2024-09-30 16:25:12 +02:00
blk-crypto-fallback.c blk-crypto: dynamically allocate fallback profile 2023-08-18 15:00:39 -06:00
blk-crypto-internal.h blk-crypto: remove blk_crypto_insert_cloned_request() 2023-03-16 09:35:09 -06:00
blk-crypto-profile.c blk-crypto: use dynamic lock class for blk_crypto_profile::lock 2023-07-05 16:36:12 -06:00
blk-crypto-sysfs.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-crypto.c blk-crypto: make blk_crypto_evict_key() more robust 2023-03-16 09:35:09 -06:00
blk-flush.c block: fix request.queuelist usage in flush 2024-06-21 14:38:35 +02:00
blk-ia-ranges.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-integrity.c block: remove the blk_flush_integrity call in blk_integrity_unregister 2024-09-08 07:54:47 +02:00
blk-ioc.c blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue() 2023-06-07 07:51:00 -06:00
blk-iocost.c blk_iocost: fix more out of bound shifts 2024-10-10 11:57:24 +02:00
blk-iolatency.c block: fix bad lockdep annotation in blk-iolatency 2023-08-10 17:24:53 -06:00
blk-ioprio.c blk-ioprio: Introduce promote-to-rt policy 2023-06-06 22:26:26 -06:00
blk-ioprio.h blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit 2022-09-26 19:09:31 -06:00
blk-lib.c blk-lib: fix blkdev_issue_secure_erase 2022-09-15 00:25:17 -06:00
blk-map.c block: Fix WARNING in _copy_from_iter 2024-03-01 13:34:49 +01:00
blk-merge.c block: support to account io_ticks precisely 2024-06-12 11:11:35 +02:00
blk-mq-cpumap.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-debugfs-zoned.c block: move zone related fields to struct gendisk 2022-07-06 06:46:26 -06:00
blk-mq-debugfs.c blk-mq: fix potential io hang by wrong 'wake_batch' 2023-06-12 09:55:53 -06:00
blk-mq-debugfs.h block: remove per-disk debugfs files in blk_unregister_queue 2022-06-17 07:31:05 -06:00
blk-mq-pci.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-sched.c blk-mq: cleanup __blk_mq_sched_dispatch_requests 2023-04-13 06:57:18 -06:00
blk-mq-sched.h blk-mq: make sure elevator callbacks aren't called for passthrough request 2023-05-18 19:42:54 -06:00
blk-mq-sysfs.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-tag.c block: Fix lockdep warning in blk_mq_mark_tag_wait 2024-08-29 17:33:30 +02:00
blk-mq-virtio.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq.c block: Fix where bio IO priority gets set 2024-09-30 16:25:12 +02:00
blk-mq.h blk-mq: fix potential io hang by wrong 'wake_batch' 2023-06-12 09:55:53 -06:00
blk-pm.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-pm.h block: Remove unused blk_pm_*() function definitions 2021-02-22 06:33:48 -07:00
blk-rq-qos.c blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race 2024-10-22 15:46:27 +02:00
blk-rq-qos.h blk-iolatency: s/blkcg_rq_qos/iolat_rq_qos/ 2023-04-13 06:46:49 -06:00
blk-settings.c block: Clear zone limits for a non-zoned stacked queue 2024-04-03 15:28:20 +02:00
blk-stat.c block: prevent division by zero in blk_rq_stat_sum() 2024-04-13 13:07:37 +02:00
blk-stat.h block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-sysfs.c block: don't allow enabling a cache on devices that don't support it 2023-07-17 08:18:18 -06:00
blk-throttle.c blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!" 2023-12-20 17:01:55 +01:00
blk-throttle.h blk-throttle: print signed value 'carryover_bytes/ios' for user 2023-08-30 10:15:01 -06:00
blk-timeout.c block: blk-timeout: delete duplicated word 2020-07-31 16:29:47 -06:00
blk-wbt.c blk-wbt: Fix detection of dirty-throttled tasks 2024-02-23 09:25:16 +01:00
blk-wbt.h blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled 2023-06-26 09:53:36 -06:00
blk-zoned.c Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes 2023-07-11 12:15:15 -04:00
blk.h block: support to account io_ticks precisely 2024-06-12 11:11:35 +02:00
bounce.c block: change the blk_queue_bounce calling convention 2022-08-02 17:22:54 -06:00
bsg-lib.c scsi: replace the fmode_t argument to ->sg_io_fn with a simple bool 2023-06-12 08:04:04 -06:00
bsg.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
disk-events.c block: fix kernel-doc for disk_force_media_change() 2023-09-26 00:43:34 -06:00
early-lookup.c block: don't return -EINVAL for not found names in devt_from_devname 2023-06-22 09:09:33 -06:00
elevator.c blk-mq: release scheduler resource when request completes 2023-08-19 07:47:17 -06:00
elevator.h blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
fops.c block: refine the EOF check in blkdev_iomap_begin 2024-06-12 11:11:35 +02:00
genhd.c block: fix deadlock between sd_remove & sd_release 2024-08-03 08:54:24 +02:00
holder.c block: don't allow a disk link holder to itself 2022-11-16 15:19:56 -07:00
ioctl.c block/ioctl: prefer different overflow check 2024-06-27 13:49:01 +02:00
ioprio.c scsi: block: Improve ioprio value validity checks 2023-06-16 12:04:30 -04:00
kyber-iosched.c blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
mq-deadline.c block/mq-deadline: Fix the tag reservation code 2024-08-03 08:53:22 +02:00
opal_proto.h block: sed-opal: handle empty atoms when parsing response 2024-03-26 18:19:12 -04:00
sed-opal.c block: sed-opal: avoid possible wrong address reference in read_sed_opal_key() 2024-06-21 14:38:35 +02:00
t10-pi.c block: add pi for extended integrity 2022-03-07 12:48:35 -07:00