OpenCloudOS-Kernel/block
Tejun Heo 8a177a36da blk-iolatency: Fix inflight count imbalances and IO hangs on offline
iolatency needs to track the number of inflight IOs per cgroup. As this
tracking can be expensive, it is disabled when no cgroup has iolatency
configured for the device. To ensure that the inflight counters stay
balanced, iolatency_set_limit() freezes the request_queue while manipulating
the enabled counter, which ensures that no IO is in flight and thus all
counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled
counter is manipulated. iolatency_pd_offline() can also dec the counter and
trigger disabling. As this disabling happens without freezing the q, this
can easily happen while some IOs are in flight and thus leak the counts.

This can be easily demonstrated by turning on iolatency on an one empty
cgroup while IOs are in flight in other cgroups and then removing the
cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
system to ensure that removing the cgroup disables iolatency for the whole
device.

The following keeps flipping on and off iolatency on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

  while True:
    for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
        for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
            blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
            pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
            if pd.value_() == 0:
                continue
            iolat = container_of(pd, 'struct iolatency_grp', 'pd')
            inflight = iolat.rq_wait.inflight.counter.value_()
            if inflight:
                print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                      f'{cgroup_path(css.cgroup).decode("utf-8")}')
    time.sleep(1)

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with stuck inflight ends up getting throttled, the throttled IOs
will never get issued as there's no completion event to wake it up leading
to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item which
is automatically kicked off from iolatency_set_min_lat_nsec() which is
called from both iolatency_set_limit() and iolatency_pd_offline() paths.
Punting to a work item is necessary as iolatency_pd_offline() is called
under spinlocks while freezing a request_queue requires a sleepable context.

This also simplifies the code reducing LOC sans the comments and avoids the
unnecessary freezes which were happening whenever a cgroup's latency target
is newly set or cleared.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Liu Bo <bo.liu@linux.alibaba.com>
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Cc: stable@vger.kernel.org # v5.0+
Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-26 11:43:00 -06:00
..
partitions block/partitions/ldm: Remove redundant assignments 2022-04-23 07:15:26 -06:00
Kconfig block: add pi for extended integrity 2022-03-07 12:48:35 -07:00
Kconfig.iosched block: only build the icq tracking code when needed 2021-12-16 10:59:02 -07:00
Makefile blk-cgroup: move blkcg_{get,set}_fc_appid out of line 2022-05-02 14:06:20 -06:00
badblocks.c block/badblocks: Remove redundant assignments 2022-04-23 07:15:26 -06:00
bdev.c Merge branch 'akpm' (patches from Andrew) 2022-03-22 16:11:53 -07:00
bfq-cgroup.c bfq: Make sure bfqg for which we are queueing requests is online 2022-04-17 19:34:32 -06:00
bfq-iosched.c bfq: Remove bfq_requeue_request_body() 2022-05-19 06:52:36 -06:00
bfq-iosched.h bfq: Relax waker detection for shared queues 2022-05-19 06:52:33 -06:00
bfq-wf2q.c block, bfq: cleanup bfq_bfqq_to_bfqg() 2022-02-18 06:13:00 -07:00
bio-integrity.c for-5.18/block-2022-03-18 2022-03-21 16:48:55 -07:00
bio.c block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone 2022-05-04 18:29:52 -06:00
blk-cgroup-fc-appid.c blk-cgroup: move blkcg_{get,set}_fc_appid out of line 2022-05-02 14:06:20 -06:00
blk-cgroup-rwstat.c blk-cgroup: Fix the recursive blkg rwstat 2021-03-05 11:32:15 -07:00
blk-cgroup-rwstat.h block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-cgroup.c blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() 2022-05-18 16:32:00 -06:00
blk-cgroup.h blk-cgroup: always terminate io.stat lines 2022-05-17 06:11:17 -06:00
blk-core.c block: cleanup the VM accounting in submit_bio 2022-05-16 11:37:50 -06:00
blk-crypto-fallback.c block: remove superfluous calls to blkcg_bio_issue_init 2022-05-04 18:29:52 -06:00
blk-crypto-internal.h blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-crypto-profile.c blk-crypto: remove blk_crypto_unregister() 2021-11-29 06:38:51 -07:00
blk-crypto-sysfs.c blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-crypto.c blk-crypto: show crypto capabilities in sysfs 2022-02-28 06:40:23 -07:00
blk-flush.c block: pass a block_device and opf to bio_init 2022-02-02 07:49:59 -07:00
blk-ia-ranges.c block: fix memory leak in disk_register_independent_access_ranges 2022-01-23 09:13:09 -07:00
blk-integrity.c blk-crypto: remove blk_crypto_unregister() 2021-11-29 06:38:51 -07:00
blk-ioc.c block: restore the old set_task_ioprio() behaviour wrt PF_EXITING 2022-03-28 06:34:11 -06:00
blk-iocost.c blk-cgroup: always terminate io.stat lines 2022-05-17 06:11:17 -06:00
blk-iolatency.c blk-iolatency: Fix inflight count imbalances and IO hangs on offline 2022-05-26 11:43:00 -06:00
blk-ioprio.c block: partition include/linux/blk-cgroup.h 2022-02-11 10:02:41 -07:00
blk-ioprio.h block: Introduce the ioprio rq-qos policy 2021-06-21 15:03:40 -06:00
blk-lib.c block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD 2022-04-17 19:49:59 -06:00
blk-map.c block/blk-map: Remove redundant assignment 2022-04-23 07:15:26 -06:00
blk-merge.c for-5.18/write-streams-2022-03-18 2022-03-26 11:51:46 -07:00
blk-mq-cpumap.c blk-mq: remove the calling of local_memory_node() 2020-10-20 07:08:17 -06:00
blk-mq-debugfs-zoned.c block: Cleanup license notice 2019-01-17 21:21:40 -07:00
blk-mq-debugfs.c block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD 2022-04-17 19:49:59 -06:00
blk-mq-debugfs.h blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-mq-pci.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-rdma.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-sched.c block: limit request dispatch loop duration 2022-03-17 20:31:43 -06:00
blk-mq-sched.h block: move blk_mq_sched_assign_ioc to blk-ioc.c 2021-11-29 06:41:29 -07:00
blk-mq-sysfs.c blk-mq: prepare for implementing hctx table via xarray 2022-03-08 17:57:19 -07:00
blk-mq-tag.c blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-mq-tag.h blk-mq: Delete busy_iter_fn 2021-12-06 13:18:47 -07:00
blk-mq-virtio.c blk-mq: Fix typo in comment 2020-03-17 20:55:21 +01:00
blk-mq.c blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx 2022-05-23 06:28:28 -06:00
blk-mq.h blk-mq: manage hctx map via xarray 2022-03-08 19:39:38 -07:00
blk-pm.c scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume() 2021-12-22 23:38:29 -05:00
blk-pm.h block: Remove unused blk_pm_*() function definitions 2021-02-22 06:33:48 -07:00
blk-rq-qos.c rq-qos: fix missed wake-ups in rq_qos_throttle try two 2021-06-08 15:12:57 -06:00
blk-rq-qos.h block: fix rq-qos breakage from skipping rq_qos_done_bio() 2022-03-14 14:23:13 -06:00
blk-settings.c block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD 2022-04-17 19:49:59 -06:00
blk-stat.c block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-stat.h block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-sysfs.c SCSI misc on 20220324 2022-03-24 19:37:53 -07:00
blk-throttle.c blk-throttle: Set BIO_THROTTLED when bio has been throttled 2022-05-17 19:32:10 -06:00
blk-throttle.h block: cancel all throttled bios in del_gendisk() 2022-03-18 09:57:56 -06:00
blk-timeout.c block: blk-timeout: delete duplicated word 2020-07-31 16:29:47 -06:00
blk-wbt.c blk-wbt: prevent NULL pointer dereference in wb_timer_fn 2021-10-19 06:13:41 -06:00
blk-wbt.h blk-wbt: remove wbt_track stub 2022-03-31 12:58:38 -06:00
blk-zoned.c SCSI misc on 20220324 2022-03-24 19:37:53 -07:00
blk.h block: refactor discard bio size limiting 2022-04-17 19:49:59 -06:00
bounce.c block: remove superfluous calls to blkcg_bio_issue_init 2022-05-04 18:29:52 -06:00
bsg-lib.c block: remove the gendisk argument to blk_execute_rq 2021-11-29 06:41:29 -07:00
bsg.c scsi: bsg: Fix device unregistration 2021-09-14 00:22:15 -04:00
disk-events.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
elevator.c for-5.18/block-2022-03-18 2022-03-21 16:48:55 -07:00
elevator.h block: move elevator.h to block/ 2021-10-18 06:17:01 -06:00
fops.c block: ignore RWF_HIPRI hint for sync dio 2022-05-02 10:07:42 -06:00
genhd.c block: remove queue_discard_alignment 2022-04-17 19:49:59 -06:00
holder.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
ioctl.c block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD 2022-04-17 19:49:59 -06:00
ioprio.c for-5.17/block-2022-01-11 2022-01-12 10:26:52 -08:00
kyber-iosched.c block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
mq-deadline.c block: fix async_depth sysfs interface for mq-deadline 2022-01-20 10:54:02 -07:00
opal_proto.h block: sed-opal: Change the check condition for regular session validity 2020-03-12 08:00:10 -06:00
sed-opal.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
t10-pi.c block: add pi for extended integrity 2022-03-07 12:48:35 -07:00