OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Bart Van Assche	b96f3cab59	block/bfq: Enable I/O statistics BFQ uses io_start_time_ns. That member variable is only set if I/O statistics are enabled. Hence this patch that enables I/O statistics at the time BFQ is associated with a request queue. Compile-tested only. Reported-by: Cixi Geng <cixi.geng1@unisoc.com> Cc: Cixi Geng <cixi.geng1@unisoc.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-16 16:59:28 -06:00
Ming Lei	6cfeadbff3	blk-mq: don't clear flush_rq from tags->rqs[] commit `364b61818f` ("blk-mq: clearing flush request reference in tags->rqs[]") is added to clear the to-be-free flush request from tags->rqs[] for avoiding use-after-free on the flush rq. Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time by ~8s because running scsi probe which may create and remove lots of unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and each request queue has lots of hw queues. Improve the situation by not running blk_mq_clear_flush_rq_mapping if disk isn't added when there can't be any flush request issued. Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-16 14:45:15 -06:00
Ming Lei	4d337cebcb	blk-mq: avoid to touch q->elevator without any protection q->elevator is referred in blk_mq_has_sqsched() without any protection, no .q_usage_counter is held, no queue srcu and rcu read lock is held, so potential use-after-free may be triggered. Fix the issue by adding one queue flag for checking if the elevator uses single queue style dispatch. Meantime the elevator feature flag of ELEVATOR_F_MQ_AWARE isn't needed any more. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-16 14:45:15 -06:00
Ming Lei	5fd7a84a09	blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none elevator can be tore down by sysfs switch interface or disk release, so hold ->sysfs_lock before referring to q->elevator, then potential use-after-free can be avoided. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220616014401.817001-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-16 14:45:15 -06:00
Bart Van Assche	14dc7a18ab	block: Fix handling of offline queues in blk_mq_alloc_request_hctx() This patch prevents that test nvme/004 triggers the following: UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9 index 512 is out of range for type 'long unsigned int [512]' Call Trace: show_stack+0x52/0x58 dump_stack_lvl+0x49/0x5e dump_stack+0x10/0x12 ubsan_epilogue+0x9/0x3b __ubsan_handle_out_of_bounds.cold+0x44/0x49 blk_mq_alloc_request_hctx+0x304/0x310 __nvme_submit_sync_cmd+0x70/0x200 [nvme_core] nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics] nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop] nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop] nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics] nvmf_dev_write+0xae/0x111 [nvme_fabrics] vfs_write+0x144/0x560 ksys_write+0xb7/0x140 __x64_sys_write+0x42/0x50 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Fixes: `20e4d81393` ("blk-mq: simplify queue mapping & schedule with each possisble CPU") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-16 14:43:31 -06:00
Christoph Hellwig	d5a37b1998	block: remove bioset_init_from_src Unused now, and the interface never really made a whole lot of sense to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-08 14:04:14 -04:00
Linus Torvalds	78c6499c92	for-5.19/drivers-2022-06-02 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmkoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqyrD/4iyg2ULBPyljLoM3Ed8AONbrFBApenKDFN FjiFRZNll8zvJLTtP0GQqJIrljPuRlqb0IkbgqXl+yvPZ+wpB9HgIe2aohkOaqqJ KS49UNR/aIHwC2y7lwlcsdgVqqoPdc4wnZeaQvCsWPCBhCca/k0kR7s3uEhHMK92 OWpV9osl/thLfOBwwt4IEaO1Koz8PM/fCR4XA2KLbs8E4P8EcSFglqi0ap7foLNr pQAJIlPjkmF6nw4xg5fdBjVBo//kVcuf9IBMi5/XinmUL1taFAcn5WyeOvBbi0Fs Sqp/pKkveM8xWZKrDyA/wf8nzRNpBBl6TOQEMFtV6FkZrij3pbKHlCiyL2gBR13e 5gkbVvXgtdqDVlnqlvIV/Swfh5YQFtn7+vlHFOUjP+iObsBRo4fTvDhPTLoO/VCf tIAA7xnq/pquRKS/QGGC7ZxRVc3T1r+EpvbBP5Dc4CDGbbbyZLCSrOh5HWIb0I3k 95GSiipTtf54KqZ9HiG/u+xNAFIdapXgU4Xm+JyDRWLdxSs5Nmy8VgpdflvxBfuo hCvJJw3vtDusjyHc7IafxaZlJVQT9tPgPshs8GfrCMCP19RCALD/5irspFh6dD35 BQTIkhC68XNa0iNn/NTP3uxir/JwRoovxQkA9eD+r1NHsAbL8GTypfr5kKJxDaIK UhawfyZE3Q== =YqhC -----END PGP SIGNATURE----- Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block Pull more block driver updates from Jens Axboe: "A collection of stragglers that were late on sending in their changes and just followup fixes. - NVMe fixes pull request via Christoph: - set controller enable bit in a separate write (Niklas Cassel) - disable namespace identifiers for the MAXIO MAP1001 (Christoph) - fix a comment typo (Julia Lawall)" - MD fixes pull request via Song: - Remove uses of bdevname (Christoph Hellwig) - Bug fixes (Guoqing Jiang, and Xiao Ni) - bcache fixes series (Coly) - null_blk zoned write fix (Damien) - nbd fixes (Yu, Zhang) - Fix for loop partition scanning (Christoph)" * tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits) block: null_blk: Fix null_zone_write() nvmet: fix typo in comment nvme: set controller enable bit in a separate write nvme-pci: disable namespace identifiers for the MAXIO MAP1001 bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() nbd: use pr_err to output error message nbd: fix possible overflow on 'first_minor' in nbd_dev_add() nbd: fix io hung while disconnecting device nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed nbd: fix race between nbd_alloc_config() and module removal nbd: call genl_unregister_family() first in nbd_cleanup() md: bcache: check the return value of kzalloc() in detached_dev_do_request() bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() block, loop: support partitions without scanning bcache: avoid journal no-space deadlock by reserving 1 journal bucket bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() bcache: improve multithreaded bch_sectors_dirty_init() bcache: improve multithreaded bch_btree_check() md: fix double free of io_acct_set bioset md: Don't set mddev private to NULL in raid0 pers->free ...	2022-06-03 10:25:56 -07:00
Linus Torvalds	72fbbc3d0e	for-5.19/block-exec-2022-06-02 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmh0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqg6EACCbkwoH7rrr38iU++xP3c9oqFJCfYR95ho qn9/3FiPulua0Dwg8Fbp12ubqqBy/iNj+4Mk7XTo28P7ahjGtKJec2DZguDuHC5X G3kucgQcDOLs1IMWoil+KrnjGC8qeT9ZPFNaUF0IY084NPxnj1wAOjo1J00QVieN WFgHX1sxzBje8abebf3UxAyXImzfyY2uXbp1F3thzf0ZwHXkSDsbWI3fvpdYF4QC p3z6CX0sR+5v7ZLWF3X6H8MBSO+eRlprYji3O/0jVslLBAS8FlTdizQtzx7C6Hsv JZVY4ZsUswYxtsHCBR0McglDeu/iXZRZ9HX4iiOobYJNaXfycMltvS+4Tb/TsFTB GaG6tbL4JS+NT063ctl5h355vUVhIbw6qEhsiF47+0hgawvRr/xxP0aSq1MPmfjw OgG4Jn0htXF47tpKnszfaj3BmgvgV56mV0IGwF5Sh5NXDnHF+MHFmrhRsP1NenjL 12FTnvWGYyTRGDVIVFDkuwaI9o9iNKdFw0JkKIZa/G5RVmmjukvMGTvvSnKmSxJg dgbYLSBA2ZxCcIPjJDvZroe3QxvGNyqUYtxwyWl4a1HK/qljZfwwyRE1rfjJ47hK F8jNEkOThcjr1anoV2nvSLE1mM3SyA/UqDsntIdwnUlG/ObYsByTgtKc+ETs1RzS 8Ovp6lv0Mw== =lBeb -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block Pull block request execute cleanups from Jens Axboe: "This change was advertised in the initial core block pull request, but didn't actually make that branch as we deferred it to a post-merge pull request to avoid a bunch of cross branch issues. This series cleans up the block execute path quite nicely" * tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block: blk-mq: remove the done argument to blk_execute_rq_nowait blk-mq: avoid a mess of casts for blk_end_sync_rq blk-mq: remove __blk_execute_rq_nowait	2022-06-03 10:21:43 -07:00
Linus Torvalds	34845d92bc	for-5.19/block-2022-06-02 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmfQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsLlEACPbK/ms8dMDwKjfEF/RMoc7uL/j6oC0cpf 0D2sfMka8D41QdrUfMiUismXZ61dyKdsiX/U/Q0gcjIomnlco8ZeLcLa6DlafjwY DtvO2aCb+eBAkII5sX2WM4ANNgFTy08Y4wBmgEy5En5u4nPlIGZ8DsulQQodqygx 1lJh31OXQKw+2kIyUdAeC0GMiD9nddYDsH0CTFDSZsAijCcOBDOHbDPk27wHapzM GR1UAK5/SA7RfZgIMRHHclF6Ea49/uPJ45crD1T+8p6jLW+ldbxpiRD3ux9BnK2v U7EWS5MLMFAvb/nTLc8T37srJuEhBAT0r2bn614rjOiJofalPeD0eDeHfz4vRpPe +qTQREtpBUtJizYN+8rpcxP8f9S/hmPOBvIKD3XC0TlOo1NCf35fqWLWMli2hkTQ AfcY1auKjC/UYcnR0TQ91aHo1puM4fK5Pdc6lDGznrcxy9t1g1NvKAEL9Y3xK3No paglrliBCUbAN8vogKr4jc7jRkh/GLEqkxV2LIpOVp3lyT9GepvYM1xLQ8X/rszn /Il3fAwf5AyP+1RoVcmmOy1XW0ptUbKXWn03NlxN55Ya8x3tKCwWWDSmL2CP8SwV Vo5Qt+rKUkqA/TmHW8HOd7i+44Sa8oD/6WpSSPkwXN2cgRQmvmtaGmpXCKNTn5tk PMgFJOq3uw== =7JDU -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Just a collection of fixes that have been queued up since the initial merge window pull request, the majority of which are targeted for stable as well. One bio_set fix that fixes an issue with the dm adoption of cached bio structs that got introduced in this merge window" * tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block: block: Fix potential deadlock in blk_ia_range_sysfs_show() block: fix bio_clone_blkg_association() to associate with proper blkcg_gq block: remove useless BUG_ON() in blk_mq_put_tag() blk-mq: do not update io_ticks with passthrough requests block: make bioset_exit() fully resilient against being called twice block: use bio_queue_enter instead of blk_queue_enter in bio_poll block: document BLK_STS_AGAIN usage block: take destination bvec offsets into account in bio_copy_data_iter blk-iolatency: Fix inflight count imbalances and IO hangs on offline blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx	2022-06-03 10:14:48 -07:00
Damien Le Moal	41e46b3c2a	block: Fix potential deadlock in blk_ia_range_sysfs_show() When being read, a sysfs attribute is already protected against removal with the kobject node active reference counter. As a result, in blk_ia_range_sysfs_show(), there is no need to take the queue sysfs lock when reading the value of a range attribute. Using the queue sysfs lock in this function creates a potential deadlock situation with the disk removal, something that a lockdep signals with a splat when the device is removed: [ 760.703551] Possible unsafe locking scenario: [ 760.703551] [ 760.703554] CPU0 CPU1 [ 760.703556] ---- ---- [ 760.703558] lock(&q->sysfs_lock); [ 760.703565] lock(kn->active#385); [ 760.703573] lock(&q->sysfs_lock); [ 760.703579] lock(kn->active#385); [ 760.703587] [ 760.703587] * DEADLOCK * Solve this by removing the mutex_lock()/mutex_unlock() calls from blk_ia_range_sysfs_show(). Fixes: `a2247f19ee` ("block: Add independent access ranges support") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220603021905.1441419-1-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-02 23:02:37 -06:00
Jan Kara	22b106e535	block: fix bio_clone_blkg_association() to associate with proper blkcg_gq Commit `d92c370a16` ("block: really clone the block cgroup in bio_clone_blkg_association") changed bio_clone_blkg_association() to just clone bio->bi_blkg reference from source to destination bio. This is however wrong if the source and destination bios are against different block devices because struct blkcg_gq is different for each bdev-blkcg pair. This will result in IOs being accounted (and throttled as a result) multiple times against the same device (src bdev) while throttling of the other device (dst bdev) is ignored. In case of BFQ the inconsistency can even result in crashes in bfq_bic_update_cgroup(). Fix the problem by looking up correct blkcg_gq for the cloned bio. Reported-by: Logan Gunthorpe <logang@deltatee.com> Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Fixes: `d92c370a16` ("block: really clone the block cgroup in bio_clone_blkg_association") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-02 02:15:05 -06:00
Damien Le Moal	ff47dbd18b	block: remove useless BUG_ON() in blk_mq_put_tag() Since the if condition in blk_mq_put_tag() checks that the tag to put is not a reserved one, the BUG_ON() check in the else branch checking if the tag is indeed a reserved one is useless. Remove it. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-02 02:05:56 -06:00
Haisu Wang	b81c14ca14	blk-mq: do not update io_ticks with passthrough requests Flush or passthrough requests are not accounted as normal IO in completion. To reflect iostat for slow IO, io_ticks is updated when stat show called based on inflight numbers. It may cause inconsistent io_ticks calculation result. So do not account non-passthrough request when check inflight. Fixes: `86d7331299` ("block: update io_ticks when io hang") Signed-off-by: Haisu Wang <haisuwang@tencent.com> Reviewed-by: samuelliao <samuelliao@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-30 07:13:12 -06:00
Jens Axboe	605f7415ec	block: make bioset_exit() fully resilient against being called twice Most of bioset_exit() is fine being called twice, as it clears the various allocations etc when they are freed. The exception is bio_alloc_cache_destroy(), which does not clear ->cache when it has freed it. This isn't necessarily a bug, but can be if buggy users does call the exit path more then once, or with just a memset() bioset which has never been initialized. dm appears to be one such user. Fixes: `be4d234d7a` ("bio: add allocation cache abstraction") Link: https://lore.kernel.org/linux-block/YpK7m+14A+pZKs5k@casper.infradead.org/ Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-29 07:36:31 -06:00
Christoph Hellwig	e2e5308672	blk-mq: remove the done argument to blk_execute_rq_nowait Let the caller set it together with the end_io_data instead of passing a pointless argument. Note the the target code did in fact already set it and then just overrode it again by calling blk_execute_rq_nowait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:15:27 -06:00
Christoph Hellwig	32ac5a9b8b	blk-mq: avoid a mess of casts for blk_end_sync_rq Instead of trying to cast a __bitwise 32-bit integer to a larger integer and then a pointer, just allow a struct with the blk_status_t and the completion on stack and set the end_io_data to that. Use the opportunity to move the code to where it belongs and drop rather confusing comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:15:27 -06:00
Christoph Hellwig	ae948fd6d0	blk-mq: remove __blk_execute_rq_nowait We don't want to plug for synchronous execution that where we immediately wait for the request. Once that is done not a whole lot of code is shared, so just remove __blk_execute_rq_nowait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220524121530.943123-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:15:27 -06:00
Christoph Hellwig	ebd076bf7d	block: use bio_queue_enter instead of blk_queue_enter in bio_poll We want to have a valid live gendisk to call ->poll and not just a request_queue, so call the right helper. Fixes: `3e08773c38` ("block: switch polling to be bio based") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220523124302.526186-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:14:35 -06:00
Christoph Hellwig	403d50341c	block: take destination bvec offsets into account in bio_copy_data_iter Appartly bcache can copy into bios that do not just contain fresh pages but can have offsets into the bio_vecs. Restore support for tht in bio_copy_data_iter. Fixes: `f8b679a070` ("block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220524143919.1155501-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 20:35:55 -06:00
Christoph Hellwig	b9684a71fc	block, loop: support partitions without scanning Historically we did distinguish between a flag that surpressed partition scanning, and a combinations of the minors variable and another flag if any partitions were supported. This was generally confusing and doesn't make much sense, but some corner case uses of the loop driver actually do want to support manually added partitions on a device that does not actively scan for partitions. To make things worsee the loop driver also wants to dynamically toggle the scanning for partitions on a live gendisk, which makes the disk->flags updates non-atomic. Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART in the loop driver. Fixes: `1ebe2e5f9d` ("block: remove GENHD_FL_EXT_DEVT") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 06:39:00 -06:00
Tejun Heo	8a177a36da	blk-iolatency: Fix inflight count imbalances and IO hangs on offline iolatency needs to track the number of inflight IOs per cgroup. As this tracking can be expensive, it is disabled when no cgroup has iolatency configured for the device. To ensure that the inflight counters stay balanced, iolatency_set_limit() freezes the request_queue while manipulating the enabled counter, which ensures that no IO is in flight and thus all counters are zero. Unfortunately, iolatency_set_limit() isn't the only place where the enabled counter is manipulated. iolatency_pd_offline() can also dec the counter and trigger disabling. As this disabling happens without freezing the q, this can easily happen while some IOs are in flight and thus leak the counts. This can be easily demonstrated by turning on iolatency on an one empty cgroup while IOs are in flight in other cgroups and then removing the cgroup. Note that iolatency shouldn't have been enabled elsewhere in the system to ensure that removing the cgroup disables iolatency for the whole device. The following keeps flipping on and off iolatency on sda: echo +io > /sys/fs/cgroup/cgroup.subtree_control while true; do mkdir -p /sys/fs/cgroup/test echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency sleep 1 rmdir /sys/fs/cgroup/test sleep 1 done and there's concurrent fio generating direct rand reads: fio --name test --filename=/dev/sda --direct=1 --rw=randread \ --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k while monitoring with the following drgn script: while True: for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()): for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list): blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node') pd = blkg.pd[prog['blkcg_policy_iolatency'].plid] if pd.value_() == 0: continue iolat = container_of(pd, 'struct iolatency_grp', 'pd') inflight = iolat.rq_wait.inflight.counter.value_() if inflight: print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} ' f'{cgroup_path(css.cgroup).decode("utf-8")}') time.sleep(1) The monitoring output looks like the following: inflight=1 sda /user.slice inflight=1 sda /user.slice ... inflight=14 sda /user.slice inflight=13 sda /user.slice inflight=17 sda /user.slice inflight=15 sda /user.slice inflight=18 sda /user.slice inflight=17 sda /user.slice inflight=20 sda /user.slice inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19 inflight=19 sda /user.slice inflight=19 sda /user.slice If a cgroup with stuck inflight ends up getting throttled, the throttled IOs will never get issued as there's no completion event to wake it up leading to an indefinite hang. This patch fixes the bug by unifying enable handling into a work item which is automatically kicked off from iolatency_set_min_lat_nsec() which is called from both iolatency_set_limit() and iolatency_pd_offline() paths. Punting to a work item is necessary as iolatency_pd_offline() is called under spinlocks while freezing a request_queue requires a sleepable context. This also simplifies the code reducing LOC sans the comments and avoids the unnecessary freezes which were happening whenever a cgroup's latency target is newly set or cleared. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Liu Bo <bo.liu@linux.alibaba.com> Fixes: `8c772a9bfc` ("blk-iolatency: fix IO hang due to negative inflight counter") Cc: stable@vger.kernel.org # v5.0+ Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-26 11:43:00 -06:00
Linus Torvalds	fdaf9a5840	Page cache changes for 5.19 - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A== =djLv -----END PGP SIGNATURE----- Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...	2022-05-24 19:55:07 -07:00
Linus Torvalds	850f6033cd	Description for this pull request: - fix referencing wrong parent directory information during rename. - introduce a sys_tz mount option to use system timezone. - improve performance while zeroing a cluster with dirsync mount option. - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmKLhPkWHGxpbmtpbmpl b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCOeXD/9nJofEx/n9KK0pA1WP2zKoLBnP YgLfnWTBlXfErInklW4kg057S6q8M0pDm0iASLw9P6GZNe0VFV5PigrTyAjy5ghW hm3JiFAHIZgaOlOk2NQd/1Qv/IdlnkbRkngXqHcizxEX/LcKZpP+pb1mdW6NzWrt /HagLClFTUhb0Su3DT7TCqiam5lI+lkarRI0Jo4Scstgsn4aT+25jk9N0bfUih7f hGKJpii+5UWCLlBJnyyghrBRQiiPdsETadJdRnHgeDdzKg/UNWxMP0C+G6PmSko/ mScrR+FeH2toURSUESi1Q558z1+3Fhb8rMbl3aWV70FJDmzMwn9YyhPgBrX3x6Gb AF7UBHFvORStYRUmmSMbX9XkY2gNoI9qZMXghDRlgF8t/WWY8VeVnyslaWwqDQhw qXyOIThiCuhLfKTD+r+MM08oUPcyFBtuGvdzDOH7/b56zEDwzab+hSHZz94xPWEz ESk0hNhaJCEvcgEr7IlSSF4k5Ff+hWVKN4R/DD78yxmjHveOuNMibTeE2YgDIFEX SiKFaAiXUuWGVsAPuAeJ+np/7rW3OEBG8yKhtri5vsoXk+Mqd56rp0EkEHzqbVHQ Ki5gua549KNRuxnbXRtWLjCKucwN86mE45WD0P0ORnBOjlfmgg8adp6BBSW5yVD2 SZkfgI0FL8rWdBwRcQ== =fr78 -----END PGP SIGNATURE----- Merge tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - fix referencing wrong parent directory information during rename - introduce a sys_tz mount option to use system timezone - improve performance while zeroing a cluster with dirsync mount option - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot * tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: check if cluster num is valid exfat: reduce block requests when zeroing a cluster block: add sync_blockdev_range() exfat: introduce mount option 'sys_tz' exfat: fix referencing wrong parent directory information after renaming	2022-05-24 18:30:27 -07:00
Linus Torvalds	5dc921868c	for-5.19/drivers-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrTcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgph/REAC0/7odRfJeTJ1PkJhSKFc7dhyS7rK4du2s 3+z+H6Yeua2yVIJb0mYYGEJcOUUQ9nD2T9424n3NzDOw88U4y8Vg2YEH+UiJBuj4 AJoxPNkQdxL7WzmwHmRNLCcOOFhISLqWiCJSr45d+LP1f6aO24Q9lewYWxtNA4TW mqb7Ne7e3Z77m9rmsCsZ26bzQHg1EEQ6qgjZM9tqMhOeTqYhmrqfrD9KtG8TIkpK N8277E5QcequHf7v6VpKqEOzf3d2kx55JaZdu+oxLPVMED3wJJFwcYF1/xmM7Fgx tp7xCjqqUHXwKvJNCFJpnvw+cXu0Ct7cWOIG4ROCvaTD4vBI1KzZLc0gO7pKFW0Y hNIlMXr4n8PmonS81tMV4TqmRWxedX/jxuaeJCVNr89PqYU4luPpigJZqv7rlGry KZUlktQot22M/7FC2MS6KhgbQKLPrRGTAEyY/JNwBHckCZiduWQFlmKLQ926xQIJ 6vdjSzHK5MrT/d+yow3bGFxAJWloGJ+L+RsH0b+WikF81+6ic9P3AoStgbVilfKD 6sbjcju8SShDlQ+W/Ocm0rHC+i/RDKT3QqItXgfhA/1FfMPODQGc/xcZg+AdTswn VSnUIkvk9/mTO0StilVfNJDfG1QkSpJ5Ilvs/DnIahZj6IG4QbJvtnVNbmQX6ptz AUB4DdGwXg== =geQL -----END PGP SIGNATURE----- Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the driver updates queued up for 5.19. This contains: - NVMe pull requests via Christoph: - tighten the PCI presence check (Stefan Roese) - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, Christoph) - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (Christoph) - MD pull request via Song: - Improve annotation in raid5 code, by Logan Gunthorpe - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk - Other small fixes/cleanups - null_blk series making the configfs side much saner (Damien) - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng, Arnd, Cai) - Avoid using the system workqueue (and hence flushing it) in rnbd (Jack) - Avoid using the system workqueue (and hence flushing it) in aoe (Tetsuo) - Series fixing discard_alignment issues in drivers (Christoph) - Small series fixing drivers poking at disk->part0 for openers information (Christoph) - Series fixing deadlocks in loop (Christoph, Tetsuo) - Remove loop.h and add SPDX headers (Christoph) - Various fixes and cleanups (Julia, Xie, Yu)" * tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits) mtip32xx: fix typo in comment nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements nvme: split the enum used for various register constants nbd: Fix hung on disconnect request if socket is closed before nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL loop: remove most the top-of-file boilerplate comment from the UAPI header loop: remove most the top-of-file boilerplate comment loop: add a SPDX header loop: remove loop.h block: null_blk: Improve device creation with configfs block: null_blk: Cleanup messages block: null_blk: Cleanup device creation and deletion ...	2022-05-23 14:04:14 -07:00
Linus Torvalds	115cd47132	for-5.19/block-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis iomvJ4yk0Q== =0Rw1 -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...	2022-05-23 13:56:39 -07:00
Linus Torvalds	9836e93c0a	for-5.19/io_uring-passthrough-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKovAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv9oD/4qCs7k3bPZZWZ6xoWb4EObyyWOUifi26lp vpsJHFUbA67S/i4++LV9H18SazWJ7h08ac4bjgZ+NQz40/1WkTN8/Fa76jo+BnNK 7T10Wp4Ak6uwWVrKaA81pnT+G9+xmHlJ3X27aKxzLuT7BEPpShZ6ouFVjTkx9CzN LrLjuCDTOBBN+ZoaroWYfdLwTQX2VCAl9B15lOtQIlFvuuU8VlrvLboY+80K8TvY 1wvTA2HTjnXoYx+/cTTMIFZIwQH3r1hsbwEDD8/YJj1+ouhSRQ1b0p/nk2pA+3ws HF5r/YS/rLBjlPF094IzeOBaUyA433AN1VhZqnII8ek7ViT3W3x+BRrgE9O6ZkWT 0AjX1BXReI5rdFmxBmwsSdBnrSoGaJOf2GdsCCdubXBIi+F/RvyajrPf7PTB5zbW 9WEK/uy3xvZsRVkUGAzOb9QGdvjcllgMzwPJsDegDCw5PdcPdT3mzy6KGIWipFLp j8R+br7hRMpOJv/YpihJDMzSDkQ/r1/SCwR4fpLid/QdSHG/eRTQK6c4Su5bNYEy QDy2F6kQdBVtEJCQHcEOsbhXzSTNBcdB+ujUUM5653FkaHe6y4JbomLrsNx407Id i/4ROwA5K1dioJx503Eap+OhbI5rV+PFytJTwxvLrNyVGccwbH2YOVq80fsVBP2e cZbn6EX4Vg== =/peE -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring NVMe command passthrough from Jens Axboe: "On top of everything else, this adds support for passthrough for io_uring. The initial feature for this is NVMe passthrough support, which allows non-filesystem based IO commands and admin commands. To support this, io_uring grows support for SQE and CQE members that are twice as big, allowing to pass in a full NVMe command without having to copy data around. And to complete with more than just a single 32-bit value as the output" * tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits) io_uring: cleanup handling of the two task_work lists nvme: enable uring-passthrough for admin commands nvme: helper for uring-passthrough checks blk-mq: fix passthrough plugging nvme: add vectored-io support for uring-cmd nvme: wire-up uring-cmd support for io-passthru on char-device. nvme: refactor nvme_submit_user_cmd() block: wire-up support for passthrough plugging fs,io_uring: add infrastructure for uring-cmd io_uring: support CQE32 for nop operation io_uring: enable CQE32 io_uring: support CQE32 in /proc info io_uring: add tracing for additional CQE32 fields io_uring: overflow processing for CQE32 io_uring: flush completions for CQE32 io_uring: modify io_get_cqe for CQE32 io_uring: add CQE32 completion processing io_uring: add CQE32 setup processing io_uring: change ring size calculation for CQE32 io_uring: store add. return values for CQE32 ...	2022-05-23 13:06:15 -07:00
Ming Lei	5d05426e2d	blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx blk_mq_run_hw_queues() could be run when there isn't queued request and after queue is cleaned up, at that time tagset is freed, because tagset lifetime is covered by driver, and often freed after blk_cleanup_queue() returns. So don't touch ->tagset for figuring out current default hctx by the mapping built in request queue, so use-after-free on tagset can be avoided. Meantime this way should be fast than retrieving mapping from tagset. Cc: "yukuai (C)" <yukuai3@huawei.com> Cc: Jan Kara <jack@suse.cz> Fixes: `b6e68ee825` ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-23 06:28:28 -06:00
Yuezhang Mo	97d6fb1b48	block: add sync_blockdev_range() sync_blockdev_range() is to support syncing multiple sectors with as few block device requests as possible, it is helpful to make the block device to give full play to its performance. Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Andy Wu <Andy.Wu@sony.com> Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2022-05-23 11:17:30 +09:00
Julia Lawall	2aaf516084	blk-mq: fix typo in comment Spelling mistake (triple letters) in comment. Detected with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/20220521111145.81697-29-Julia.Lawall@inria.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-21 06:32:16 -06:00
Jan Kara	a249ca7dfb	bfq: Remove bfq_requeue_request_body() The function has only a single caller and two lines. Just remove it since it is pointless and just harming readability. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-19 06:52:36 -06:00
Jan Kara	e79cf8892e	bfq: Remove superfluous conversion from RQ_BIC() We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq(). Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no harm currently because struct io_iq is the first one in struct bfq_io_cq. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-19 06:52:36 -06:00
Jan Kara	c5ac56bb61	bfq: Allow current waker to defend against a tentative one The code in bfq_check_waker() ignores wake up events from the current waker. This makes it more likely we select a new tentative waker although the current one is generating more wake up events. Treat current waker the same way as any other process and allow it to reset the waker detection logic. Fixes: `71217df39d` ("block, bfq: make waker-queue detection more robust") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-19 06:52:36 -06:00
Jan Kara	f950667356	bfq: Relax waker detection for shared queues Currently we look for waker only if current queue has no requests. This makes sense for bfq queues with a single process however for shared queues when there is a larger number of processes the condition that queue has no requests is difficult to meet because often at least one process has some request in flight although all the others are waiting for the waker to do the work and this harms throughput. Relax the "no queued request for bfq queue" condition to "the current task has no queued requests yet". For this, we also need to start tracking number of requests in flight for each task. This patch (together with the following one) restores the performance for dbench with 128 clients that regressed with commit `c65e6fd460` ("bfq: Do not let waker requests skip proper accounting") because this commit makes requests of wakers properly enter BFQ queues and thus these queues become ineligible for the old waker detection logic. Dbench results: Vanilla 5.18-rc3 5.18-rc3 + revert 5.18-rc3 patched Mean 1237.36 ( 0.00%) 950.16 * 23.21%* 988.35 * 20.12%* Numbers are time to complete workload so lower is better. Fixes: `c65e6fd460` ("bfq: Do not let waker requests skip proper accounting") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-19 06:52:33 -06:00
Jens Axboe	1305e2c9d9	blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() A previous commit got rid of unnecessary rcu_read_lock() inside the IRQ disabling queue_lock, but this debug statement was left. It's now firing since we are indeed not inside a RCU read lock, but we don't need to be as we're still preempt safe. Get rid of the check, as we have a lockdep assert for holding the queue lock right after it anyway. Link: https://lore.kernel.org/linux-block/46253c48-81cb-0787-20ad-9133afdd9e21@samsung.com/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Fixes: `77c570a1ea` ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 16:32:00 -06:00
Laibin Qiu	5a011f889b	blk-throttle: Set BIO_THROTTLED when bio has been throttled 1.In current process, all bio will set the BIO_THROTTLED flag after __blk_throtl_bio(). 2.If bio needs to be throttled, it will start the timer and stop submit bio directly. Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.But in the current process, if bio is throttled. The BIO_THROTTLED will be set to bio after timer start. If the bio has been completed, it may cause use-after-free blow. BUG: KASAN: use-after-free in blk_throtl_bio+0x12f0/0x2c70 Read of size 2 at addr ffff88801b8902d4 by task fio/26380 dump_stack+0x9b/0xce print_address_description.constprop.6+0x3e/0x60 kasan_report.cold.9+0x22/0x3a blk_throtl_bio+0x12f0/0x2c70 submit_bio_checks+0x701/0x1550 submit_bio_noacct+0x83/0xc80 submit_bio+0xa7/0x330 mpage_readahead+0x380/0x500 read_pages+0x1c1/0xbf0 page_cache_ra_unbounded+0x471/0x6f0 do_page_cache_ra+0xda/0x110 ondemand_readahead+0x442/0xae0 page_cache_async_ra+0x210/0x300 generic_file_buffered_read+0x4d9/0x2130 generic_file_read_iter+0x315/0x490 blkdev_read_iter+0x113/0x1b0 aio_read+0x2ad/0x450 io_submit_one+0xc8e/0x1d60 __se_sys_io_submit+0x125/0x350 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Allocated by task 26380: kasan_save_stack+0x19/0x40 __kasan_kmalloc.constprop.2+0xc1/0xd0 kmem_cache_alloc+0x146/0x440 mempool_alloc+0x125/0x2f0 bio_alloc_bioset+0x353/0x590 mpage_alloc+0x3b/0x240 do_mpage_readpage+0xddf/0x1ef0 mpage_readahead+0x264/0x500 read_pages+0x1c1/0xbf0 page_cache_ra_unbounded+0x471/0x6f0 do_page_cache_ra+0xda/0x110 ondemand_readahead+0x442/0xae0 page_cache_async_ra+0x210/0x300 generic_file_buffered_read+0x4d9/0x2130 generic_file_read_iter+0x315/0x490 blkdev_read_iter+0x113/0x1b0 aio_read+0x2ad/0x450 io_submit_one+0xc8e/0x1d60 __se_sys_io_submit+0x125/0x350 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 0: kasan_save_stack+0x19/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x1b/0x30 __kasan_slab_free+0x111/0x160 kmem_cache_free+0x94/0x460 mempool_free+0xd6/0x320 bio_free+0xe0/0x130 bio_put+0xab/0xe0 bio_endio+0x3a6/0x5d0 blk_update_request+0x590/0x1370 scsi_end_request+0x7d/0x400 scsi_io_completion+0x1aa/0xe50 scsi_softirq_done+0x11b/0x240 blk_mq_complete_request+0xd4/0x120 scsi_mq_done+0xf0/0x200 virtscsi_vq_done+0xbc/0x150 vring_interrupt+0x179/0x390 __handle_irq_event_percpu+0xf7/0x490 handle_irq_event_percpu+0x7b/0x160 handle_irq_event+0xcc/0x170 handle_edge_irq+0x215/0xb20 common_interrupt+0x60/0x120 asm_common_interrupt+0x1e/0x40 Fix this by move BIO_THROTTLED set into the queue_lock. Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220301123919.2381579-1-qiulaibin@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-17 19:32:10 -06:00
Fanjun Kong	77c570a1ea	blk-cgroup: Remove unnecessary rcu_read_lock/unlock() spin_lock_irq/spin_unlock_irq contains preempt_disable/enable(). Which can serve as RCU read-side critical region, so remove rcu_read_lock/unlock(). Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220516173930.159535-1-bh1scw@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-17 06:12:23 -06:00
Wolfgang Bumiller	3607849df4	blk-cgroup: always terminate io.stat lines With the removal of seq_get_buf in blkcg_print_one_stat, we cannot make adding the newline conditional on there being relevant stats because the name was already written out unconditionally. Otherwise we may end up with multiple device names in one line which is confusing and doesn't follow the nested-keyed file format. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Fixes: `252c651a4c` ("blk-cgroup: stop using seq_get_buf") Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220111083159.42340-1-w.bumiller@proxmox.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-17 06:11:17 -06:00
Yu Kuai	ddc25c86b4	block, bfq: make bfq_has_work() more accurate bfq_has_work() is using busy_queues currently, which is not accurate because bfq_queue is busy doesn't represent that it has requests. Since bfqd aready has a counter 'queued' to record how many requests are in bfq, use it instead of busy_queues. Noted that bfq_has_work() can be called with 'bfqd->lock' held, thus the lock can't be held in bfq_has_work() to protect 'bfqd->queued'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220513023507.2625717-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-16 11:39:20 -06:00
Yu Kuai	181490d532	block, bfq: protect 'bfqd->queued' by 'bfqd->lock' If bfq_schedule_dispatch() is called from bfq_idle_slice_timer_body(), then 'bfqd->queued' is read without holding 'bfqd->lock'. This is wrong since it can be wrote concurrently. Fix the problem by holding 'bfqd->lock' in such case. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220513023507.2625717-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-16 11:39:20 -06:00
Christoph Hellwig	a3e7689bfa	block: cleanup the VM accounting in submit_bio submit_bio uses some extremely convoluted checks and confusing comments to only account REQ_OP_READ/REQ_OP_WRITE comments. Just switch to the plain obvious checks instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220516063654.2782792-1-hch@lst.de [axboe: fixup WRITE -> REQ_OP_WRITE] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-16 11:37:50 -06:00
Bart Van Assche	725f22a147	block/mq-deadline: Set the fifo_time member also if inserting at head Before commit `322cff70d4` the fifo_time member of requests on a dispatch list was not used. Commit `322cff70d4` introduces code that reads the fifo_time member of requests on dispatch lists. Hence this patch that sets the fifo_time member when adding a request to a dispatch list. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Fixes: `322cff70d4` ("block/mq-deadline: Prioritize high-priority requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 17:02:46 -06:00
Ming Lei	a327c341dc	blk-mq: fix passthrough plugging First we can't add request into plug list in blk_mq_request_bypass_insert which may be called when flushing plug list, so nested plug is caused. Second if polled passthrough request is inserted via blk_execute_rq(), it can't be added to plug list too since io polling needs the request to be issued to driver. Fixes the two by moving plugging into blk_execute_rq_no_wait(). Cc: Christoph Hellwig <hch@lst.de> Fixes: `1c2d2fff6d` ("block: wire-up support for passthrough plugging") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-12 08:45:39 -06:00
Chengming Zhou	2a371f7d5f	blk-iocost: combine local_stat and desc_stat to stat When we flush usage, wait, indebt stat in iocg_flush_stat(), we use local_stat and desc_stat, which has no point since the leaf iocg only has local_stat and the inner iocg only has desc_stat. Also we don't need to flush percpu abs_vusage for these inner iocgs. This patch combine local_stat and desc_stat to stat, only flush percpu abs_vusage for active leaf iocgs, then build inner walk list to propagate. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220510034757.21761-1-zhouchengming@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-11 17:05:52 -06:00
Jens Axboe	1c2d2fff6d	block: wire-up support for passthrough plugging Add support for plugging in passthrough path. When plugging is enabled, the requests are added to a plug instead of getting dispatched to the driver. And when the plug is finished, the whole batch gets dispatched via ->queue_rqs which turns out to be more efficient. Otherwise dispatching used to happen via ->queue_rq, one request at a time. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-11 07:40:58 -06:00
Matthew Wilcox (Oracle)	2c69e20579	fs: Convert block_read_full_page() to block_read_full_folio() This function is NOT converted to handle large folios, so include an assert that the filesystem isn't passing one in. Otherwise, use the folio functions instead of the page functions, where they exist. Convert all filesystems which use block_read_full_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-05-09 16:21:44 -04:00
Matthew Wilcox (Oracle)	9d6b0cd757	fs: Remove flags parameter from aops->write_begin There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)	b3992d1e2e	fs: Remove aop flags parameter from block_write_begin() There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-05-08 14:28:19 -04:00
Christoph Hellwig	069adbac2c	block: improve the error message from bio_check_eod Print the start sector and length separately instead of the combined value to help with debugging. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220504143355.568660-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 18:30:10 -06:00
Christoph Hellwig	7ecc56c62b	block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone Device mapper wants to allocate a bio before knowing the device it gets send to, so add explicit support for that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220504142950.567582-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 18:29:52 -06:00
Christoph Hellwig	513616843d	block: remove superfluous calls to blkcg_bio_issue_init blkcg_bio_issue_init is called in submit_bio. There is no need to have extra calls that just get overriden in __bio_clone and the two places that copy and pasted from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220504142950.567582-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 18:29:52 -06:00

1 2 3 4 5 ...

6249 Commits