OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	462abc9de7	Merge tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Christoph - Quirks, quirks, quirks to work around buggy consumer grade devices (Keith Busch, Ning Wang, Stefan Reiter, Rasheed Hsueh) - Better kernel messages for devices that need quirking (Keith Busch) - Make a kernel message more useful (Thomas Weißschuh) - MD pull request from Song, with a few fixes - blk-mq sysfs locking fixes (Ming) - BFQ stats fix (Bart) - blk-mq offline queue fix (Bart) - blk-mq flush request tag fix (Ming) * tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block: block/bfq: Enable I/O statistics blk-mq: don't clear flush_rq from tags->rqs[] blk-mq: avoid to touch q->elevator without any protection blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none block: Fix handling of offline queues in blk_mq_alloc_request_hctx() md/raid5-ppl: Fix argument order in bio_alloc_bioset() Revert "md: don't unregister sync_thread with reconfig_mutex held" nvme-pci: disable write zeros support on UMIC and Samsung SSDs 
nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs nvme-pci: sk hynix p31 has bogus namespace ids nvme-pci: smi has bogus namespace ids nvme-pci: phison e12 has bogus namespace ids nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50 nvme-pci: add trouble shooting steps for timeouts nvme: add bug report info for global duplicate id nvme: add device name to warning in uuid_show()	2022-06-17 11:22:58 -07:00
Mikulas Patocka	85e123c27d	dm mirror log: round up region bitmap size to BITS_PER_LONG The code in dm-log rounds up bitset_size to 32 bits. It then uses find_next_zero_bit_le on the allocated region. find_next_zero_bit_le accesses the bitmap using unsigned long pointers. So, on 64-bit architectures, it may access 4 bytes beyond the allocated size. Fix this bug by rounding up bitset_size to BITS_PER_LONG. This bug was found by running the lvm2 testsuite with kasan. Fixes: `29121bd0b0` ("[PATCH] dm mirror log: bitset_size fix") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:39:29 -04:00
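The size arithmetic behind the dm-log fix above can be sketched in user-space C. This is a minimal illustration, not the dm-log code: the names `sketch_round_up` and `sketch_bitset_bytes` are hypothetical, and the word width is passed as a parameter so that the 32-bit and 64-bit cases can be compared on any machine.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Round n up to the next multiple of align; align must be a power of two. */
static size_t sketch_round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/*
 * Bytes needed for a bitmap that tracks region_count regions when the
 * bit count is padded to a multiple of word_bits (hypothetical helper).
 */
static size_t sketch_bitset_bytes(size_t region_count, size_t word_bits)
{
    return sketch_round_up(region_count, word_bits) / CHAR_BIT;
}
```

With one region, padding to 32 bits yields a 4-byte bitmap while padding to 64 bits yields 8 bytes. A scan that reads whole 64-bit words over the 4-byte allocation touches 4 bytes past its end, which is the overrun the commit message describes.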
Mikulas Patocka	1ee88de395	dm: fix narrow race for REQ_NOWAIT bios being issued despite no support Starting with the commit 63a225c9fd20, device mapper has an optimization that it will take a cheaper table lock (dm_get_live_table_fast instead of dm_get_live_table) if the bio has REQ_NOWAIT. The bios with REQ_NOWAIT must not block in the target request routine; if they did, we would be blocking while holding rcu_read_lock, which is prohibited. The targets that are suitable for REQ_NOWAIT optimization (and that don't block in the map routine) have the flag DM_TARGET_NOWAIT set. Device mapper will test if all the targets and all the devices in a table support nowait (see the function dm_table_supports_nowait) and it will set or clear the QUEUE_FLAG_NOWAIT flag on its request queue according to this check. There's a test in submit_bio_noacct: "if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q)) goto not_supported" - this will make sure that REQ_NOWAIT bios can't enter a request queue that doesn't support them. This mechanism works to prevent REQ_NOWAIT bios from reaching dm targets that don't support the REQ_NOWAIT flag (and that may block in the map routine) - except that there is a small race condition: submit_bio_noacct checks if the queue has the QUEUE_FLAG_NOWAIT without holding any locks. Immediately after this check, the device mapper table may be reloaded with a table that doesn't support REQ_NOWAIT (for example, if we start moving the logical volume or if we activate a snapshot). However, the REQ_NOWAIT bio that already passed the check in submit_bio_noacct would be sent to device mapper, where it could be redirected to a dm target that doesn't support REQ_NOWAIT - the result is sleeping while we hold rcu_read_lock. In order to fix this race, we double-check if the target supports REQ_NOWAIT while we hold the table lock (so that the table can't change under us). 
Fixes: `563a225c9f` ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:39:02 -04:00
Mikulas Patocka	5d7362d0d5	dm: fix use-after-free in dm_put_live_table_bio dm_put_live_table_bio is called from the end of dm_submit_bio. However, at this point, the bio may be already finished and the caller may have freed the bio. Consequently, dm_put_live_table_bio accesses the stale "bio" pointer. Fix this bug by loading the bi_opf value and passing it to dm_get_live_table_bio and dm_put_live_table_bio instead of the bio. This bug was found by running the lvm2 testsuite with kasan. Fixes: `563a225c9f` ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-16 19:38:49 -04:00
Logan Gunthorpe	f34fdcd4a0	md/raid5-ppl: Fix argument order in bio_alloc_bioset() bio_alloc_bioset() takes a block device, number of vectors, the OP flags, the GFP mask and the bio set. However, when the prototype was changed, the call site in ppl_do_flush() had the OP flags and the GFP flags reversed. This introduced some sparse errors: drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3 (different base types) drivers/md/raid5-ppl.c:632:57: expected unsigned int opf drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype] drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4 (different base types) drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype] gfp_mask drivers/md/raid5-ppl.c:633:61: got unsigned long long The sparse error introduction may not have been reported correctly by 0day due to other work that was cleaning up other sparse errors in this area. Fixes: `609be10667` ("block: pass a block_device and opf to bio_alloc_bioset") Cc: stable@vger.kernel.org # 5.18+ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-06-15 10:32:48 -07:00
Guoqing Jiang	d0a180341f	Revert "md: don't unregister sync_thread with reconfig_mutex held" The 07reshape5intr test is broken because of the path below. md_reap_sync_thread -> mddev_unlock -> md_unregister_thread(&mddev->sync_thread) And md_check_recovery is triggered by mddev_unlock -> md_wakeup_thread(mddev->thread), then mddev->reshape_position is set to MaxSector in raid5_finish_reshape since MD_RECOVERY_INTR is cleared in md_check_recovery, which means feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's reshape_position can't be updated accordingly. Fixes: `8b48ec23cc` ("md: don't unregister sync_thread with reconfig_mutex held") Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-06-15 10:30:14 -07:00
Benjamin Marzinski	10eb3a0d51	dm: fix race in dm_start_io_acct After commit `82f6cdcc36` ("dm: switch dm_io booleans over to proper flags") dm_start_io_acct stopped atomically checking and setting was_accounted, which turned into the DM_IO_ACCOUNTED flag. This opened the possibility for a race where IO accounting is started twice for duplicate bios. To remove the race, check the flag while holding the io->lock. Fixes: `82f6cdcc36` ("dm: switch dm_io booleans over to proper flags") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-15 11:51:41 -04:00
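The pattern named in the dm_start_io_acct fix above, testing and setting a flag inside one critical section so that only one caller proceeds, can be sketched as follows. This is an assumption-laden user-space model rather than the kernel code: `struct sketch_io`, `SKETCH_IO_ACCOUNTED` and the `atomic_flag` spinlock are hypothetical stand-ins for `struct dm_io`, `DM_IO_ACCOUNTED` and `io->lock`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IO_ACCOUNTED 0x1u  /* stand-in for the DM_IO_ACCOUNTED flag */

struct sketch_io {
    atomic_flag lock;    /* stand-in for io->lock */
    unsigned int flags;  /* only read or written while lock is held */
};

static void sketch_io_init(struct sketch_io *io)
{
    atomic_flag_clear(&io->lock);
    io->flags = 0;
}

static void sketch_lock(struct sketch_io *io)
{
    while (atomic_flag_test_and_set_explicit(&io->lock, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

static void sketch_unlock(struct sketch_io *io)
{
    atomic_flag_clear_explicit(&io->lock, memory_order_release);
}

/* Returns true only for the single caller that should start accounting. */
static bool sketch_start_io_acct(struct sketch_io *io)
{
    bool first;

    assert(io != NULL);
    sketch_lock(io);
    first = !(io->flags & SKETCH_IO_ACCOUNTED);  /* check ... */
    if (first)
        io->flags |= SKETCH_IO_ACCOUNTED;        /* ... and set, under the lock */
    sketch_unlock(io);
    return first;
}
```

Because the check and the set happen under the same lock, two callers racing on duplicate bios cannot both observe the flag as clear; the second call returns false.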
Mike Snitzer	dddf305640	dm: fix zoned locking imbalance due to needless check in clone_endio After the commit `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone"), clone_endio() only calls dm_zone_endio() when DM targets remap the clone bio's bdev to something other than the md->disk->part0 default. However, if a DM target (e.g. dm-crypt) stacked on top of dm-zoned does not remap the clone bio using bio_set_dev(), then dm_zone_endio() is not called at completion of the bios and zone locks are not properly unlocked. This triggers a hang, in dm_zone_map_bio(), when blktests block/004 is run for dm-crypt on zoned block devices. To avoid the hang, simply remove the clone_endio() check that verifies the target remapped the clone bio to a device other than the default. Fixes: `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-10 15:23:54 -04:00
Christoph Hellwig	29dec90a0f	dm: fix bio_set allocation The use of bioset_init_from_src means that the pre-allocated pools weren't used for anything except parameter passing, and the integrity pool creation got completely lost for the actual live mapped_device. Fix that by assigning the actual preallocated dm_md_mempools to the mapped_device and using that for I/O instead of creating new mempools. Fixes: `2a2a4c510b` ("dm: use bioset_init_from_src() to copy bio_set") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-06-08 14:04:14 -04:00
Linus Torvalds	78c6499c92	Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block Pull more block driver updates from Jens Axboe: "A collection of stragglers that were late on sending in their changes and just followup fixes. 
- NVMe fixes pull request via Christoph: - set controller enable bit in a separate write (Niklas Cassel) - disable namespace identifiers for the MAXIO MAP1001 (Christoph) - fix a comment typo (Julia Lawall)" - MD fixes pull request via Song: - Remove uses of bdevname (Christoph Hellwig) - Bug fixes (Guoqing Jiang, and Xiao Ni) - bcache fixes series (Coly) - null_blk zoned write fix (Damien) - nbd fixes (Yu, Zhang) - Fix for loop partition scanning (Christoph)" * tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits) block: null_blk: Fix null_zone_write() nvmet: fix typo in comment nvme: set controller enable bit in a separate write nvme-pci: disable namespace identifiers for the MAXIO MAP1001 bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() nbd: use pr_err to output error message nbd: fix possible overflow on 'first_minor' in nbd_dev_add() nbd: fix io hung while disconnecting device nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed nbd: fix race between nbd_alloc_config() and module removal nbd: call genl_unregister_family() first in nbd_cleanup() md: bcache: check the return value of kzalloc() in detached_dev_do_request() bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() block, loop: support partitions without scanning bcache: avoid journal no-space deadlock by reserving 1 journal bucket bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() bcache: improve multithreaded bch_sectors_dirty_init() bcache: improve multithreaded bch_btree_check() md: fix double free of io_acct_set bioset md: Don't set mddev private to NULL in raid0 pers->free ...	2022-06-03 10:25:56 -07:00
Linus Torvalds	fa78526acc	Merge tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core's dm_table_supports_poll to return false if target has no data devices. - Fix DM verity target so that it cannot be switched to a different DM target type (e.g. dm-linear) via DM table reload. * tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: set DM_TARGET_IMMUTABLE feature flag dm table: fix dm_table_supports_poll to return false if no data devices	2022-06-01 14:25:04 -07:00
Sarthak Kukreti	4caae58406	dm verity: set DM_TARGET_IMMUTABLE feature flag The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type. Fixes: `a4ffc15219` ("dm: add verity target") Cc: stable@vger.kernel.org Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-31 16:22:30 -04:00
Mike Snitzer	9571f829f3	dm table: fix dm_table_supports_poll to return false if no data devices It was reported that the "generic/250" test in xfstests (which uses the dm-error target) demonstrates a regression where the kernel crashes in bioset_exit(). Since commit `cfc97abcbe` ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") the bioset_init() for the dm_io bioset will set up the bioset's per-cpu alloc cache if all devices have QUEUE_FLAG_POLL set. But there was a bug where a target that doesn't have any data devices (and that doesn't even set the .iterate_devices dm target callback) will incorrectly return true from dm_table_supports_poll(). Fix this by updating dm_table_supports_poll() to follow dm-table.c's well-worn pattern for testing that _all_ targets in a DM table do in fact have underlying devices that set QUEUE_FLAG_POLL. NOTE: An additional block fix is still needed so that bio_alloc_cache_destroy() clears the bioset's ->cache member. Otherwise, a DM device's table reload that transitions the DM device's bioset from using a per-cpu alloc cache to _not_ using one will result in bioset_exit() crashing in bio_alloc_cache_destroy() because dm's dm_io bioset ("io_bs") was left with a stale ->cache member. Fixes: `cfc97abcbe` ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset") Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-31 14:44:17 -04:00
Coly Li	a1a2d8f016	bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() The kworker routine update_writeback_rate() is scheduled to update the writeback rate every 5 seconds by default. Before calling __update_writeback_rate() to do the real job, semaphore dc->writeback_lock should be held by the kworker routine. At the same time, bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back into the backing device. If the dirty data set is large, it might take a very long time for bch_writeback_thread() to scan all dirty buckets and release dc->writeback_lock. In such a case update_writeback_rate() can be starved for long enough that the kernel reports a soft lockup warning starting like: watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713] Such a soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is fine indeed. This patch avoids the unnecessary soft lockup by the following method, - Add new member to struct cached_dev - dc->rate_update_retry (0 by default) - In update_writeback_rate() call down_read_trylock(&dc->writeback_lock) first; if it fails then lock contention happens. - If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't acquire the lock and reschedule the kworker for the next try. - If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, stop retrying and call down_read(&dc->writeback_lock) to wait for the lock. By the above method, in the worst case update_writeback_rate() may retry for 1+ minutes before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the unnecessary soft lockup warning messages can be avoided. When retrying to acquire dc->writeback_lock in update_writeback_rate(), of course the writeback rate cannot be updated. 
It is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated either. This change follows Jens Axboe's suggestion for a clearer and simpler version. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-28 06:48:26 -06:00
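The bounded-skip decision listed in the bcache commit message above can be modelled as a small state machine. This is a sketch under assumptions, not the bcache implementation: the result of `down_read_trylock()` is passed in as a boolean, the names `sketch_dev`, `sketch_action` and `sketch_rate_update_step` are hypothetical, and resetting the counter after a successful update is an assumption that the commit message does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_SKIPS 15u  /* BCH_WBRATE_UPDATE_MAX_SKIPS per the commit message */

enum sketch_action {
    SKETCH_UPDATE_NOW,        /* trylock succeeded: update the rate */
    SKETCH_RESCHEDULE,        /* contention: skip this round, try again later */
    SKETCH_BLOCK_THEN_UPDATE  /* too many skips: wait for the lock, then update */
};

struct sketch_dev {
    unsigned int rate_update_retry;  /* 0 by default */
};

/* trylock_ok models the result of a non-blocking lock attempt. */
static enum sketch_action
sketch_rate_update_step(struct sketch_dev *dc, bool trylock_ok)
{
    assert(dc != NULL);

    if (trylock_ok) {
        dc->rate_update_retry = 0;  /* assumption: reset once an update runs */
        return SKETCH_UPDATE_NOW;
    }
    if (dc->rate_update_retry <= SKETCH_MAX_SKIPS) {
        dc->rate_update_retry++;
        return SKETCH_RESCHEDULE;
    }
    dc->rate_update_retry = 0;
    return SKETCH_BLOCK_THEN_UPDATE;
}
```

In this model, 16 consecutive contended rounds are skipped before the 17th one blocks. At one round every 5 seconds that is about 80 seconds, in line with the "1+ minutes" worst case given in the message.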
Linus Torvalds	35cdd8656e	Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue	2022-05-27 15:49:30 -07:00
Jia-Ju Bai	40f567bbb3	md: bcache: check the return value of kzalloc() in detached_dev_do_request() The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked. Fixes: `bc082a55d2` ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 09:49:48 -06:00
Coly Li	7d6b902ea0	bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled by 0, because before allocating them on stack, they were dynamically allocated by kzalloc(). Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-27 09:49:48 -06:00
Linus Torvalds	7e284070ab	Merge tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set. 
This change improves DM's hipri bio polling (REQ_POLLED) performance by 7 - 20% depending on the system. - Update DM core to use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. - Various DM core changes to reduce bio-based DM overhead and simplify IO accounting. - Fundamental DM core improvements to dm_io reference counting and the elimination of using bio_split()+bio_chain() -- instead DM's bio-based IO accounting is updated to account that a split occurred. - Improve DM core's abnormal bio processing to do less work. - Improve DM core's hipri polling support to use a single list rather than an hlist. - Update DM core to pass NULL bdev to bio_alloc_clone() so that initialization that isn't useful for DM can be elided. - Add cond_resched to DM stats' various loops that loop over all entries. - Fix incorrect error code return from DM integrity's constructor. - Make DM crypt's printing of the key constant-time. - Update bio-based DM multipath to provide high-resolution timer to the Historical Service Time (HST) path selector. 
* tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm: pass NULL bdev to bio_alloc_clone dm cache metadata: remove unnecessary variable in __dump_mapping dm mpath: provide high-resolution timer to HST for bio-based dm crypt: make printing of the key constant-time dm integrity: fix error code in dm_integrity_ctr() dm stats: add cond_resched when looping over entries dm: improve abnormal bio processing dm: simplify bio-based IO accounting further dm: put all polled dm_io instances into a single list dm: improve dm_io reference counting dm: don't grab target io reference in dm_zone_map_bio dm: improve bio splitting and associated IO accounting dm: switch to bdev based IO accounting interfaces dm: pass dm_io instance to dm_io_acct directly dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm: use bio_sectors in dm_aceept_partial_bio dm: simplify basic targets dm: conditionally enable branching for less used features dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio dm: move hot dm_io members to same cacheline as dm_target_io ...	2022-05-26 21:13:45 -07:00
Coly Li	32feee36c3	bcache: avoid journal no-space deadlock by reserving 1 journal bucket The journal no-space deadlock was reported from time to time. Such a deadlock can happen in the following situation. When all journal buckets are fully filled by active jsets under heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and insert them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results in a btree node split, a new journal request might be triggered. For example, if the btree grows one more level after the node split, then the root node record in the cache device super block will be updated by bch_journal_meta() from bch_btree_set_root(). But there is no space in the journal buckets, so the journal replay has to wait for a new journal bucket to be reclaimed after at least one journal bucket is replayed. This is one example of how the journal no-space deadlock happens. The solution to avoid the deadlock is to reserve 1 journal bucket at run time, and only permit the reserved journal bucket to be used during the cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, and there is no chance for the journal no-space deadlock to happen anymore. This patch adds a new member "bool do_reserve" in struct journal. It is initialized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization is done in run_cache_set(). At run time, when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up, and try next time to see whether there is a free journal bucket to allocate. 
By this method, there is always 1 journal bucket reserved at run time. During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock. Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
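The reservation rule described in the journal commit message above can be sketched as follows. This is a simplified model based only on that description and is not the bcache bucket accounting: `struct sketch_journal`, `sketch_free_journal_buckets` and `sketch_take_bucket` are hypothetical names, and the free-bucket count is kept as a plain counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_journal {
    unsigned int free_buckets;  /* journal buckets not holding active jsets */
    bool do_reserve;            /* false during registration, true at run time */
};

/* Free buckets the caller may use; the last one is hidden while reserving. */
static unsigned int sketch_free_journal_buckets(const struct sketch_journal *j)
{
    assert(j != NULL);
    if (j->do_reserve && j->free_buckets <= 1)
        return 0;
    return j->free_buckets;
}

/* Models a reclaim attempt: take a bucket only if one is usable. */
static bool sketch_take_bucket(struct sketch_journal *j)
{
    if (sketch_free_journal_buckets(j) == 0)
        return false;  /* give up now and try again later */
    j->free_buckets--;
    return true;
}
```

With two free buckets and `do_reserve` set, the first attempt succeeds and the second is refused, so one bucket remains. Clearing `do_reserve`, as during cache set registration, makes that last bucket available for journal replay.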
Coly Li	80db4e4707	bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() After bch_sectors_dirty_init() was made multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on the btree root node to prevent the btree from being split during the dirty sectors counting, other I/O requesters have no chance to gain the write lock even if bcache_btree() is restarted. That is to say, the incremental dirty sectors counting is incompatible with the multithreaded bch_sectors_dirty_init(). We have to choose one and drop the other. In my testing, with 512 bytes random writes, I generate 1.2T dirty data and a btree with 400K nodes. With a single thread and incremental dirty sectors counting, it takes 30+ minutes to register the backing device. And with multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes. The difference between 30+ minutes and under 2 minutes makes me decide to keep multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does. But INIT_KEYS_EACH_TIME is kept; in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to prevent the watchdog from reporting a bogus soft lockup warning. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Coly Li	4dc34ae1b4	bcache: improve multithreaded bch_sectors_dirty_init() Commit `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() to be much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet, still can be improved. This patch does the following changes to improve current parallel dirty keys iteration on the btree, - Add read lock to root node when multiple threads iterating the btree, to prevent the root node gets split by I/Os from other registered bcache devices. - Remove local variable "char name[32]" and generate kernel thread name string directly when calling kthread_run(). - Allocate "struct bch_dirty_init_state state" directly on stack and avoid the unnecessary dynamic memory allocation for it. - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed. - Increase &state->started to count created kernel thread after it succeeds to create. - When wait for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible(). With the above changes, the code is more clear, and some potential error conditions are avoided. Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Coly Li	622536443b	bcache: improve multithreaded bch_btree_check() Commit `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() much faster when checking all btree nodes during cache device registration. But it isn't in ideal shape yet and can still be improved. This patch does the following things to improve the current parallel btree node check by multiple threads in bch_btree_check(), - Add read lock to root node while checking all the btree nodes with multiple threads. Although currently it is not mandatory, it is good to have a read lock in the code logic. - Remove local variable 'char name[32]', and generate kernel thread name string directly when calling kthread_run(). - Allocate local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it. - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed. - Increase check_state->started to count a created kernel thread after it is successfully created. - When waiting for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible(). With this change, the code is clearer, and some potential error conditions are avoided. Fixes: `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-24 06:19:33 -06:00
Linus Torvalds	5dc921868c	for-5.19/drivers-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrTcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgph/REAC0/7odRfJeTJ1PkJhSKFc7dhyS7rK4du2s 3+z+H6Yeua2yVIJb0mYYGEJcOUUQ9nD2T9424n3NzDOw88U4y8Vg2YEH+UiJBuj4 AJoxPNkQdxL7WzmwHmRNLCcOOFhISLqWiCJSr45d+LP1f6aO24Q9lewYWxtNA4TW mqb7Ne7e3Z77m9rmsCsZ26bzQHg1EEQ6qgjZM9tqMhOeTqYhmrqfrD9KtG8TIkpK N8277E5QcequHf7v6VpKqEOzf3d2kx55JaZdu+oxLPVMED3wJJFwcYF1/xmM7Fgx tp7xCjqqUHXwKvJNCFJpnvw+cXu0Ct7cWOIG4ROCvaTD4vBI1KzZLc0gO7pKFW0Y hNIlMXr4n8PmonS81tMV4TqmRWxedX/jxuaeJCVNr89PqYU4luPpigJZqv7rlGry KZUlktQot22M/7FC2MS6KhgbQKLPrRGTAEyY/JNwBHckCZiduWQFlmKLQ926xQIJ 6vdjSzHK5MrT/d+yow3bGFxAJWloGJ+L+RsH0b+WikF81+6ic9P3AoStgbVilfKD 6sbjcju8SShDlQ+W/Ocm0rHC+i/RDKT3QqItXgfhA/1FfMPODQGc/xcZg+AdTswn VSnUIkvk9/mTO0StilVfNJDfG1QkSpJ5Ilvs/DnIahZj6IG4QbJvtnVNbmQX6ptz AUB4DdGwXg== =geQL -----END PGP SIGNATURE----- Merge tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the driver updates queued up for 5.19. 
This contains: - NVMe pull requests via Christoph: - tighten the PCI presence check (Stefan Roese) - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, Christoph) - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (Christoph) - MD pull request via Song: - Improve annotation in raid5 code, by Logan Gunthorpe - Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk - Other small fixes/cleanups - null_blk series making the configfs side much saner (Damien) - Various minor drbd cleanups and fixes (Haowen, Uladzislau, Jiapeng, Arnd, Cai) - Avoid using the system workqueue (and hence flushing it) in rnbd (Jack) - Avoid using the system workqueue (and hence flushing it) in aoe (Tetsuo) - Series fixing discard_alignment issues in drivers (Christoph) - Small series fixing drivers poking at disk->part0 for openers information (Christoph) - Series fixing deadlocks in loop (Christoph, Tetsuo) - Remove loop.h and add SPDX headers (Christoph) - Various fixes and cleanups (Julia, Xie, Yu)" * tag 'for-5.19/drivers-2022-05-22' of git://git.kernel.dk/linux-block: (72 commits) mtip32xx: fix typo in comment nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements nvme: split the enum used for various register constants nbd: Fix hung on disconnect request if socket is closed before nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix 
interpretation of DMRSL loop: remove most the top-of-file boilerplate comment from the UAPI header loop: remove most the top-of-file boilerplate comment loop: add a SPDX header loop: remove loop.h block: null_blk: Improve device creation with configfs block: null_blk: Cleanup messages block: null_blk: Cleanup device creation and deletion ...	2022-05-23 14:04:14 -07:00
Linus Torvalds	115cd47132	for-5.19/block-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis iomvJ4yk0Q== =0Rw1 -----END PGP SIGNATURE----- Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. 
bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various loose ends in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...	2022-05-23 13:56:39 -07:00
Xiao Ni	42b805af10	md: fix double free of io_acct_set bioset Now io_acct_set is alloc and free in personality. Remove the codes that free io_acct_set in md_free and md_stop. Fixes: `0c031fd37f` (md: Move alloc/free acct bioset in to personality) Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:22 -07:00
Xiao Ni	0f2571ad7a	md: Don't set mddev private to NULL in raid0 pers->free In normal stop process, it does like this: do_md_stop \| __md_stop (pers->free(); mddev->private=NULL) \| md_free (free mddev) __md_stop sets mddev->private to NULL after pers->free. The raid device will be stopped and mddev memory is free. But in reshape, it doesn't free the mddev and mddev will still be used in new raid. In reshape, it first sets mddev->private to new_pers and then runs old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free. The new raid can't work anymore. It will panic when dereference mddev->private because of NULL pointer dereference. It can panic like this: [63010.814972] kernel BUG at drivers/md/raid10.c:928! [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1 [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020 [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10] [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246 [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000 [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800 [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000 [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200 [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003 [63010.906399] FS: 0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000 [63010.914485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0 [63010.927363] Call Trace: [63010.929822] ? bio_reset+0xe/0x40 [63010.933144] ? 
raid10_alloc_init_r10buf+0x60/0xa0 [raid10] [63010.938629] raid10_sync_request+0x756/0x1610 [raid10] [63010.943770] md_do_sync.cold+0x3e4/0x94c [63010.947698] md_thread+0xab/0x160 [63010.951024] ? md_write_inc+0x50/0x50 [63010.954688] kthread+0x149/0x170 [63010.957923] ? set_kthread_struct+0x40/0x40 [63010.962107] ret_from_fork+0x22/0x30 Removing the code that sets mddev->private to NULL in raid0 can fix problem. Fixes: `0c031fd37f` (md: Move alloc/free acct bioset in to personality) Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Christoph Hellwig	913cce5a1e	md: remove most calls to bdevname Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Guoqing Jiang	1e26774228	md: protect md_unregister_thread from reentrancy Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simultaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Guoqing Jiang	8b48ec23cc	md: don't unregister sync_thread with reconfig_mutex held Unregister sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure array. And it could cause deadlock problem for raid5 as follows: 1. process A tried to reap sync thread with reconfig_mutex held after echo idle to sync_action. 2. raid5 sync thread was blocked if there were too many active stripes. 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) which causes the number of active stripes can't be decreased. 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex. More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t And add one parameter to md_reap_sync_thread since it could be called by dm-raid which doesn't hold reconfig_mutex. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>	2022-05-22 23:07:21 -07:00
Jane Chu	047218ec90	dax: add .recovery_write dax_operation Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2022-05-16 13:37:59 -07:00
Jane Chu	e511c4a3d2	dax: introduce DAX_RECOVERY_WRITE dax access mode Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2022-05-16 13:35:56 -07:00
Mike Snitzer	ca522482e3	dm: pass NULL bdev to bio_alloc_clone Most DM targets will remap the clone bio passed to their ->map function using bio_set_dev(). So this change to pass NULL bdev to bio_alloc_clone avoids clone-time work that sets up resources for a bdev association that will not be used in practice (e.g. clone issued to underlying device will not use DM device's blk-cgroups resources). But clone->bi_bdev is still initialized following bio_alloc_clone to preserve DM target expectations that clone->bi_bdev will be set. Follow-up work is needed to audit DM targets to remove accesses to a clone->bi_bdev that the target didn't initialize with bio_set_dev(). Depends-on: `7ecc56c62b` ("block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-11 13:58:52 -04:00
Guo Zhengkui	d254c3699f	dm cache metadata: remove unnecessary variable in __dump_mapping Fix the following coccicheck warning: drivers/md/dm-cache-metadata.c:1512:5-6: Unneeded variable: "r". Return "0" on line 1520. Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 15:40:10 -04:00
Gabriel Krisman Bertazi	c06dfd124d	dm mpath: provide high-resolution timer to HST for bio-based The precision loss of reading IO start_time with jiffies_to_nsecs instead of using a high resolution timer degrades HST path prediction for BIO-based mpath on high load workloads. Below, I show the utilization percentage of a 10 disk multipath with asymmetrical disk access cost, while being exercised by a randwrite FIO benchmark with high submission queue depth (depth=64). It is possible to see that the HST path selection degrades heavily for high-iops in BIO-mpath, underutilizing the slower paths way beyond expected. This seems to be caused by the start_time truncation, which makes some IO to seem much slower than it actually is. In this scenario ST outperforms HST for bio-mpath, but not for mq-mpath, which already uses ktime_get_ns(). The third column shows utilization with this patch applied. It is easy to see that now HST prediction is much closer to the ideal distribution (calculated considering the real cost of each path). \| \| ST \| HST (orig) \| HST(ktime) \| Best \| \| sdd \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sde \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdf \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdg \| 0.06 \| 0.00 \| 0.06 \| 0.04 \| \| sdh \| 0.03 \| 0.00 \| 0.03 \| 0.02 \| \| sdi \| 0.03 \| 0.00 \| 0.03 \| 0.02 \| \| sdj \| 0.02 \| 0.00 \| 0.01 \| 0.01 \| \| sdk \| 0.02 \| 0.00 \| 0.01 \| 0.01 \| \| sdl \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| \| sdm \| 0.17 \| 0.20 \| 0.17 \| 0.18 \| This issue was originally discussed [1] when we first merged HST, and this patch was left as a low hanging fruit to be solved later. Regarding the implementation, as suggested by Mike in that mail thread, in order to avoid the overhead of ktime_get_ns for other selectors, this patch adds a flag for the selector code to request the high-resolution timer. I tested this using the same benchmark used in the original HST submission. 
Full test and benchmark scripts are available here: https://people.collabora.com/~krisman/HST-BIO-MPATH/ [1] https://lore.kernel.org/lkml/85tv0am9de.fsf@collabora.com/T/ Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [snitzer: cleaned up various implementation details] Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 15:39:23 -04:00
Mikulas Patocka	567dd8f345	dm crypt: make printing of the key constant-time The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor. Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:34:03 -04:00
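The branch-free nibble conversion described in the dm crypt commit above can be sketched as follows. This is a minimal userspace illustration, not the kernel source: the arithmetic is one way to map 0..15 to a hex digit without a lookup table or a conditional, and `key_to_hex()` is a hypothetical helper added here only to show the conversion applied to a key buffer.

```c
#include <stddef.h>

/*
 * Convert a value in the range 0..15 to a lowercase hex digit without
 * branching or table lookups.
 *
 * For c <= 9, (9 - c) is non-negative and smaller than 16, so the shifted
 * value is 0 and nothing is added beyond '0'.
 * For c >= 10, (9 - c) is negative; converted to unsigned it has all high
 * bits set, so after the shift the mask yields 0x27 (39), and
 * c + '0' + 39 == 'a' + (c - 10).
 */
static char hex2asc(unsigned char c)
{
    return (char)(c + '0' + ((((unsigned)(9 - c)) >> 4) & 0x27));
}

/*
 * Hypothetical helper (not from the commit): write the hex form of a key
 * into out, which must hold at least 2 * len + 1 bytes.
 */
static void key_to_hex(const unsigned char *key, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = hex2asc(key[i] >> 4);   /* high nibble */
        out[2 * i + 1] = hex2asc(key[i] & 0x0f); /* low nibble */
    }
    out[2 * len] = '\0';
}
```

Each output character depends on the key byte only through arithmetic, which is the property the commit message is after; whether a given compiler preserves that property is outside the scope of this sketch.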
Dan Carpenter	d3f2a14b89	dm integrity: fix error code in dm_integrity_ctr() The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this: drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r' Fixes: `7eada909bf` ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:14:00 -04:00
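The shadowing bug class fixed in the dm integrity commit above can be reproduced in a few lines. This is an illustrative sketch only: `failing_step()`, `ctr_shadowed()` and `ctr_fixed()` are hypothetical names, not code from drivers/md/dm-integrity.c.

```c
#include <errno.h>

/* Hypothetical setup step that always fails. */
static int failing_step(void)
{
    return -EINVAL;
}

/* Buggy shape: the inner "r" shadows the function-scope "r". */
static int ctr_shadowed(void)
{
    int r = 0;

    {
        int r = failing_step(); /* new variable, hides the outer r */
        if (r)
            goto out;
    }
out:
    return r; /* outer r was never written, so this reports success */
}

/* Fixed shape: assign to the function-scope variable instead. */
static int ctr_fixed(void)
{
    int r = 0;

    r = failing_step();
    if (r)
        goto out;
out:
    return r; /* propagates -EINVAL */
}
```

Compiling with `-Wshadow` flags the inner declaration in `ctr_shadowed()`.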
Mikulas Patocka	bfe2b0146c	dm stats: add cond_resched when looping over entries dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-09 12:11:07 -04:00
Mike Snitzer	4edadf6dcb	dm: improve abnormal bio processing Read/write/flush are the most common operations, optimize switch in is_abnormal_io() for those cases. Follows same pattern established in block perf-wip commit ("block: optimise blk_may_split for normal rw") Also, push is_abnormal_io() check and blk_queue_split() down from dm_submit_bio() to dm_split_and_process_bio() and set new 'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio and __process_abnormal_io by leveraging ci.is_abnormal_io flag. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Mike Snitzer	9d20653fe8	dm: simplify bio-based IO accounting further Now that io splitting is recorded prior to, or during, ->map IO accounting can happen immediately rather than defer until after bio splitting in dm_split_and_process_bio(). Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task member because there is no longer any need to wait for splitting to occur before accounting. Also move dm_io struct's 'flags' member to consolidate struct holes. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	ec211631ae	dm: put all polled dm_io instances into a single list Now that bio_split() isn't used by DM's bio splitting, it is a bit overkill to link dm_io into an hlist given there is only single dm_io in the list. Convert to using a single list for holding all dm_io instances associated with this bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	0f14d60a02	dm: improve dm_io reference counting Currently each dm_io's reference counter is grabbed before calling __map_bio(), this way isn't efficient since we can move this grabbing to initialization time inside alloc_io(). Meantime it becomes typical async io reference counter model: one is for submission side, the other is for completion side, and the io won't be completed until both sides are done. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
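The two-sided reference model mentioned in the dm_io reference counting commit above (one reference for the submission side, one for the completion side) can be sketched in userspace C11. The `toy_io` type and its functions are hypothetical stand-ins, and the initial count of 2 is an assumption drawn from the commit description rather than from the DM source.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an async I/O descriptor. */
struct toy_io {
    atomic_int refs; /* outstanding references */
    bool done;       /* set once the last reference is dropped */
};

/* Take both references up front: one for submission, one for completion. */
static void toy_io_init(struct toy_io *io)
{
    atomic_init(&io->refs, 2);
    io->done = false;
}

/* Drop one reference; whichever side drops the last one finishes the I/O. */
static void toy_io_put(struct toy_io *io)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    if (atomic_fetch_sub(&io->refs, 1) == 1)
        io->done = true;
}
```

With this shape the I/O cannot be finished while the submitter is still touching it, regardless of which side calls `toy_io_put()` first.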
Ming Lei	2e803cd99b	dm: don't grab target io reference in dm_zone_map_bio dm_zone_map_bio() is only called from __map_bio in which the io's reference is grabbed already, and the reference won't be released until the bio is submitted, so not necessary to do it dm_zone_map_bio any more. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:36 -04:00
Ming Lei	7dd76d1fee	dm: improve bio splitting and associated IO accounting The current DM code (ab)uses late assignment of dm_io->orig_bio (after __map_bio() returns and any bio splitting is complete) to indicate the FS bio has been processed and can be accounted. This results in awkward waiting until ->orig_bio is set in dm_submit_bio_remap(). Also the bio splitting was implemented using bio_split()+bio_chain() -- a well-worn pattern but it requires bio cloning purely for the benefit of more natural IO accounting. The bio_split() result was stored in ->orig_bio to represent the mapped part of the original FS bio. DM has switched to the bdev based IO accounting interface. DM's IO accounting can be implemented in terms of the original FS bio (now stored early in ->orig_bio) via access to its sectors/bio_op. And if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use new dm_io fields of .sector_offset & .sectors to allow IO accounting for split bios _without_ needing to clone a new bio to store in ->orig_bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	d3de6d1269	dm: switch to bdev based IO accounting interfaces DM splits flush with data into empty flush followed by bio with data payload, switch dm_io_acct() to use bdev_{start,end}_io_acct() to do this accounting more naturally (rather than temporarily changing the bio's bi_size). This will allow DM to more easily account bios that are split (in following commit). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	e6926ad0c9	dm: pass dm_io instance to dm_io_acct directly All the other 4 parameters are retrieved from the 'dm_io' instance, so it's not necessary to pass all four to dm_io_acct(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Ming Lei	b992b40dfc	dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct, so it isn't necessary to take one bio parameter for the two helpers. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	bdb34759a0	dm: use bio_sectors in dm_aceept_partial_bio Rename 'bi_size' to 'bio_sectors' given bi_size is being stored in sectors. Also, use bio_sectors() rather than open-coding it. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	e86f2b005a	dm: simplify basic targets Remove needless factoring and remap bi_sector regardless of bio_sectors() being non-zero. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:35 -04:00
Mike Snitzer	442761fd2b	dm: conditionally enable branching for less used features Use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00
Mike Snitzer	563a225c9f	dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio If a bio is marked REQ_NOWAIT optimize dm_submit_bio()'s dm_table RCU usage to dm_{get,put}_live_table_fast. DM core offers protection against blocking (via suspend) if REQ_NOWAIT. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-05-05 17:31:34 -04:00


7092 Commits