OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Xiao Ni	72adae23a7	md: Change active_io to percpu Now the type of active_io is atomic. It's used to count how many ios are in the submitting process and it's added and decreased very time. But it only needs to check if it's zero when suspending the raid. So we can switch atomic to percpu to improve the performance. After switching active_io to percpu type, we use the state of active_io to judge if the raid device is suspended. And we don't need to wake up ->sb_wait in md_handle_request anymore. It's done in the callback function which is registered when initing active_io. The argument mddev->suspended is only used to count how many users are trying to set raid to suspend state. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2023-02-01 08:32:58 -08:00
Xiao Ni	d19329133d	md: Factor out is_md_suspended helper This helper function will be used in next patch. It's easy for understanding. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2023-02-01 08:32:58 -08:00
Hou Tao	1d1f25bfda	md: don't update recovery_cp when curr_resync is ACTIVE Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise md may skip the resync of the first 3 sectors if the resync procedure is interrupted before the first calling of ->sync_request() as shown below: md_do_sync thread control thread // setup resync mddev->recovery_cp = 0 j = 0 mddev->curr_resync = MD_RESYNC_ACTIVE // e.g., set array as idle set_bit(MD_RECOVERY_INTR, &&mddev_recovery) // resync loop // check INTR before calling sync_request !test_bit(MD_RECOVERY_INTR, &mddev->recovery // resync interrupted // update recovery_cp from 0 to 3 // the resync of three 3 sectors will be skipped mddev->recovery_cp = 3 Fixes: `eac58d08d4` ("md: Use enum for overloaded magic numbers used by mddev->curr_resync") Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2023-02-01 08:32:57 -08:00
Joe Thornber	22c40e134c	dm cache: Add some documentation to dm-cache-background-tracker.h Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-01-30 14:20:04 -05:00
Joe Thornber	95ab80a8a0	dm cache: free background tracker's queued work in btracker_destroy Otherwise the kernel can BUG with: [ 2245.426978] ============================================================================= [ 2245.435155] BUG bt_work (Tainted: G B W ): Objects remaining in bt_work on __kmem_cache_shutdown() [ 2245.445233] ----------------------------------------------------------------------------- [ 2245.445233] [ 2245.454879] Slab 0x00000000b0ce2b30 objects=64 used=2 fp=0x000000000a3c6a4e flags=0x17ffffc0000200(slab\|node=0\|zone=2\|lastcpupid=0x1fffff) [ 2245.467300] CPU: 7 PID: 10805 Comm: lvm Kdump: loaded Tainted: G B W 6.0.0-rc2 #19 [ 2245.476078] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.5.6 10/06/2021 [ 2245.483646] Call Trace: [ 2245.486100] <TASK> [ 2245.488206] dump_stack_lvl+0x34/0x48 [ 2245.491878] slab_err+0x95/0xcd [ 2245.495028] __kmem_cache_shutdown.cold+0x31/0x136 [ 2245.499821] kmem_cache_destroy+0x49/0x130 [ 2245.503928] btracker_destroy+0x12/0x20 [dm_cache] [ 2245.508728] smq_destroy+0x15/0x60 [dm_cache_smq] [ 2245.513435] dm_cache_policy_destroy+0x12/0x20 [dm_cache] [ 2245.518834] destroy+0xc0/0x110 [dm_cache] [ 2245.522933] dm_table_destroy+0x5c/0x120 [dm_mod] [ 2245.527649] __dm_destroy+0x10e/0x1c0 [dm_mod] [ 2245.532102] dev_remove+0x117/0x190 [dm_mod] [ 2245.536384] ctl_ioctl+0x1a2/0x290 [dm_mod] [ 2245.540579] dm_ctl_ioctl+0xa/0x20 [dm_mod] [ 2245.544773] __x64_sys_ioctl+0x8a/0xc0 [ 2245.548524] do_syscall_64+0x5c/0x90 [ 2245.552104] ? syscall_exit_to_user_mode+0x12/0x30 [ 2245.556897] ? do_syscall_64+0x69/0x90 [ 2245.560648] ? do_syscall_64+0x69/0x90 [ 2245.564394] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 2245.569447] RIP: 0033:0x7fe52583ec6b ... [ 2245.646771] ------------[ cut here ]------------ [ 2245.651395] kmem_cache_destroy bt_work: Slab cache still has objects when called from btracker_destroy+0x12/0x20 [dm_cache] [ 2245.651408] WARNING: CPU: 7 PID: 10805 at mm/slab_common.c:478 kmem_cache_destroy+0x128/0x130 Found using: lvm2-testsuite --only "cache-single-split.sh" Ben bisected and found that commit `0495e337b7` ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete cleanup of its background tracker work objects. Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-01-30 14:20:04 -05:00
Mike Snitzer	c87791bcc4	dm: improve shrinker debug names Commit `e33c267ab7` ("mm: shrinkers: provide shrinkers with names") chose some fairly bad names for DM's shrinkers. Fixes: `e33c267ab7` ("mm: shrinkers: provide shrinkers with names") Signed-off-by : Mike Snitzer <snitzer@kernel.org>	2023-01-30 14:20:04 -05:00
Linus Torvalds	28cca23da7	hardening fixes for v6.2-rc6 - Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST - Reorganize gcc-plugin includes for GCC 13 - Silence bcache memcpy run-time false positive warnings -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmPUHrIWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJnuoEAClEu/z51wnPPyYKoVAeYbxurqG Wlf7Ti7tMEsNN/kbSRXbJV3wMeY2RvY/+9iHa7zySJ4p9GzhH6pJXpS14CdSONjr c/ESUEVxJqq5C/ICfC7w1l4my7MwZ/IKMyvZOZET8IWlvTSpe8SW4Xvkaloapl08 7GanyWB24wZKL8rqKv2D1vmVeE1SmPEhLncgknPs77jUmnH9AXDgN8bVxKzOZI+V fa5Mjzcg+ovI/i9e25+EX7GCJ6HxrBnPogB1UM46UmFk/TKY/20l0bJTHzqE8gdK iKYpcU+0kihG/JJEg3h95v3qBNJnbBks27t3am+YSbN45Nb4rhovCcEI7NzeHMSG grwYeWN7iAcL8gJuEtwFPieUsioi7iX3KlmL990XxviutDkcgGgmGK02m64fNngb 0LpVue/r0nnYK7WEW1BcZ4Cd8bJfyXoZj0hN/awDkma3vHbsLuYQ6VXCPezOET7u uaDWCkgKrKgjRUJQZEqVs3nrEkrIjhODyH7Aa24EwoioGYuqcsVO5frGeu21vY+A t52S3WKWcuF5Hr8dDrYHMSA6ntWHwaYN1gaxPaObK9KD4aS4uV/SnJZLmTftZMU8 P4fvBKYV8W2SU9Y4ppzaYRpaCVSH04rW1wyFjkCN7HtCBIGc5QFLW9v3F3QLe7lW 1YucSUHg0S1plIKgGQ== =0orM -----END PGP SIGNATURE----- Merge tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST - Reorganize gcc-plugin includes for GCC 13 - Silence bcache memcpy run-time false positive warnings * tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: bcache: Silence memcpy() run-time false positive warnings gcc-plugins: Reorganize gimple includes for GCC 13 kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST	2023-01-27 16:09:12 -08:00
Kees Cook	be0d8f48ad	bcache: Silence memcpy() run-time false positive warnings struct bkey has internal padding in a union, but it isn't always named the same (e.g. key ## _pad, key_p, etc). This makes it extremely hard for the compiler to reason about the available size of copies done against such keys. Use unsafe_memcpy() for now, to silence the many run-time false positive warnings: memcpy: detected field-spanning write (size 264) of single field "&i->j" at drivers/md/bcache/journal.c:152 (size 240) memcpy: detected field-spanning write (size 24) of single field "&b->key" at drivers/md/bcache/btree.c:939 (size 16) memcpy: detected field-spanning write (size 24) of single field "&temp.key" at drivers/md/bcache/extents.c:428 (size 16) Reported-by: Alexandre Pereira <alexpereira@disroot.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216785 Acked-by: Coly Li <colyli@suse.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-bcache@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230106060229.never.047-kees@kernel.org	2023-01-25 12:24:50 -08:00
Adrian Huang	b0907cadab	md: fix incorrect declaration about claim_rdev in md_import_device Commit `fb541ca4c3` ("md: remove lock_bdev / unlock_bdev") removes wrappers for blkdev_get/blkdev_put. However, the uninitialized local static variable of pointer type 'claim_rdev' in md_import_device() is NULL, which leads to the following warning call trace: WARNING: CPU: 22 PID: 1037 at block/bdev.c:577 bd_prepare_to_claim+0x131/0x150 CPU: 22 PID: 1037 Comm: mdadm Not tainted 6.2.0-rc3+ #69 .. RIP: 0010:bd_prepare_to_claim+0x131/0x150 .. Call Trace: <TASK> ? _raw_spin_unlock+0x15/0x30 ? iput+0x6a/0x220 blkdev_get_by_dev.part.0+0x4b/0x300 md_import_device+0x126/0x1d0 new_dev_store+0x184/0x240 md_attr_store+0x80/0xf0 kernfs_fop_write_iter+0x128/0x1c0 vfs_write+0x2be/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc It turns out the md device cannot be used: md: could not open device unknown-block(259,0). md: md127 stopped. Fix the issue by declaring the local static variable of struct type and passing the pointer of the variable to blkdev_get_by_dev(). Fixes: `fb541ca4c3` ("md: remove lock_bdev / unlock_bdev") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2023-01-12 10:42:16 -08:00
Gustavo A. R. Silva	b942a520d9	bcache: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper Zero-length arrays are deprecated and we are moving towards adopting C99 flexible-array members, instead. So, replace zero-length arrays declarations in anonymous union with the new DECLARE_FLEX_ARRAY() helper macro. This helper allows for flexible-array members in unions. Link: https://github.com/KSPP/linux/issues/193 Link: https://github.com/KSPP/linux/issues/213 Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2023-01-05 17:48:45 -06:00
Jens Axboe	613b14884b	block: handle bio_split_to_limits() NULL return This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-04 09:05:23 -07:00
Linus Torvalds	8715c6d310	- Fix use-after-free races due to missing resource cleanup during DM target destruction in DM targets: thin-pool, cache, integrity and clone. - Fix ABBA deadlocks in DM thin-pool and cache targets due to their use of a bufio client (that has a shrinker whose locking can cause the incorrect locking order). - Fix DM cache target to set its needs_check flag after first aborting the metadata (whereby using reset persistent-data objects to update the superblock with, otherwise the superblock update could be dropped due to aborting metadata). This was found with code-inspection when comparing with the equivalent in DM thinp code. - Fix DM thin-pool's presume to continue resuming the device even if the pool in is fail mode -- otherwise bios may never be failed up the IO stack (which will prevent resetting the thin-pool target via table reload) - Fix DM thin-pool's metadata to use proper btree root (from previous transaction) if metadata commit failed. - Add 'waitfor' module param to DM module (dm_mod) to allow dm-init to wait for the specified device before continuing with its DM target initialization. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmOXp8IACgkQxSPxCi2d A1rS2wf/VF82dcQpsAAT05L6j6xqo/eUe3Q3AZYbJGRl1Kg41NE/YRA1A/lsxykO j3wZut5UyL0ZEsjHhhUMC8rOeexfyqiecVdMofuQRkKJYa4l85IbPBE2qOMi/Ida WMsZg8bmm01hDPpZRUwj7L5BRDJlB0n9hdtkh16K0gboiaPZxmIEtum0IYr8krPj hOv6amx9MhrsiUcVWylFipmxTgpsxwwg67g2R6tCGUmHQhxHKFInvySrYdv5hYdy 3kr1pxy87Wn3oETrHNNc506qh6QK85mOC+D/VdNNzNnM6yhgOEkE9lSS/Er2rHyT D6Af4j5SIBy/srv5DjcWd7pi4NHIBA== =fuDb -----END PGP SIGNATURE----- Merge tag 'for-6.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix use-after-free races due to missing resource cleanup during DM target destruction in DM targets: thin-pool, cache, integrity and clone. - Fix ABBA deadlocks in DM thin-pool and cache targets due to their use of a bufio client (that has a shrinker whose locking can cause the incorrect locking order). - Fix DM cache target to set its needs_check flag after first aborting the metadata (whereby using reset persistent-data objects to update the superblock with, otherwise the superblock update could be dropped due to aborting metadata). This was found with code-inspection when comparing with the equivalent in DM thinp code. - Fix DM thin-pool's presume to continue resuming the device even if the pool in is fail mode -- otherwise bios may never be failed up the IO stack (which will prevent resetting the thin-pool target via table reload) - Fix DM thin-pool's metadata to use proper btree root (from previous transaction) if metadata commit failed. - Add 'waitfor' module param to DM module (dm_mod) to allow dm-init to wait for the specified device before continuing with its DM target initialization. * tag 'for-6.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: Use last transaction's pmd->root when commit failed dm init: add dm-mod.waitfor to wait for asynchronously probed block devices dm ioctl: fix a couple ioctl codes dm ioctl: a small code cleanup in list_version_get_info dm thin: resume even if in FAIL mode dm cache: set needs_check flag after aborting metadata dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abort dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata dm integrity: Fix UAF in dm_integrity_dtr() dm cache: Fix UAF in destroy() dm clone: Fix UAF in clone_dtr() dm thin: Fix UAF in run_timer_softirq()	2022-12-13 10:58:09 -08:00
Linus Torvalds	ce8a79d560	for-6.2/block-2022-12-08 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr wxzFGfvHfw== =k/vv -----END PGP SIGNATURE----- Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...	2022-12-13 10:43:59 -08:00
Linus Torvalds	268325bda5	Random number generator updates for Linux 6.2-rc1. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE= =QRhK -----END PGP SIGNATURE----- Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...	2022-12-12 16:22:22 -08:00
Zhihao Cheng	7991dbff68	dm thin: Use last transaction's pmd->root when commit failed Recently we found a softlock up problem in dm thin pool btree lookup code due to corrupted metadata: Kernel panic - not syncing: softlockup: hung tasks CPU: 7 PID: 2669225 Comm: kworker/u16:3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: <IRQ> dump_stack+0x9c/0xd3 panic+0x35d/0x6b9 watchdog_timer_fn.cold+0x16/0x25 __run_hrtimer+0xa2/0x2d0 </IRQ> RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio] __bufio_new+0x11f/0x4f0 [dm_bufio] new_read+0xa3/0x1e0 [dm_bufio] dm_bm_read_lock+0x33/0xd0 [dm_persistent_data] ro_step+0x63/0x100 [dm_persistent_data] btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data] dm_btree_lookup+0x16f/0x210 [dm_persistent_data] dm_thin_find_block+0x12c/0x210 [dm_thin_pool] __process_bio_read_only+0xc5/0x400 [dm_thin_pool] process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool] process_one_work+0x3c5/0x730 Following process may generate a broken btree mixed with fresh and stale btree nodes, which could get dm thin trapped in an infinite loop while looking up data block: Transaction 1: pmd->root = A, A->B->C // One path in btree pmd->root = X, X->Y->Z // Copy-up Transaction 2: X,Z is updated on disk, Y write failed. // Commit failed, dm thin becomes read-only. process_bio_read_only dm_thin_find_block __find_block dm_btree_lookup(pmd->root) The pmd->root points to a broken btree, Y may contain stale node pointing to any block, for example X, which gets dm thin trapped into a dead loop while looking up Z. Fix this by setting pmd->root in __open_metadata(), so that dm thin will use the last transaction's pmd->root if commit failed. Fetch a reproducer in [Link]. Linke: https://bugzilla.kernel.org/show_bug.cgi?id=216790 Cc: stable@vger.kernel.org Fixes: `991d9fa02d` ("dm: add thin provisioning target") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-08 12:17:09 -05:00
Christoph Hellwig	c34b7ac650	block: remove bio_set_op_attrs This macro is obsolete, so replace the last few uses with open coded bi_opf assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-12-07 09:43:12 -07:00
Peter Korsgaard	035641b01e	dm init: add dm-mod.waitfor to wait for asynchronously probed block devices Just calling wait_for_device_probe() is not enough to ensure that asynchronously probed block devices are available (E.G. mmc, usb), so add a "dm-mod.waitfor=<device1>[,..,<deviceN>]" parameter to get dm-init to explicitly wait for specific block devices before initializing the tables with logic similar to the rootwait logic that was introduced with commit `cc1ed7542c` ("init: wait for asynchronously scanned block devices"). E.G. with dm-verity on mmc using: dm-mod.waitfor="PARTLABEL=hash-a,PARTLABEL=root-a" [ 0.671671] device-mapper: init: waiting for all devices to be available before creating mapped devices [ 0.671679] device-mapper: init: waiting for device PARTLABEL=hash-a ... [ 0.710695] mmc0: new HS200 MMC card at address 0001 [ 0.711158] mmcblk0: mmc0:0001 004GA0 3.69 GiB [ 0.715954] mmcblk0boot0: mmc0:0001 004GA0 partition 1 2.00 MiB [ 0.722085] mmcblk0boot1: mmc0:0001 004GA0 partition 2 2.00 MiB [ 0.728093] mmcblk0rpmb: mmc0:0001 004GA0 partition 3 512 KiB, chardev (249:0) [ 0.738274] mmcblk0: p1 p2 p3 p4 p5 p6 p7 [ 0.751282] device-mapper: init: waiting for device PARTLABEL=root-a ... [ 0.751306] device-mapper: init: all devices available [ 0.751683] device-mapper: verity: sha256 using implementation "sha256-generic" [ 0.759344] device-mapper: ioctl: dm-0 (vroot) is ready [ 0.766540] VFS: Mounted root (squashfs filesystem) readonly on device 254:0. Signed-off-by: Peter Korsgaard <peter@korsgaard.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-02 17:37:45 -05:00
Christoph Hellwig	b5c1acf012	md: fold unbind_rdev_from_array into md_kick_rdev_from_array unbind_rdev_from_array is only called from md_kick_rdev_from_array, so merge it into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-12-02 11:21:01 -08:00
Christoph Hellwig	d57d9d6965	md: mark md_kick_rdev_from_array static md_kick_rdev_from_array is only used in md.c, so unexport it and mark the symbol static. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-12-02 11:21:01 -08:00
Christoph Hellwig	fb541ca4c3	md: remove lock_bdev / unlock_bdev These wrappers for blkdev_get / blkdev_put just horribly confuse the code with their odd naming. Remove them and improve the error unwinding in md_import_device with the now folded code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-12-02 11:21:01 -08:00
Mikulas Patocka	b52c3de84b	dm ioctl: fix a couple ioctl codes Change the ioctl codes from DM_DEV_ARM_POLL to DM_DEV_ARM_POLL_CMD and from DM_GET_TARGET_VERSION to DM_GET_TARGET_VERSION_CMD. Note that the "cmd" field of "struct _ioctls" is never used, thus this commit doesn't fix any bug, it just makes the code consistent. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:43:41 -05:00
Mikulas Patocka	d043f9a1ca	dm ioctl: a small code cleanup in list_version_get_info No need to modify info->vers if we overwrite it immediately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:43:41 -05:00
Luo Meng	19eb1650af	dm thin: resume even if in FAIL mode If a thinpool set fail_io while suspending, resume will fail with: device-mapper: resume ioctl on vg-thinpool failed: Invalid argument The thin-pool also can't be removed if an in-flight bio is in the deferred list. This can be easily reproduced using: echo "offline" > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1 dmsetup suspend /dev/mapper/pool mkfs.ext4 /dev/mapper/thin dmsetup resume /dev/mapper/pool The root cause is maybe_resize_data_dev() will check fail_io and return error before called dm_resume. Fix this by adding FAIL mode check at the end of pool_preresume(). Cc: stable@vger.kernel.org Fixes: `da105ed5fd` ("dm thin metadata: introduce dm_pool_abort_metadata") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:43:41 -05:00
Mike Snitzer	6b9973861c	dm cache: set needs_check flag after aborting metadata Otherwise the commit that will be aborted will be associated with the metadata objects that will be torn down. Must write needs_check flag to metadata with a reset block manager. Found through code-inspection (and compared against dm-thin.c). Cc: stable@vger.kernel.org Fixes: `028ae9f76f` ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:43:41 -05:00
Mike Snitzer	352b837a55	dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abort Same ABBA deadlock pattern fixed in commit 4b60f452ec51 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") to DM-cache's metadata. Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: stable@vger.kernel.org Fixes: `028ae9f76f` ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:42:45 -05:00
Zhihao Cheng	8111964f1b	dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata Following concurrent processes: P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_read_bh_nowait submit_bh dm_submit_bio do_worker process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata down_write(&pmd->root_lock) - LOCK B __destroy_persistent_data_objects dm_block_manager_destroy dm_bufio_client_destroy unregister_shrinker down_write(&shrinker_rwsem) thin_map \| dm_thin_find_block ↓ down_read(&pmd->root_lock) --> ABBA deadlock , which triggers hung task: [ 76.974820] INFO: task kworker/u4:3:63 blocked for more than 15 seconds. [ 76.976019] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.978521] task:kworker/u4:3 state:D stack:0 pid:63 ppid:2 [ 76.978534] Workqueue: dm-thin do_worker [ 76.978552] Call Trace: [ 76.978564] __schedule+0x6ba/0x10f0 [ 76.978582] schedule+0x9d/0x1e0 [ 76.978588] rwsem_down_write_slowpath+0x587/0xdf0 [ 76.978600] down_write+0xec/0x110 [ 76.978607] unregister_shrinker+0x2c/0xf0 [ 76.978616] dm_bufio_client_destroy+0x116/0x3d0 [ 76.978625] dm_block_manager_destroy+0x19/0x40 [ 76.978629] __destroy_persistent_data_objects+0x5e/0x70 [ 76.978636] dm_pool_abort_metadata+0x8e/0x100 [ 76.978643] metadata_operation_failed+0x86/0x110 [ 76.978649] commit+0x6a/0x230 [ 76.978655] do_worker+0xc6e/0xd90 [ 76.978702] process_one_work+0x269/0x630 [ 76.978714] worker_thread+0x266/0x630 [ 76.978730] kthread+0x151/0x1b0 [ 76.978772] INFO: task test.sh:2646 blocked for more than 15 seconds. [ 76.979756] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.982111] task:test.sh state:D stack:0 pid:2646 ppid:2459 [ 76.982128] Call Trace: [ 76.982139] __schedule+0x6ba/0x10f0 [ 76.982155] schedule+0x9d/0x1e0 [ 76.982159] rwsem_down_read_slowpath+0x4f4/0x910 [ 76.982173] down_read+0x84/0x170 [ 76.982177] dm_thin_find_block+0x4c/0xd0 [ 76.982183] thin_map+0x201/0x3d0 [ 76.982188] __map_bio+0x5b/0x350 [ 76.982195] dm_submit_bio+0x2b6/0x930 [ 76.982202] __submit_bio+0x123/0x2d0 [ 76.982209] submit_bio_noacct_nocheck+0x101/0x3e0 [ 76.982222] submit_bio_noacct+0x389/0x770 [ 76.982227] submit_bio+0x50/0xc0 [ 76.982232] submit_bh_wbc+0x15e/0x230 [ 76.982238] submit_bh+0x14/0x20 [ 76.982241] ext4_read_bh_nowait+0xc5/0x130 [ 76.982247] ext4_read_block_bitmap_nowait+0x340/0xc60 [ 76.982254] ext4_mb_init_cache+0x1ce/0xdc0 [ 76.982259] ext4_mb_load_buddy_gfp+0x987/0xfa0 [ 76.982263] ext4_discard_preallocations+0x45d/0x830 [ 76.982274] ext4_clear_inode+0x48/0xf0 [ 76.982280] ext4_evict_inode+0xcf/0xc70 [ 76.982285] evict+0x119/0x2b0 [ 76.982290] dispose_list+0x43/0xa0 [ 76.982294] prune_icache_sb+0x64/0x90 [ 76.982298] super_cache_scan+0x155/0x210 [ 76.982303] do_shrink_slab+0x19e/0x4e0 [ 76.982310] shrink_slab+0x2bd/0x450 [ 76.982317] drop_slab+0xcc/0x1a0 [ 76.982323] drop_caches_sysctl_handler+0xb7/0xe0 [ 76.982327] proc_sys_call_handler+0x1bc/0x300 [ 76.982331] proc_sys_write+0x17/0x20 [ 76.982334] vfs_write+0x3d3/0x570 [ 76.982342] ksys_write+0x73/0x160 [ 76.982347] __x64_sys_write+0x1e/0x30 [ 76.982352] do_syscall_64+0x35/0x80 [ 76.982357] entry_SYSCALL_64_after_hwframe+0x63/0xcd Function metadata_operation_failed() is called when operations failed on dm pool metadata, dm pool will destroy and recreate metadata. So, shrinker will be unregistered and registered, which could down write shrinker_rwsem under pmd_write_lock. Fix it by allocating dm_block_manager before locking pmd->root_lock and destroying old dm_block_manager after unlocking pmd->root_lock, then old dm_block_manager is replaced with new dm_block_manager under pmd->root_lock. So, shrinker register/unregister could be done without holding pmd->root_lock. Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216676 Cc: stable@vger.kernel.org #v5.2+ Fixes: `e49e582965` ("dm thin: add read only and fail io modes") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-12-01 11:41:40 -05:00
Luo Meng	f50cb2cbab	dm integrity: Fix UAF in dm_integrity_dtr() Dm_integrity also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in dm_integrity_dtr(). Cc: stable@vger.kernel.org Fixes: `7eada909bf` ("dm: add integrity target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-30 13:29:34 -05:00
Luo Meng	6a459d8edb	dm cache: Fix UAF in destroy() Dm_cache also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in destroy(). Cc: stable@vger.kernel.org Fixes: `c6b4fcbad0` ("dm: add cache target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-30 13:29:34 -05:00
Luo Meng	e4b5957c6f	dm clone: Fix UAF in clone_dtr() Dm_clone also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in clone_dtr(). Cc: stable@vger.kernel.org Fixes: `7431b7835f` ("dm: add clone target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-30 13:29:34 -05:00
Luo Meng	88430ebcbc	dm thin: Fix UAF in run_timer_softirq() When dm_resume() and dm_destroy() are concurrent, it will lead to UAF, as follows: BUG: KASAN: use-after-free in __run_timers+0x173/0x710 Write of size 8 at addr ffff88816d9490f0 by task swapper/0/0 <snip> Call Trace: <IRQ> dump_stack_lvl+0x73/0x9f print_report.cold+0x132/0xaa2 _raw_spin_lock_irqsave+0xcd/0x160 __run_timers+0x173/0x710 kasan_report+0xad/0x110 __run_timers+0x173/0x710 __asan_store8+0x9c/0x140 __run_timers+0x173/0x710 call_timer_fn+0x310/0x310 pvclock_clocksource_read+0xfa/0x250 kvm_clock_read+0x2c/0x70 kvm_clock_get_cycles+0xd/0x20 ktime_get+0x5c/0x110 lapic_next_event+0x38/0x50 clockevents_program_event+0xf1/0x1e0 run_timer_softirq+0x49/0x90 __do_softirq+0x16e/0x62c __irq_exit_rcu+0x1fa/0x270 irq_exit_rcu+0x12/0x20 sysvec_apic_timer_interrupt+0x8e/0xc0 One of the concurrency UAF can be shown as below: use free do_resume \| __find_device_hash_cell \| dm_get \| atomic_inc(&md->holders) \| \| dm_destroy \| __dm_destroy \| if (!dm_suspended_md(md)) \| atomic_read(&md->holders) \| msleep(1) dm_resume \| __dm_resume \| dm_table_resume_targets \| pool_resume \| do_waker #add delay work \| dm_put \| atomic_dec(&md->holders) \| \| dm_table_destroy \| pool_dtr \| __pool_dec \| __pool_destroy \| destroy_workqueue \| kfree(pool) # free pool time out __do_softirq run_timer_softirq # pool has already been freed This can be easily reproduced using: 1. create thin-pool 2. dmsetup suspend pool 3. dmsetup resume pool 4. dmsetup remove_all # Concurrent with 3 The root cause of this UAF bug is that dm_resume() adds timer after dm_destroy() skips cancelling the timer because of suspend status. After timeout, it will call run_timer_softirq(), however pool has already been freed. The concurrency UAF bug will happen. Therefore, cancelling timer again in __pool_destroy(). Cc: stable@vger.kernel.org Fixes: `991d9fa02d` ("dm: add thin provisioning target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-30 13:29:34 -05:00
Christoph Hellwig	fce3caea0f	blk-crypto: don't use struct request_queue for public interfaces Switch all public blk-crypto interfaces to use struct block_device arguments to specify the device they operate on instead of th request_queue, which is a block layer implementation detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-21 11:39:05 -07:00
Linus Torvalds	f4408c3dfc	block-6.1-2022-11-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmN38ZUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgXxD/9tUSFUKIVGIn4pmNILfY3XV45HOi1w44yR zCxCELupcBeT+YixmaJcT8sunrrg2fLPOXMrDJk1cG/izXHzkjAQsHZvERfqC7hC f5onH+2MyGm3qBwxV0iGqITJgTwQGInVJijT4f9UZd/8ultymyZR2nOdIdIydHCF qzlOjq6hgIuGKHhFgOqRUg/OAkx510ZEEilUDcZ6XVV+zL7ccN6J9+eNTI3c58wT 7jvxZC4u6QGKteGvVniE3WXgk3QdFiQRORvV09g+PkbG/vPjAIZ5tJFb9PdIOebD 3guDiNUasgz2vnDetMK+yk4LcedcRfWnqgn+Vm8C26j5Fxs13eDx5kMDteVy7CYh 3bokOATHohoZZ9qTApgQUswTfGJfBdoy0nUTPuffxPdKDyUPteIxFCADcnyDHnDG d/+PjU3FKF31o2HcUfvYp7OMO0VZP0hJSWps8znoVXKxb+LH9qKkYzHVlfni5kkS k9XqqD1Ki98Erb346YqgvQjCkz+CUd5DxtGyh9Oh2+oS2qHP6WjdKo1QPFmWD5dp EyXGSqGoZrIPtnKohLUN9EiVXanRQWJr3L0gw2CYXpmwfSKfMC3CQraEC1jOc01l TfsLJGbl3L5XpLzxoBwDu44cqp+VvbalergdcmsDTLDFHhONY2g5LJh6C9/EDdnQ Cde1uHikGw== =sOGG -----END PGP SIGNATURE----- Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira) - Memory leak fix in nvmet (Sagi Grimberg) - Regression fix for block cgroups pinning the wrong blkcg, causing leaks of cgroups and blkcgs (Chris) - UAF fix for drbd setup error handling (Dan) - Fix DMA alignment propagation in DM (Keith) * tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux: dm-log-writes: set dma_alignment limit in io_hints dm-integrity: set dma_alignment limit in io_hints block: make blk_set_default_limits() private dm-crypt: provide dma_alignment limit in io_hints block: make dma_alignment a stacking queue_limit nvmet: fix a memory leak in nvmet_auth_set_key nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000 drbd: use after free in drbd_create_device() nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro blk-cgroup: properly pin the parent in blkcg_css_online	2022-11-18 13:59:45 -08:00
Mikulas Patocka	984bf2cc53	dm integrity: clear the journal on suspend There was a problem that a user burned a dm-integrity image on CDROM and could not activate it because it had a non-empty journal. Fix this problem by flushing the journal (done by the previous commit) and clearing the journal (done by this commit). Once the journal is cleared, dm-integrity won't attempt to replay it on the next activation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-18 11:05:09 -05:00
Mikulas Patocka	5e5dab5ec7	dm integrity: flush the journal on suspend This commit flushes the journal on suspend. It is prerequisite for the next commit that enables activating dm integrity devices in read-only mode. Note that we deliberately didn't flush the journal on suspend, so that the journal replay code would be tested. However, the dm-integrity code is 5 years old now, so that journal replay is well-tested, and we can make this change now. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-18 10:57:17 -05:00
Zhihao Cheng	0dfc1f4cea	dm bufio: Fix missing decrement of no_sleep_enabled if dm_bufio_client_create failed The 'no_sleep_enabled' should be decreased in error handling path in dm_bufio_client_create() when the DM_BUFIO_CLIENT_NO_SLEEP flag is set, otherwise static_branch_unlikely() will always return true even if no dm_bufio_client instances have DM_BUFIO_CLIENT_NO_SLEEP flag set. Cc: stable@vger.kernel.org Fixes: `3c1c875d05` ("dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-18 10:23:55 -05:00
Mikulas Patocka	4fe1ec9954	dm ioctl: fix misbehavior if list_versions races with module loading __list_versions will first estimate the required space using the "dm_target_iterate(list_version_get_needed, &needed)" call and then will fill the space using the "dm_target_iterate(list_version_get_info, &iter_info)" call. Each of these calls locks the targets using the "down_read(&_lock)" and "up_read(&_lock)" calls, however between the first and second "dm_target_iterate" there is no lock held and the target modules can be loaded at this point, so the second "dm_target_iterate" call may need more space than what was the first "dm_target_iterate" returned. The code tries to handle this overflow (see the beginning of list_version_get_info), however this handling is incorrect. The code sets "param->data_size = param->data_start + needed" and "iter_info.end = (char )vers+len" - "needed" is the size returned by the first dm_target_iterate call; "len" is the size of the buffer allocated by userspace. "len" may be greater than "needed"; in this case, the code will write up to "len" bytes into the buffer, however param->data_size is set to "needed", so it may write data past the param->data_size value. The ioctl interface copies only up to param->data_size into userspace, thus part of the result will be truncated. Fix this bug by setting "iter_info.end = (char )vers + needed;" - this guarantees that the second "dm_target_iterate" call will write only up to the "needed" buffer and it will exit with "DM_BUFFER_FULL_FLAG" if it overflows the "needed" space - in this case, userspace will allocate a larger buffer and retry. Note that there is also a bug in list_version_get_needed - we need to add "strlen(tt->name) + 1" to the needed size, not "strlen(tt->name)". Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-11-18 10:23:55 -05:00
Jason A. Donenfeld	8032bf1233	treewide: use get_random_u32_below() instead of deprecated function This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>	2022-11-18 02:15:15 +01:00
Keith Busch	50a893359c	dm-log-writes: set dma_alignment limit in io_hints This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-6-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:58:11 -07:00
Keith Busch	29aa778bb6	dm-integrity: set dma_alignment limit in io_hints This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:58:11 -07:00
Keith Busch	86e4d3e8d1	dm-crypt: provide dma_alignment limit in io_hints This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Link: https://lore.kernel.org/linux-block/20221101001558.648ee024@xps.demsh.org/ Fixes: `b1a000d3b8` ("block: relax direct io memory alignment") Reportred-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Dmitrii Tcvetkov <me@demsh.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:58:11 -07:00
Christoph Hellwig	1a581b7216	dm: track per-add_disk holder relations in DM dm is a bit special in that it opens the underlying devices. Commit `89f871af1b` ("dm: delay registering the gendisk") tried to accommodate that by allowing to add the holder to the list before add_gendisk and then just add them to sysfs once add_disk is called. But that leads to really odd lifetime problems and error handling problems as we can't know the state of the kobjects and don't unwind properly. To fix this switch to just registering all existing table_devices with the holder code right after add_disk, and remove them before calling del_gendisk. Fixes: `89f871af1b` ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Yu Kuai	d563792c89	dm: make sure create and remove dm device won't race with open and close table open_table_device() and close_table_device() is protected by table_devices_lock, hence use it to protect add_disk() and del_gendisk(). Prepare to track per-add_disk holder relations in dm. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Christoph Hellwig	7b5865831c	dm: cleanup close_table_device Take the list unlink and free into close_table_device so that no half torn down table_devices exist. Also remove the check for a NULL bdev as that can't happen - open_table_device never adds a table_device to the list that does not have a valid block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Christoph Hellwig	b9a785d2dc	dm: cleanup open_table_device Move all the logic for allocation the table_device and linking it into the list into the open_table_device. This keeps the code tidy and ensures that the table_devices only exist in fully initialized state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Christoph Hellwig	992ec6a92a	dm: remove free_table_devices free_table_devices just warns and frees all table_device structures when the target removal did not remove them. This should never happen, but if it did, just freeing the structure without deleting them from the list or cleaning up the resources would not help at all. So just WARN on a non-empty list instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-11-16 15:19:56 -07:00
Jiang Li	b611ad1400	md/raid1: stop mdx_raid1 thread when raid1 array run failed fail run raid1 array when we assemble array with the inactive disk only, but the mdx_raid1 thread were not stop, Even if the associated resources have been released. it will caused a NULL dereference when we do poweroff. This causes the following Oops: [ 287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070 [ 287.594762] #PF: supervisor read access in kernel mode [ 287.599912] #PF: error_code(0x0000) - not-present page [ 287.605061] PGD 0 P4D 0 [ 287.607612] Oops: 0000 [#1] SMP NOPTI [ 287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0 [ 287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022 [ 287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod] [ 287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ...... [ 287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202 [ 287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000 [ 287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800 [ 287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff [ 287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800 [ 287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500 [ 287.692052] FS: 0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 287.700149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0 [ 287.713033] Call Trace: [ 287.715498] raid1d+0x6c/0xbbb [raid1] [ 287.719256] ? __schedule+0x1ff/0x760 [ 287.722930] ? schedule+0x3b/0xb0 [ 287.726260] ? schedule_timeout+0x1ed/0x290 [ 287.730456] ? __switch_to+0x11f/0x400 [ 287.734219] md_thread+0xe9/0x140 [md_mod] [ 287.738328] ? md_thread+0xe9/0x140 [md_mod] [ 287.742601] ? wait_woken+0x80/0x80 [ 287.746097] ? md_register_thread+0xe0/0xe0 [md_mod] [ 287.751064] kthread+0x11a/0x140 [ 287.754300] ? kthread_park+0x90/0x90 [ 287.757974] ret_from_fork+0x1f/0x30 In fact, when raid1 array run fail, we need to do md_unregister_thread() before raid1_free(). Signed-off-by: Jiang Li <jiang.li@ugreen.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 10:15:35 -08:00
Christoph Hellwig	ad831a16b0	md/raid5: use bdev_write_cache instead of open coding it Use the bdev_write_cache instead of two equivalent open coded checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 10:15:35 -08:00
Mikulas Patocka	341097ee53	md: fix a crash in mempool_free There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh. The reason for the crash is this: * super_written calls atomic_dec_and_test(&mddev->pending_writes) and wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev) and bio_put(bio). * so, the process that waited on sb_wait and that is woken up is racing with bio_put(bio). * if the process wins the race, it calls bioset_exit before bio_put(bio) is executed. * bio_put(bio) attempts to free a bio into a destroyed bio set - causing a crash in mempool_free. We fix this bug by moving bio_put before atomic_dec_and_test. We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown. The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 11557f0067 P4D 11557f0067 PUD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: kdelayd flush_expired_bios [dm_delay] RIP: 0010:mempool_free+0x47/0x80 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05 FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0 Call Trace: <TASK> clone_endio+0xf4/0x1c0 [dm_mod] clone_endio+0xf4/0x1c0 [dm_mod] __submit_bio+0x76/0x120 submit_bio_noacct_nocheck+0xb6/0x2a0 flush_expired_bios+0x28/0x2f [dm_delay] process_one_work+0x1b4/0x300 worker_thread+0x45/0x3e0 ? rescuer_thread+0x380/0x380 kthread+0xc2/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd] CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 10:15:35 -08:00
Xiao Ni	8e1a2279ca	md/raid0, raid10: Don't set discard sectors for request queue It should use disk_stack_limits to get a proper max_discard_sectors rather than setting a value by stack drivers. And there is a bug. If all member disks are rotational devices, raid0/raid10 set max_discard_sectors. So the member devices are not ssd/nvme, but raid0/raid10 export the wrong value. It reports warning messages in function __blkdev_issue_discard when mkfs.xfs like this: [ 4616.022599] ------------[ cut here ]------------ [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0 [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0 [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7 [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246 [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080 [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080 [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000 [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000 [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080 [ 4616.213259] FS: 00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000 [ 4616.222298] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0 [ 4616.236689] Call Trace: [ 4616.239428] blkdev_issue_discard+0x52/0xb0 [ 4616.244108] blkdev_common_ioctl+0x43c/0xa00 [ 4616.248883] blkdev_ioctl+0x116/0x280 [ 4616.252977] __x64_sys_ioctl+0x8a/0xc0 [ 4616.257163] do_syscall_64+0x5c/0x90 [ 4616.261164] ? handle_mm_fault+0xc5/0x2a0 [ 4616.265652] ? do_user_addr_fault+0x1d8/0x690 [ 4616.270527] ? do_syscall_64+0x69/0x90 [ 4616.274717] ? exc_page_fault+0x62/0x150 [ 4616.279097] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 4616.284748] RIP: 0033:0x7f9a55398c6b Signed-off-by: Xiao Ni <xni@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 10:15:34 -08:00
Florian-Ewald Mueller	4555211190	md/bitmap: Fix bitmap chunk size overflow issues - limit bitmap chunk size internal u64 variable to values not overflowing the u32 bitmap superblock structure variable stored on persistent media - assign bitmap chunk size internal u64 variable from unsigned values to avoid possible sign extension artifacts when assigning from a s32 value The bug has been there since at least kernel 4.0. Steps to reproduce it: 1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2 2 resize member device rnbd1 and rnbd2 to 8 TB 3 mdadm --grow /dev/mdx --size=max The bitmap_chunksize will overflow without patch. Cc: stable@vger.kernel.org Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 09:35:50 -08:00
Ye Bin	f97a5528b2	md: introduce md_ro_state Introduce md_ro_state for mddev->ro, so it is easy to understand. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 09:35:50 -08:00
Ye Bin	2f6d261e15	md: factor out __md_set_array_info() Factor out __md_set_array_info(). No functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 09:35:50 -08:00
Uros Bizjak	9487a0f685	raid5-cache: use try_cmpxchg in r5l_wake_reclaim Use try_cmpxchg instead of cmpxchg (ptr, old, new) == old in r5l_wake_reclaim. 86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Cc: Song Liu <song@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 09:35:50 -08:00
Li Zhong	3bd548e5b8	drivers/md/md-bitmap: check the return value of md_bitmap_get_counter() Check the return value of md_bitmap_get_counter() in case it returns NULL pointer, which will result in a null pointer dereference. v2: update the check to include other dereference Signed-off-by: Li Zhong <floridsleeves@gmail.com> Signed-off-by: Song Liu <song@kernel.org>	2022-11-14 09:35:49 -08:00
Nikos Tsironis	5434ee8d28	dm clone: Fix typo in block_device format specifier Use %pg for printing the block device name, instead of %pd. Fixes: `385411ffba` ("dm: stop using bdevname") Cc: stable@vger.kernel.org # v5.18+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:48 -04:00
Genjian Zhang	99f4f5bcb9	dm: remove unnecessary assignment statement in alloc_dev() Fixes: `74fe6ba923` ("dm: convert to blk_alloc_disk/blk_cleanup_disk") Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:48 -04:00
Shaomin Deng	48d1a964dc	dm cache: delete the redundant word 'each' in comment Signed-off-by: Shaomin Deng <dengshaomin@cdjrlc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:47 -04:00
Jiangshan Yi	96fccdce97	dm raid: fix typo in analyse_superblocks code comment Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:47 -04:00
Nathan Huckleberry	afd41fff9c	dm verity: enable WQ_HIGHPRI on verify_wq WQ_HIGHPRI increases throughput and decreases disk latency when using dm-verity. This is important in Android for camera startup speed. The following tests were run by doing 60 seconds of random reads using a dm-verity device backed by two ramdisks. Without WQ_HIGHPRI lat (usec): min=13, max=3947, avg=69.53, stdev=50.55 READ: bw=51.1MiB/s (53.6MB/s), 51.1MiB/s-51.1MiB/s (53.6MB/s-53.6MB/s) With WQ_HIGHPRI: lat (usec): min=13, max=7854, avg=31.15, stdev=30.42 READ: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s) Further testing was done by measuring how long it takes to open a camera on an Android device. Without WQ_HIGHPRI Total verity work queue wait times (ms): 880.960, 789.517, 898.852 With WQ_HIGHPRI: Total verity work queue wait times (ms): 528.824, 439.191, 433.300 The average time to open the camera is reduced by 350ms (or 40-50%). Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:47 -04:00
Jilin Yuan	cea446630f	dm raid: delete the redundant word 'that' in comment Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:17:47 -04:00
Mikulas Patocka	43e6c11182	dm: change from DMWARN to DMERR or DMCRIT for fatal errors Change DMWARN to DMERR in cases when there is an unrecoverable error. Change DMWARN to DMCRIT when handling of a case is unimplemented. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 17:16:00 -04:00
Mikulas Patocka	141b3523e9	dm bufio: use the acquire memory barrier when testing for B_READING The function test_bit doesn't provide any memory barrier. It may be possible that the read requests that follow test_bit(B_READING, &b->state) are reordered before the test, reading invalid data that existed before B_READING was cleared. Fix this bug by changing test_bit to test_bit_acquire. This is particularly important on arches with weak(er) memory ordering (e.g. arm64). Depends-On: `8238b45798` ("wait_on_bit: add an acquire memory barrier") Depends-On: `d6ffe6067a` ("provide arch_test_bit_acquire for architectures that define test_bit") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-10-18 12:38:16 -04:00
Jason A. Donenfeld	a251c17aa5	treewide: use get_random_u32() when possible The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>	2022-10-11 17:42:58 -06:00
Jason A. Donenfeld	81895a65ec	treewide: use prandom_u32_max() when possible, part 1 Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int\|prandom_u32\|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) \| - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) \| - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) \| - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int\|prandom_u32\|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int\|prandom_u32\|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 232 - 1 or value == 231 - 1 or value == 224 - 1 or value == 216 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>	2022-10-11 17:42:55 -06:00
Linus Torvalds	7c989b1da3	for-6.1/passthrough-2022-10-04 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg 3ag39nZQVQ== =nwLk -----END PGP SIGNATURE----- Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux Pull passthrough updates from Jens Axboe: "With these changes, passthrough NVMe support over io_uring now performs at the same level as block device O_DIRECT, and in many cases 6-8% better. This contains: - Add support for fixed buffers for passthrough (Anuj, Kanchan) - Enable batched allocations and freeing on passthrough, similarly to what we support on the normal storage path (me) - Fix from Geert fixing an issue with !CONFIG_IO_URING" * tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux: io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy nvme: wire up fixed buffer support for nvme passthrough nvme: pass ubuffer as an integer block: extend functionality to map bvec iterator block: factor out blk_rq_map_bio_alloc helper block: rename bio_map_put to blk_mq_map_bio_put nvme: refactor nvme_alloc_request nvme: refactor nvme_add_user_metadata nvme: Use blk_rq_map_user_io helper scsi: Use blk_rq_map_user_io helper block: add blk_rq_map_user_io io_uring: introduce fixed buffer support for io_uring_cmd io_uring: add io_uring_cmd_import_fixed nvme: enable batched completions of passthrough IO nvme: split out metadata vs non metadata end_io uring_cmd completions block: allow end_io based requests in the completion batch handling block: change request end_io handler to pass back a return value block: enable batched allocation for blk_mq_alloc_request() block: kill deprecated BUG_ON() in the flush handling	2022-10-07 09:35:50 -07:00
Linus Torvalds	513389809e	for-6.1/block-2022-10-03 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a vQNj/iOBQA== =R05e -----END PGP SIGNATURE----- Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...	2022-10-07 09:19:14 -07:00
Linus Torvalds	f4309528f3	dlm for 6.1 This set of commits includes: . Fix a couple races found with a new torture test. . Improve errors when api functions are used incorrectly. . Improve tracing for lock requests from user space. . Fix use after free in recently added tracing code. . Small internal code cleanups. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJjOyfeAAoJEDgbc8f8gGmqHF4QALKGo+95JGzfXN37dNL2ve8L DAKxESYIwaTEWuKxmD4AGogClEl55UoC8kxMB3dHwLZEd4U0v5ZDULR6NUYXMpos 6miaoF+pJfBnpNRqpCieWRW5dYXD4TwSdquv5rUSmUBrdOSy34s/nORWB4kL443K hFPcbo5Mv1L0W70/+gdj1uBlBsenZxnXu6aEmrckONqwj9Q2SBjJTik9WuNwh+FF tEcmUt8kDanGkbwtMCxnbT3HDOdfQyW+qq4IJ6MOYHlW9Cqbp9QUvAIho4DEpr7f eGurQ/urSD3dltzuYQcZ81zGhaGxzaRt5d2AEHRrGugQ2ZvnsG74oSAmEINZTSw4 RV2EXyJ4hXcXK/yJXo3fGzFm2/5JFvYhnvddo6wts3vQZHwefExIRCHVz2cJL9eS gFpfFu4uB8z7w7l9s9LJKv7cTriaDd1WHuIWZGonz3wlFSUOn7IxunDxM3Hc5YO3 okawhr6sWe03fFcKsw1WeWymfDUwmk/7OV15OSDanItAwX5vkBYDBvAcA/cwm8cj P0Vb3c1/Sf1IjjHGGA13vHpD1JXJ7FHafg6jyWmjJNqaS+wtShvs2As9MqbtSWMb o2OcYTEEzME4mMIXZzVlKP7hhkLMaVR5PwGmbPovlyAkEUX0soH7nefyLMAqP3JG 7VZYV46VCL7wm3yjrKYw =sL1G -----END PGP SIGNATURE----- Merge tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Fix a couple races found with a new torture test - Improve errors when api functions are used incorrectly - Improve tracing for lock requests from user space - Fix use after free in recently added tracing cod. - Small internal code cleanups * tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: fix possible use after free if tracing fs: dlm: const void resource name parameter fs: dlm: LSFL_CB_DELAY only for kernel lockspaces fs: dlm: remove DLM_LSFL_FS from uapi fs: dlm: trace user space callbacks fs: dlm: change ls_clear_proc_locks to spinlock fs: dlm: remove dlm_del_ast prototype fs: dlm: handle rcom in else if branch fs: dlm: allow lockspaces have zero lvblen fs: dlm: fix invalid derefence of sb_lvbptr fs: dlm: handle -EINVAL as log_error() fs: dlm: use __func__ for function name fs: dlm: handle -EBUSY first in unlock validation fs: dlm: handle -EBUSY first in lock arg validation fs: dlm: fix race between test_bit() and queue_work() fs: dlm: fix race in lowcomms	2022-10-03 20:11:59 -07:00
Linus Torvalds	d0989d01c6	hardening updates for v6.1-rc1 Various fixes across several hardening areas: - loadpin: Fix verity target enforcement (Matthias Kaehlcke). - zero-call-used-regs: Add missing clobbers in paravirt (Bill Wendling). - CFI: clean up sparc function pointer type mismatches (Bart Van Assche). - Clang: Adjust compiler flag detection for various Clang changes (Sami Tolvanen, Kees Cook). - fortify: Fix warnings in arch-specific code in sh, ARM, and xen. Improvements to existing features: - testing: improve overflow KUnit test, introduce fortify KUnit test, add more coverage to LKDTM tests (Bart Van Assche, Kees Cook). - overflow: Relax overflow type checking for wider utility. New features: - string: Introduce strtomem() and strtomem_pad() to fill a gap in strncpy() replacement needs. - um: Enable FORTIFY_SOURCE support. - fortify: Enable run-time struct member memcpy() overflow warning. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM4chcWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJvq1D/9uKU03RozAOnzhi4gcgRnHZSAK oOQOkPwnkUgFU0yOnMkNYOZ7njLnM+CjCN3RJ9SSpD2lrQ23PwLeThAuOzy0brPO 0iAksIztSF3e5tAyFjtFkjswrY8MSv/TkF0WttTOSOj3lCUcwatF0FBkclCOXtwu ILXfG7K8E17r/wsUejN+oMAI42ih/YeVQAZpKRymEEJsK+Lly7OT4uu3fdFWVb1P M77eRLI2Vg1eSgMVwv6XdwGakpUdwsboK7do0GGX+JOrhayJoCfY2IpwyPz9ciel jsp9OQs8NrlPJMa2sQ7LDl+b5EQl/MtggX3JlQEbLs2LV7gDtYgAWNo6vxCT5Lvd zB7TZqIR3lrVjbtw4FAKQ+41bS4VOajk2NB3Mkiy5AfivB+6zKF+P56a+xSoNhOl iktpjCEP7bp4oxmTMXpOfmywjh/ZsyoMhQ2ABP7S+JZ5rHUndpPAjjuBetIcHxX2 28Wlr4aFIF9ff9caasg4sMYXcQMGnuLUlUKngceUbd1umZZRNZ1gaIxYpm9poefm qd/lvTIvzn9V8IB8wHVmvafbvDbV88A+2bKJdSUDA352Dt9PvqT7yI0dmbMNliGL os+iLPW6Y6x38BxhXax0HR9FEhO3Eq7kLdNdc4J29NvISg8HHaifwNrG41lNwaWL cuc6IAjLxiRk3NsUpg== =HZ6+ -----END PGP SIGNATURE----- Merge tag 'hardening-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: "Most of the collected changes here are fixes across the tree for various hardening features (details noted below). The most notable new feature here is the addition of the memcpy() overflow warning (under CONFIG_FORTIFY_SOURCE), which is the next step on the path to killing the common class of "trivially detectable" buffer overflow conditions (i.e. on arrays with sizes known at compile time) that have resulted in many exploitable vulnerabilities over the years (e.g. BleedingTooth). This feature is expected to still have some undiscovered false positives. It's been in -next for a full development cycle and all the reported false positives have been fixed in their respective trees. All the known-bad code patterns we could find with Coccinelle are also either fixed in their respective trees or in flight. The commit message in commit `54d9469bc5` ("fortify: Add run-time WARN for cross-field memcpy()") for the feature has extensive details, but I'll repeat here that this is a warning _only_, and is not intended to actually block overflows (yet). The many patches fixing array sizes and struct members have been landing for several years now, and we're finally able to turn this on to find any remaining stragglers. Summary: Various fixes across several hardening areas: - loadpin: Fix verity target enforcement (Matthias Kaehlcke). - zero-call-used-regs: Add missing clobbers in paravirt (Bill Wendling). - CFI: clean up sparc function pointer type mismatches (Bart Van Assche). - Clang: Adjust compiler flag detection for various Clang changes (Sami Tolvanen, Kees Cook). - fortify: Fix warnings in arch-specific code in sh, ARM, and xen. Improvements to existing features: - testing: improve overflow KUnit test, introduce fortify KUnit test, add more coverage to LKDTM tests (Bart Van Assche, Kees Cook). - overflow: Relax overflow type checking for wider utility. New features: - string: Introduce strtomem() and strtomem_pad() to fill a gap in strncpy() replacement needs. - um: Enable FORTIFY_SOURCE support. - fortify: Enable run-time struct member memcpy() overflow warning" * tag 'hardening-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (27 commits) Makefile.extrawarn: Move -Wcast-function-type-strict to W=1 hardening: Remove Clang's enable flag for -ftrivial-auto-var-init=zero sparc: Unbreak the build x86/paravirt: add extra clobbers with ZERO_CALL_USED_REGS enabled x86/paravirt: clean up typos and grammaros fortify: Convert to struct vs member helpers fortify: Explicitly check bounds are compile-time constants x86/entry: Work around Clang __bdos() bug ARM: decompressor: Include .data.rel.ro.local fortify: Adjust KUnit test for modular build sh: machvec: Use char[] for section boundaries kunit/memcpy: Avoid pathological compile-time string size lib: Improve the is_signed_type() kunit test LoadPin: Require file with verity root digests to have a header dm: verity-loadpin: Only trust verity targets with enforcement LoadPin: Fix Kconfig doc about format of file with verity digests um: Enable FORTIFY_SOURCE lkdtm: Update tests for memcpy() run-time warnings fortify: Add run-time WARN for cross-field memcpy() fortify: Use SIZE_MAX instead of (size_t)-1 ...	2022-10-03 17:24:22 -07:00
Jens Axboe	de671d6116	block: change request end_io handler to pass back a return value Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:49:09 -06:00
Jens Axboe	736feaa3a0	Merge branch 'for-6.1/block' into for-6.1/passthrough * for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-30 07:47:38 -06:00
Christoph Hellwig	568ec936bf	block: replace blk_queue_nowait with bdev_nowait Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-27 09:57:58 -06:00
Zhou nan	65b94b527d	md: Fix spelling mistake in comments of r5l_log Fix spelling of dones't in comments. Signed-off-by: Zhou nan <zhounan@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:06 -07:00
Logan Gunthorpe	5e2cf333b7	md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d A complicated deadlock exists when using the journal and an elevated group_thrtead_cnt. It was found with loop devices, but its not clear whether it can be seen with real disks. The deadlock can occur simply by writing data with an fio script. When the deadlock occurs, multiple threads will hang in different ways: 1) The group threads will hang in the blk-wbt code with bios waiting to be submitted to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 ops_run_io+0x46b/0x1a30 handle_stripe+0xcd3/0x36b0 handle_active_stripes.constprop.0+0x6f6/0xa60 raid5_do_work+0x177/0x330 Or: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 flush_deferred_bios+0x136/0x170 raid5_do_work+0x262/0x330 2) The r5l_reclaim thread will hang in the same way, submitting a bio to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 submit_bio+0x3f/0xf0 md_super_write+0x12f/0x1b0 md_update_sb.part.0+0x7c6/0xff0 md_update_sb+0x30/0x60 r5l_do_reclaim+0x4f9/0x5e0 r5l_reclaim_thread+0x69/0x30b However, before hanging, the MD_SB_CHANGE_PENDING flag will be set for sb_flags in r5l_write_super_and_discard_space(). This flag will never be cleared because the submit_bio() call never returns. 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe() will do no processing on any pending stripes and re-set STRIPE_HANDLE. This will cause the raid5d thread to enter an infinite loop, constantly trying to handle the same stripes stuck in the queue. The raid5d thread has a blk_plug that holds a number of bios that are also stuck waiting seeing the thread is in a loop that never schedules. These bios have been accounted for by blk-wbt thus preventing the other threads above from continuing when they try to submit bios. --Deadlock. To fix this, add the same wait_event() that is used in raid5_do_work() to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will schedule and wait until the flag is cleared. The schedule action will flush the plug which will allow the r5l_reclaim thread to continue, thus preventing the deadlock. However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING from the same thread and can thus deadlock if the thread is put to sleep. So avoid waiting if md_check_recovery() is being called in the loop. It's not clear when the deadlock was introduced, but the similar wait_event() call in raid5_do_work() was added in 2017 by this commit: `16d997b78b` ("md/raid5: simplfy delaying of writes while metadata is updated.") Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:06 -07:00
Yu Kuai	b9b083f904	md/raid10: convert resync_lock to use seqlock Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier', and io can't be dispatched until 'barrier' is dropped. Since holding the 'barrier' is not common, convert 'resync_lock' to use seqlock so that holding lock can be avoided in fast path. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:05 -07:00
Yu Kuai	4f350284a7	md/raid10: fix improper BUG_ON() in raise_barrier() 'conf->barrier' is protected by 'conf->resync_lock', reading 'conf->barrier' without holding the lock is wrong. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:05 -07:00
Yu Kuai	0c0be98bbe	md/raid10: prevent unnecessary calls to wake_up() in fast path Currently, wake_up() is called unconditionally in fast path such as raid10_make_request(), which will cause lock contention under high concurrency: raid10_make_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty in allow_barrier() and raid10_make_request(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:05 -07:00
Yu Kuai	0de57e541b	md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait For the case nowait in wait_barrier(), there is no point to increase nr_waiting and then decrease it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:05 -07:00
Yu Kuai	ed2e063f92	md/raid10: factor out code from wait_barrier() to stop_waiting_barrier() Currently the nasty condition in wait_barrier() is hard to read. This patch factors out the condition into a function. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:05 -07:00
Logan Gunthorpe	3bfc3bcd78	md: Remove extra mddev_get() in md_seq_start() A regression is seen where mddev devices stay permanently after they are stopped due to an elevated reference count. This was tracked down to an extra mddev_get() in md_seq_start(). It only happened rarely because most of the time the md_seq_start() is called with a zero offset. The path with an extra mddev_get() only happens when it starts with a non-zero offset. The commit noted below changed an mddev_get() to check its success but inadvertently left the original call in. Remove the extra call. Fixes: `12a6caf273` ("md: only delete entries from all_mddevs when the disk is freed") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
David Sloan	c66a6f41e0	md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() When running chunk-sized reads on disks with badblocks duplicate bio free/puts are observed: ============================================================================= BUG bio-200 (Not tainted): Object already free ----------------------------------------------------------------------------- Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504 __slab_alloc.constprop.0+0x5a/0xb0 kmem_cache_alloc+0x31e/0x330 mempool_alloc_slab+0x17/0x20 mempool_alloc+0x100/0x2b0 bio_alloc_bioset+0x181/0x460 do_mpage_readpage+0x776/0xd00 mpage_readahead+0x166/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 force_page_cache_ra+0x181/0x1c0 page_cache_sync_ra+0x65/0xb0 filemap_get_pages+0x1df/0xaf0 filemap_read+0x1e1/0x700 blkdev_read_iter+0x1e5/0x330 vfs_read+0x42a/0x570 Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 raid5_make_request+0x2259/0x2450 md_handle_request+0x402/0x600 md_submit_bio+0xd9/0x120 __submit_bio+0x11f/0x1b0 submit_bio_noacct_nocheck+0x204/0x480 submit_bio_noacct+0x32e/0xc70 submit_bio+0x98/0x1a0 mpage_readahead+0x250/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked\|slab\|head\|node=0\|zone=2\|lastcpupid=0x1fffff) CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: raid5wq raid5_do_work Call Trace: <TASK> dump_stack_lvl+0x5a/0x78 dump_stack+0x10/0x16 print_trailer+0x158/0x165 object_err+0x35/0x50 free_debug_processing.cold+0xb7/0xbe __slab_free+0x1ae/0x330 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 mpage_end_io+0x36/0x150 bio_endio+0x2fd/0x360 md_end_io_acct+0x7e/0x90 bio_endio+0x2fd/0x360 handle_failed_stripe+0x960/0xb80 handle_stripe+0x1348/0x3760 handle_active_stripes.constprop.0+0x72a/0xaf0 raid5_do_work+0x177/0x330 process_one_work+0x616/0xb20 worker_thread+0x2bd/0x6f0 kthread+0x179/0x1b0 ret_from_fork+0x22/0x30 </TASK> The double free is caused by an unnecessary bio_put() in the if(is_badblock(...)) error path in raid5_read_one_chunk(). The error path was moved ahead of bio_alloc_clone() in `c82aa1b767` ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk"). The previous code checked and freed align_bio which required a bio_put. After the move that is no longer needed as raid_bio is returned to the control of the common io path which performs its own endio resulting in a double free on bad device blocks. Fixes: `c82aa1b767` ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk") Signed-off-by: David Sloan <david.sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
Logan Gunthorpe	e2eed85bc7	md/raid5: Ensure stripe_fill happens on non-read IO with journal When doing degrade/recover tests using the journal a kernel BUG is hit at drivers/md/raid5.c:4381 in handle_parity_checks5(): BUG_ON(!test_bit(R5_UPTODATE, &dev->flags)); This was found to occur because handle_stripe_fill() was skipped for stripes in the journal due to a condition in that function. Thus blocks were not fetched and R5_UPTODATE was not set when the code reached handle_parity_checks5(). To fix this, don't skip handle_stripe_fill() unless the stripe is for read. Fixes: `07e8336484` ("md/r5cache: shift complex rmw from read path to write path") Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/ Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
Logan Gunthorpe	f9287c3e93	md/raid5: Don't read ->active_stripes if it's not needed The atomic_read() is not needed in many cases so only do the read after the first checks are done. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
Logan Gunthorpe	2f2d51efd8	md/raid5: Cleanup prototype of raid5_get_active_stripe() Drop the three bools in the prototype of raid5_get_active_stripe() and replace them with a flags parameter. At the same time, drop the distinction with __raid5_get_active_stripe(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
Logan Gunthorpe	9892fa993f	md/raid5: Drop extern on function declarations in raid5.h externs should not be used in function declarations, so clean those up. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:04 -07:00
Logan Gunthorpe	b6d56144fe	md/raid5: Refactor raid5_get_active_stripe() Refactor raid5_get_active_stripe() without the gotos with an explicit infinite loop and some additional nesting. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:03 -07:00
Saurabh Sengar	1727fd5015	md: Replace snprintf with scnprintf Current code produces a warning as shown below when total characters in the constituent block device names plus the slashes exceeds 200. snprintf() returns the number of characters generated from the given input, which could cause the expression “200 – len” to wrap around to a large positive number. Fix this by using scnprintf() instead, which returns the actual number of characters written into the buffer. [ 1513.267938] ------------[ cut here ]------------ [ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510 [ 1513.267944] Modules linked in: <snip> [ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu [ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022 [ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510 <-snip-> [ 1513.267982] Call Trace: [ 1513.267986] snprintf+0x45/0x70 [ 1513.267990] ? disk_name+0x71/0xa0 [ 1513.267993] dump_zones+0x114/0x240 [raid0] [ 1513.267996] ? _cond_resched+0x19/0x40 [ 1513.267998] raid0_run+0x19e/0x270 [raid0] [ 1513.268000] md_run+0x5e0/0xc50 [ 1513.268003] ? security_capable+0x3f/0x60 [ 1513.268005] do_md_run+0x19/0x110 [ 1513.268006] md_ioctl+0x195e/0x1f90 [ 1513.268007] blkdev_ioctl+0x91f/0x9f0 [ 1513.268010] block_ioctl+0x3d/0x50 [ 1513.268012] do_vfs_ioctl+0xa9/0x640 [ 1513.268014] ? __fput+0x162/0x260 [ 1513.268016] ksys_ioctl+0x75/0x80 [ 1513.268017] __x64_sys_ioctl+0x1a/0x20 [ 1513.268019] do_syscall_64+0x5e/0x200 [ 1513.268021] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `766038846e` ("md/raid0: replace printk() with pr_*()") Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:03 -07:00
Guoqing Jiang	62bca04bb7	md/raid10: fix compile warning With W=1, compiler complains. drivers/md/raid10.c:1983: warning: bad line: Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:03 -07:00
XU pengfei	12ba6676b9	md/raid5: Fix spelling mistakes in comments Fix spelling of 'waitting' in comments. Signed-off-by: XU pengfei <xupengfei@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>	2022-09-22 00:05:03 -07:00
Coly Li	d2d05b8803	bcache: fix set_at_max_writeback_rate() for multiple attached devices Inside set_at_max_writeback_rate() the calculation in following if() check is wrong, if (atomic_inc_return(&c->idle_counter) < atomic_read(&c->attached_dev_nr) * 6) Because each attached backing device has its own writeback thread running and increasing c->idle_counter, the counter increates much faster than expected. The correct calculation should be, (counter / dev_nr) < dev_nr * 6 which equals to, counter < dev_nr * dev_nr * 6 This patch fixes the above mistake with correct calculation, and helper routine idle_counter_exceeded() is added to make code be more clear. Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-19 11:12:35 -06:00
Jilin Yuan	6dd3be6923	bcache:: fix repeated words in comments Delete the redundant word 'we'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220919161647.81238-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-19 11:12:35 -06:00
Jules Maselbas	11e529ccea	bcache: bset: Fix comment typos Remove the redundant word `by`, correct the typo `creaated`. CC: Kent Overstreet <kent.overstreet@gmail.com> CC: linux-bcache@vger.kernel.org Signed-off-by: Jules Maselbas <jmaselbas@kalray.eu> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220919161647.81238-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-19 11:12:35 -06:00
Lin Feng	d86b4e6dc8	bcache: remove unused bch_mark_cache_readahead function def in stats.h This is a cleanup for commit `1616a4c2ab` ("bcache: remove bcache device self-defined readahead")', currently no user for bch_mark_cache_readahead() since that commit. Signed-off-by: Lin Feng <linf@wangsu.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220919161647.81238-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-19 11:12:35 -06:00
Li Lei	97d26ae764	bcache: remove unnecessary flush_workqueue All pending works will be drained by destroy_workqueue(), no need to call flush_workqueue() explicitly. Signed-off-by: Li Lei <lilei@szsandstone.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220919161647.81238-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-19 11:12:35 -06:00
Matthias Kaehlcke	916ef6232c	dm: verity-loadpin: Only trust verity targets with enforcement Verity targets can be configured to ignore corrupted data blocks. LoadPin must only trust verity targets that are configured to perform some kind of enforcement when data corruption is detected, like returning an error, restarting the system or triggering a panic. Fixes: `b6c1c5745c` ("dm: Add verity helpers for LoadPin") Reported-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Sarthak Kukreti <sarthakkukreti@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220907133055.1.Ic8a1dafe960dc0f8302e189642bc88ebb785d274@changeid	2022-09-07 16:37:27 -07:00
Linus Torvalds	3e5c673f0d	block-6.0-2022-08-26 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMI9VgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgY+D/0axaG4MQODNSCyzTkKdFJqeL1Ab6tgz+SP w4hqGhBA50P18nqQo14v/l2wBnc3DnvX6OcoAj2v+27WRw4VAC6pYu9cHiiz0sK0 WyfYE/3qdB5PTe3149UcMd0BlPzMiadYKYhDbS5zLq5VGnun4oOT6uJugdz5toSl GgPJ4zQTjUE/w7pJ40fSTZ82FPCqVndBaSMycLvzzAkzfKdMxMIPBqFTDdW/gFqG 1Xc7NI4vASw8jsrG+TkPESNf4j0WcEs8+zCV9d9WFSQjWgktV1NpdCZ9CA+jD0cV h0i31SB43pttZZTARPraPgQdC23L22CpxxNTuzwz5ZKAuSulML87S/W2lN7HnagQ GAl6irQdB9RXN5o6vmOXkZIfmfuTctrXay7g13tNk/WUrGb/3euk4CJMYJWA4fM7 VkUS3yCjvP5z8auqOqti7zEy1QSN8mvdtsafU1n+tsf0WdIIqH+sufSnmJExbOs9 gzL2xcabOFCD5WYDCpl/SkbnY4G7Qrk4oHIB9S/r0aOKVHFdQ8JMp3GBVjSlHulN Nz3oScNQBG3CSvJ2QbFi6UQxt6YSs+tx/oa1plmeCe4w9tModooL0CdfmKMBwPXT AGlHvUIGyl4ZT/XamQ19AdFOE7WLHPG7wYtZUaFiNtkjAsgMoQr2bwYeTpb29Cfu 881BssmWCQ== =qR5N -----END PGP SIGNATURE----- Merge tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - MD pull request via Song: - Fix for clustered raid (Guoqing Jiang) - req_op fix (Bart Van Assche) - Fix race condition in raid recreate (David Sloan) - loop configuration overflow fix (Siddh) - Fix missing commit_rqs call for certain conditions (Yu) * tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block: md: call __md_stop_writes in md_stop Revert "md-raid: destroy the bitmap after destroying the thread" md: Flush workqueue md_rdev_misc_wq in md_alloc() md/raid10: Fix the data type of an r10_sync_page_io() argument loop: Check for overflow while configuring loop blk-mq: fix io hung due to missing commit_rqs	2022-08-26 11:05:54 -07:00
Guoqing Jiang	0dd84b3193	md: call __md_stop_writes in md_stop From the link [1], we can see raid1d was running even after the path raid_dtr -> md_stop -> __md_stop. Let's stop write first in destructor to align with normal md-raid to fix the KASAN issue. [1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a Fixes: `48df498daf` ("md: move bitmap_destroy to the beginning of __md_stop") Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-08-24 11:19:59 -07:00
Guoqing Jiang	1d258758cf	Revert "md-raid: destroy the bitmap after destroying the thread" This reverts commit `e151db8ecf`. Because it obviously breaks clustered raid as noticed by Neil though it fixed KASAN issue for dm-raid, let's revert it and fix KASAN issue in next commit. [1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470 Fixes: `e151db8ecf` ("md-raid: destroy the bitmap after destroying the thread") Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>	2022-08-24 11:19:23 -07:00
David Sloan	5e8daf906f	md: Flush workqueue md_rdev_misc_wq in md_alloc() A race condition still exists when removing and re-creating md devices in test cases. However, it is only seen on some setups. The race condition was tracked down to a reference still being held to the kobject by the rdev in the md_rdev_misc_wq which will be released in rdev_delayed_delete(). md_alloc() waits for previous deletions by waiting on the md_misc_wq, but the md_rdev_misc_wq may still be holding a reference to a recently removed device. To fix this, also flush the md_rdev_misc_wq in md_alloc(). Signed-off-by: David Sloan <david.sloan@eideticom.com> [logang@deltatee.com: rewrote commit message] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-08-24 10:26:35 -07:00
Bart Van Assche	265ad47a40	md/raid10: Fix the data type of an r10_sync_page_io() argument Fix the following sparse warning: drivers/md/raid10.c:2647:60: sparse: sparse: incorrect type in argument 5 (different base types) @@ expected restricted blk_opf_t [usertype] opf @@ got int rw @@ This patch does not change any functionality since REQ_OP_READ = READ = 0 and since REQ_OP_WRITE = WRITE = 1. Cc: Rong A Chen <rong.a.chen@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Fixes: `4ce4c73f66` ("md/core: Combine two sync_page_io() arguments") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Song Liu <song@kernel.org>	2022-08-24 10:26:35 -07:00
Alexander Aring	12cda13cfd	fs: dlm: remove DLM_LSFL_FS from uapi The DLM_LSFL_FS flag is set in lockspaces created directly for a kernel user, as opposed to those lockspaces created for user space applications. The user space libdlm allowed this flag to be set for lockspaces created from user space, but then used by a kernel user. No kernel user has ever used this method, so remove the ability to do it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2022-08-23 14:54:54 -05:00
Linus Torvalds	c3adefb5ba	- A few fixes for the DM verity and bufio changes from the 6.0 merge. - A smatch warning fix for DM writecache locking in writecache_map. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmL1q20ACgkQxSPxCi2d A1rZvggAxFFaj2V12TmJZ/D75ptDFfqsfEomsLKGqjq0pIYfhawWBnz0IgHd34vC 6Qy8bgoUGFQaexruFw6AjJQ3goaTJfFMy1/Nrf5Mvs7URlk7ckWgSz5ng9BRvePx Qyp03BKjtWpu++uKJpKq1DrHXTor0J73dVkHCnAcqHHKmaiZdy9gf+5OdUMYcBX6 rNwLqlSqGGMG2TQp6/tUdytUsB2GyhAPs/uhTtTDa0glTwRvYWU0dhacuBqngLwr G+UkPbE23s3NWMhvCK9qbTPT79prEgLrC/WXioeFBxJaV4LbXY/6hZ3817JpwjRv aDxFXr8V6lWLJEeYY0MxImAGE2DUiA== =yCb1 -----END PGP SIGNATURE----- Merge tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - A few fixes for the DM verity and bufio changes in this merge window - A smatch warning fix for DM writecache locking in writecache_map * tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm bufio: fix some cases where the code sleeps with spinlock held dm writecache: fix smatch warning about invalid return from writecache_map dm verity: fix verity_parse_opt_args parsing dm verity: fix DM_VERITY_OPTS_MAX value yet again dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking	2022-08-11 19:46:48 -07:00
Mikulas Patocka	e3a7c2947b	dm bufio: fix some cases where the code sleeps with spinlock held Commit `b32d45824a` ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag") added a "NO_SLEEP" mode, it replaces a mutex with a spinlock, and it is only usable when the device is in read-only mode (because the write path may be sleeping while holding the dm_bufio_client lock). However, there are still two points where the code could sleep even in read-only mode. One is in __get_unclaimed_buffer -> __make_buffer_clean. The other is in __try_evict_buffer -> __make_buffer_clean. These functions will call __make_buffer_clean which sleeps if the buffer is being read. Fix these cases so that if c->no_sleep is set __make_buffer_clean will not be called and the buffer will be skipped instead. Fixes: `b32d45824a` ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-11 11:10:42 -04:00
Mikulas Patocka	b7f362d641	dm writecache: fix smatch warning about invalid return from writecache_map There's a smatch warning "inconsistent returns '&wc->lock'" in dm-writecache. The reason for the warning is that writecache_map() doesn't drop the lock on the impossible path. Fix this warning by adding wc_unlock() after the BUG statement (so that it will be compiled-away anyway). Fixes: `df699cc16e` ("dm writecache: report invalid return from writecache_map helpers") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-09 19:20:23 -04:00
Mike Snitzer	f876df9f12	dm verity: fix verity_parse_opt_args parsing Commit `df326e7a06` ("dm verity: allow optional args to alter primary args handling") introduced a bug where verity_parse_opt_args() wouldn't properly shift past an optional argument's additional params (by ignoring them). Fix this by avoiding returning with error if an unknown argument is encountered when @only_modifier_opts=true is passed to verity_parse_opt_args(). In practice this regressed the cryptsetup testsuite's FEC testing because unknown optional arguments were encountered, wherey short-circuiting ever testing FEC mode. With this fix all of the cryptsetup testsuite's verity FEC tests pass. Fixes: `df326e7a06` ("dm verity: allow optional args to alter primary args handling") Reported-by: Milan Broz <gmazyland@gmail.com>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-09 19:15:17 -04:00
Mike Snitzer	8c22816dbc	dm verity: fix DM_VERITY_OPTS_MAX value yet again Must account for the possibility that "try_verify_in_tasklet" is used. This is the same issue that was fixed with commit `160f99db94` -- it is far too easy to miss that additional a new argument(s) require bumping DM_VERITY_OPTS_MAX accordingly. Fixes: `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-09 19:12:14 -04:00
Mike Snitzer	b33b6fdca3	dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking Historically none of the bufio code runs in interrupt context but with the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") That said, the new tasklet usecase still doesn't require interrupts be disabled by bufio (let alone conditionally restore them). Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care must be taken to properly synchronize between softirq and process context, otherwise ABBA deadlock may occur. While it is unnecessary to disable bottom-half preemption within a tasklet, we must consistently do so in process context to ensure locking is in the proper order. Fix these issues by switching from spin_lock_irq{save,restore} to using spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member in dm_bufio_client struct (that can be used unsafely if bufio must recurse on behalf of some caller, e.g. block layer's submit_bio). Fixes: `5721d4e5a9` ("dm verity: Add optional "try_verify_in_tasklet" feature") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2022-08-09 19:12:14 -04:00
Linus Torvalds	20cf903a0c	- Add flags argument to dm_bufio_client_create and introduce DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather than mutex for its locking. - Add optional "try_verify_in_tasklet" feature to DM verity target. This feature gives users the option to improve IO latency by using a tasklet to verify, using hashes in bufio's cache, rather than wait to schedule a work item via workqueue. But if there is a bufio cache miss, or an error, then the tasklet will fallback to using workqueue. - Incremental changes to both dm-bufio and the DM verity target to use jump_label to minimize cost of branching associated with the niche "try_verify_in_tasklet" feature. DM-bufio in particular is used by quite a few other DM targets so it doesn't make sense to incur additional bufio cost in those targets purely for the benefit of this niche verity feature if the feature isn't ever used. - Optimize verity_verify_io, which is used by both workqueue and tasklet based verification, if FEC is not configured or tasklet based verification isn't used. - Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if "try_verify_in_tasklet" is specified. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmLtYU0ACgkQxSPxCi2d A1pIDwgAjQi7jSxN7n+Fb4sJLL5x3WvuVGcockIkucj+Pvr3nvijwkf27+kbCWhn d4bDhA60gCebd87lf2PZTf8LL2+h9SLzFDTrgBVg5eC4O8aoQNrgwMMKVvYn+MmK OShurwHXS/7iqCETFaUA7hVtH/NwSWzP7WL5+QIDVOWVGaTLnqdvA4TYSZnljEg2 c02bL2KK+ndsYYshDq7HnVuqr4hIBWKF6y0lApU42mfTCnghX8ZnUMG9pO9K+20X qVfQH58CjOTP0MaHsddyR1sTKKZ1qY1HdoDhnlMVfZD5XqnCMhzefKoMxbxJKmJ3 7hS5w2tNxSx4yYWGj3dXHKhEZi0buA== =ZBi4 -----END PGP SIGNATURE----- Merge tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper updates from Mike Snitzer: - Add flags argument to dm_bufio_client_create and introduce DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather than mutex for its locking. - Add optional "try_verify_in_tasklet" feature to DM verity target. This feature gives users the option to improve IO latency by using a tasklet to verify, using hashes in bufio's cache, rather than wait to schedule a work item via workqueue. But if there is a bufio cache miss, or an error, then the tasklet will fallback to using workqueue. - Incremental changes to both dm-bufio and the DM verity target to use jump_label to minimize cost of branching associated with the niche "try_verify_in_tasklet" feature. DM-bufio in particular is used by quite a few other DM targets so it doesn't make sense to incur additional bufio cost in those targets purely for the benefit of this niche verity feature if the feature isn't ever used. - Optimize verity_verify_io, which is used by both workqueue and tasklet based verification, if FEC is not configured or tasklet based verification isn't used. - Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if "try_verify_in_tasklet" is specified. * tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet" dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND dm verity: only copy bvec_iter in verity_verify_io if in_tasklet dm verity: optimize verity_verify_io if FEC not configured dm verity: conditionally enable branching for "try_verify_in_tasklet" dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP dm verity: allow optional args to alter primary args handling dm verity: Add optional "try_verify_in_tasklet" feature dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag dm bufio: Add flags argument to dm_bufio_client_create	2022-08-06 11:09:55 -07:00
Linus Torvalds	6614a3c316	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/ SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE= =w/UH -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...	2022-08-05 16:32:45 -07:00
Linus Torvalds	fa9db655d0	for-5.20/block-2022-08-04 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLsRfkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj43EADBydQhe7nQHH65gecqvttnio2GqEmcbozt lKFQlPPd3SHGMAJjSdR1dIwqtPsJ8q6xZXH+TjHhLXb2kgVu+TQ31krNHIqBwE14 s7SsgGRgvopA46lSf/ls18/8sh6Yz1NgI39YcMVPjvkbLaVFK7zRkL9OSp4RQCwH u/IIHJmV415EeF6QNTgABBel/gEIPBLsvwOxTBIkzDOyUohtExZPYj83MDm7jdr3 jsTUd2MiumNMh7ziMJIp1iN32nQOtIKtwWZaMHDCzfU/IUnBSmh2nj9oXr3+vcwo IsBMDUfUj9Eig5QQ/XcVIrFezi0GnunpBhScXPqL+dxPN812lzxNjkx6PsC+rPn8 mWmXoaeK1ayoyotdHJlmINNmWUSCkOMwVnA2r1c4Hp4cQS5vRUtkKcpNLTpMhk4I OwQ3bjt9mA//WlH+apbhJqXqxjcoBwCwMoveJ4mHVtku9lo+JJAKVGdUs17QjZkC NxACP1MtBcXy1hurNQf14oH5C0Hyg4TBJShPauKmrqGtOFnbOAdX2qIhldvyNfH1 l9cOvGNSgbQ6FLD6MVto6dC/KYOEM3LelVxgNB/80GbSmGwj88Kd/nzQLYFP89JJ 0Wkt14mSkm82gabOvNqXGG8P8hLb/+v6sp4qZv0mf+op0xmb4FB5eaZvoceptVzM 3Z+hmT7MfA== =pgNf -----END PGP SIGNATURE----- Merge tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - NVMe pull requests via Christoph: - add support for In-Band authentication (Hannes Reinecke) - handle the persistent internal error AER (Michael Kelley) - use in-capsule data for TCP I/O queue connect (Caleb Sander) - remove timeout for getting RDMA-CM established event (Israel Rukshin) - misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni, Guixin Liu, Xiang wangx) - use command_id instead of req->tag in trace_nvme_complete_rq() (Bean Huo) - various fixes for the new authentication code (Lukas Bulwahn, Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes Reinecke) - small cleanups (Liu Song, Christoph Hellwig) - restore compat_ioctl support (Nick Bowler) - make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg) - enable generic interface (/dev/ngXnY) for unknown command sets (Joel Granados, Christoph Hellwig) - don't always build constants.o (Christoph Hellwig) - print the command name of aborted commands (Christoph Hellwig) - MD pull requests via Song: - Improve raid5 lock contention, by Logan Gunthorpe. - Misc fixes to raid5, by Logan Gunthorpe. - Fix race condition with md_reap_sync_thread(), by Guoqing Jiang. - Fix potential deadlock with raid5_quiesce and raid5_get_active_stripe, by Logan Gunthorpe. - Refactoring md_alloc(), by Christoph" - Fix md disk_name lifetime problems, by Christoph Hellwig - Convert prepare_to_wait() to wait_woken() api, by Logan Gunthorpe; - Fix sectors_to_do bitmap issue, by Logan Gunthorpe. - Work on unifying the null_blk module parameters and configfs API (Vincent) - drbd bitmap IO error fix (Lars) - Set of rnbd fixes (Guoqing, Md Haris) - Remove experimental marker on bcache async device registration (Coly) - Series from cleaning up the bio splitting (Christoph) - Removal of the sx8 block driver. This hardware never really widespread, and it didn't receive a lot of attention after the initial merge of it back in 2005 (Christoph) - A few fixes for s390 dasd (Eric, Jiang) - Followup set of fixes for ublk (Ming) - Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang) - Fixes for the dio dma alignment (Keith) - Misc fixes and cleanups (Ming, Yu, Dan, Christophe * tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits) s390/dasd: Establish DMA alignment s390/dasd: drop unexpected word 'for' in comments ublk_drv: add support for UBLK_IO_NEED_GET_DATA ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA ublk_drv: cleanup ublksrv_ctrl_dev_info ublk_drv: add SET_PARAMS/GET_PARAMS control command ublk_drv: fix ublk device leak in case that add_disk fails ublk_drv: cancel device even though disk isn't up block: fix leaking page ref on truncated direct io block: ensure bio_iov_add_page can't fail block: ensure iov_iter advances for added pages drivers:md:fix a potential use-after-free bug md/raid5: Ensure batch_last is released before sleeping for quiesce md/raid5: Move stripe_request_ctx up md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage() md/raid5: Make is_inactive_blocked() helper md/raid5: Refactor raid5_get_active_stripe() block: pass struct queue_limits to the bio splitting helpers block: move bio_allowed_max_sectors to blk-merge.c block: move the call to get_max_io_size out of blk_bio_segment_split ...	2022-08-04 20:00:14 -07:00
Mike Snitzer	12907efde6	dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet" Allow verify_wq to preempt softirq since verification in tasklet will fall-back to using it for error handling (or if the bufio cache doesn't have required hashes). Suggested-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 15:59:52 -04:00
Mike Snitzer	43fa47cb11	dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND The documentation [1] says that WQ_CPU_INTENSIVE is "meaningless" for unbound wq. So remove WQ_CPU_INTENSIVE from the verify_wq allocation. 1. https://www.kernel.org/doc/html/latest/core-api/workqueue.html#flags Suggested-by: Maksym Planeta <mplaneta@os.inf.tu-dresden.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 14:53:07 -04:00
Mike Snitzer	e9307e3deb	dm verity: only copy bvec_iter in verity_verify_io if in_tasklet Avoid extra bvec_iter copy unless it is needed to allow retrying verification, that failed from a tasklet, from a workqueue. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 14:33:42 -04:00
Mike Snitzer	0a36463f4c	dm verity: optimize verity_verify_io if FEC not configured Only declare and copy bvec_iter if CONFIG_DM_VERITY_FEC is defined and FEC enabled for the verity device. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 14:02:20 -04:00
Mike Snitzer	ba2cce82ba	dm verity: conditionally enable branching for "try_verify_in_tasklet" Use jump_label to limit the need for branching unless the optional "try_verify_in_tasklet" feature is used. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 13:50:43 -04:00
Mike Snitzer	3c1c875d05	dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP Use jump_label to limit the need for branching unless the optional DM_BUFIO_CLIENT_NO_SLEEP is used. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 13:50:43 -04:00
Mike Snitzer	df326e7a06	dm verity: allow optional args to alter primary args handling The previous commit ("dm verity: Add optional "try_verify_in_tasklet" feature") imposed that CRYPTO_ALG_ASYNC mask be used even if the optional "try_verify_in_tasklet" feature was not specified. This was because verity_parse_opt_args() was called after handling the primary args (due to it having data dependencies on having first parsed all primary args). Enhance verity_ctr() so that simple optional args, that don't have a data dependency on primary args parsing, can alter how the primary args are handled. In practice this means verity_parse_opt_args() gets called twice. First with the new 'only_modifier_opts' arg set to true, then again with it set to false _after_ parsing all primary args. This allows the v->use_tasklet flag to be properly set and then used when verity_ctr() parses the primary args and then calls crypto_alloc_ahash() with CRYPTO_ALG_ASYNC conditionally set. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 13:50:43 -04:00
Nathan Huckleberry	5721d4e5a9	dm verity: Add optional "try_verify_in_tasklet" feature Using tasklets for disk verification can reduce IO latency. When there are accelerated hash instructions it is often better to compute the hash immediately using a tasklet rather than deferring verification to a work-queue. This reduces time spent waiting to schedule work-queue jobs, but requires spending slightly more time in interrupt context. If the dm-bufio cache does not have the required hashes we fallback to the work-queue implementation. FEC is only possible using work-queue because code to support the FEC feature may sleep. The following shows a speed comparison of random reads on a dm-verity device. The dm-verity device uses a 1G ramdisk for data and a 1G ramdisk for hashes. One test was run using tasklets and one test was run using the existing work-queue solution. Both tests were run when the dm-bufio cache was hot. The tasklet implementation performs significantly better since there is no time spent waiting for work-queue jobs to be scheduled. READ: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s), io=512MiB (537MB), run=2827-2827msec READ: bw=23.6MiB/s (24.8MB/s), 23.6MiB/s-23.6MiB/s (24.8MB/s-24.8MB/s), io=512MiB (537MB), run=21688-21688msec Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-08-04 13:50:40 -04:00
Wentao_Liang	104212471b	drivers:md:fix a potential use-after-free bug In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and may cause sh to be released. However, sh is subsequently used in lines 2886 "if (sh->batch_head && sh != sh->batch_head)". This may result in an use-after-free bug. It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of the function. Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Logan Gunthorpe	20313b1b8c	md/raid5: Ensure batch_last is released before sleeping for quiesce A race condition exists where if raid5_quiesce() is called in the middle of a request that has set batch_last, it will deadlock. batch_last will hold a reference to a stripe when raid5_quiesce() is called. This will cause the next raid5_get_active_stripe() call to sleep waiting for the quiesce to finish, but the raid5_quiesce() thread will wait for active_stripes to go to zero which will never happen because request thread is waiting for the quiesce to stop. Fix this by creating a special __raid5_get_active_stripe() function which takes the request context and clears the last_batch before sleeping. While we're at it, change the arguments of raid5_get_active_stripe() to bools. Fixes: `3312e6c887` ("md/raid5: Keep a reference to last stripe_head for batch") Reported-by: David Sloan <David.Sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Logan Gunthorpe	df6b0e205d	md/raid5: Move stripe_request_ctx up Move stripe_request_ctx up. No functional changes intended. This will be necessary in the next patch to release the batch_last in the context before sleeping. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Logan Gunthorpe	9734fe7bd5	md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage() Now that raid5_get_active_stripe() has been refactored it is appearant that r5c_check_stripe_cache_usage() doesn't need to be called in the wait_for_stripe branch. r5c_check_stripe_cache_usage() will only conditionally call r5l_wake_reclaim(), but that function is called two lines later. Drop the call for cleanup. Reported-by: Martin Oliveira <martin.oliveira@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Logan Gunthorpe	3514da58be	md/raid5: Make is_inactive_blocked() helper The logic to wait_for_stripe is difficult to parse being on so many lines and with confusing operator precedence. Move it to a helper function to make it easier to read. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Logan Gunthorpe	5165ed40a1	md/raid5: Refactor raid5_get_active_stripe() Refactor the raid5_get_active_stripe() to read more linearly in the order it's typically executed. The init_stripe() call is called if a free stripe is found and the function is exited early which removes a lot of if (sh) checks and unindents the following code. Remove the while loop in favour of the 'goto retry' pattern, which reduces indentation further. And use a 'goto wait_for_stripe' instead of an additional indent seeing it is the unusual path and this makes the code easier to read. No functional changes intended. Will make subsequent changes in patches easier to understand. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:53 -06:00
Christoph Hellwig	46754bd056	block: move ->bio_split to the gendisk Only non-passthrough requests are split by the block layer and use the ->bio_split bio_set. Move it from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 21:08:49 -06:00
Christoph Hellwig	5a97806f7d	block: change the blk_queue_split calling convention The double indirect bio leads to somewhat suboptimal code generation. Instead return the (original or split) bio, and make sure the request_queue arguments to the lower level helpers is passed after the bio to avoid constant reshuffling of the argument passing registers. Also give it and the helpers used to implement it more descriptive names. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:53 -06:00
Mikulas Patocka	d17f744e88	md-raid10: fix KASAN warning There's a KASAN warning in raid10_remove_disk when running the lvm test lvconvert-raid-reshape.sh. We fix this warning by verifying that the value "number" is valid. BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10] Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682 CPU: 3 PID: 124682 Comm: mdX_raid10 Not tainted 5.19.0-rc6 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x45/0x57a ? __lock_text_start+0x18/0x18 ? raid10_remove_disk+0x61/0x2a0 [raid10] kasan_report+0xa8/0xe0 ? raid10_remove_disk+0x61/0x2a0 [raid10] raid10_remove_disk+0x61/0x2a0 [raid10] Buffer I/O error on dev dm-76, logical block 15344, async page read ? __mutex_unlock_slowpath.constprop.0+0x1e0/0x1e0 remove_and_add_spares+0x367/0x8a0 [md_mod] ? super_written+0x1c0/0x1c0 [md_mod] ? mutex_trylock+0xac/0x120 ? _raw_spin_lock+0x72/0xc0 ? _raw_spin_lock_bh+0xc0/0xc0 md_check_recovery+0x848/0x960 [md_mod] raid10d+0xcf/0x3360 [raid10] ? sched_clock_cpu+0x185/0x1a0 ? rb_erase+0x4d4/0x620 ? var_wake_function+0xe0/0xe0 ? psi_group_change+0x411/0x500 ? preempt_count_sub+0xf/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? raid10_sync_request+0x36c0/0x36c0 [raid10] ? preempt_count_sub+0xf/0xc0 ? _raw_spin_unlock_irqrestore+0x19/0x40 ? del_timer_sync+0xa9/0x100 ? try_to_del_timer_sync+0xc0/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? _raw_spin_unlock_irq+0x11/0x24 ? __list_del_entry_valid+0x68/0xa0 ? finish_wait+0xa3/0x100 md_thread+0x161/0x260 [md_mod] ? unregister_md_personality+0xa0/0xa0 [md_mod] ? _raw_spin_lock_irqsave+0x78/0xc0 ? prepare_to_wait_event+0x2c0/0x2c0 ? unregister_md_personality+0xa0/0xa0 [md_mod] kthread+0x148/0x180 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 124495: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x80/0xa0 setup_conf+0x140/0x5c0 [raid10] raid10_run+0x4cd/0x740 [raid10] md_run+0x6f9/0x1300 [md_mod] raid_ctr+0x2531/0x4ac0 [dm_raid] dm_table_add_target+0x2b0/0x620 [dm_mod] table_load+0x1c8/0x400 [dm_mod] ctl_ioctl+0x29e/0x560 [dm_mod] dm_compat_ctl_ioctl+0x7/0x20 [dm_mod] __do_compat_sys_ioctl+0xfa/0x160 do_syscall_64+0x90/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Last potentially related work creation: kasan_save_stack+0x1e/0x40 __kasan_record_aux_stack+0x9e/0xc0 kvfree_call_rcu+0x84/0x480 timerfd_release+0x82/0x140 L __fput+0xfa/0x400 task_work_run+0x80/0xc0 exit_to_user_mode_prepare+0x155/0x160 syscall_exit_to_user_mode+0x12/0x40 do_syscall_64+0x42/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Second to last potentially related work creation: kasan_save_stack+0x1e/0x40 __kasan_record_aux_stack+0x9e/0xc0 kvfree_call_rcu+0x84/0x480 timerfd_release+0x82/0x140 __fput+0xfa/0x400 task_work_run+0x80/0xc0 exit_to_user_mode_prepare+0x155/0x160 syscall_exit_to_user_mode+0x12/0x40 do_syscall_64+0x42/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The buggy address belongs to the object at ffff889108f3d200 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 0 bytes to the right of 256-byte region [ffff889108f3d200, ffff889108f3d300) The buggy address belongs to the physical page: page:000000007ef2a34c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1108f3c head:000000007ef2a34c order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab\|head\|zone=2) raw: 4000000000010200 0000000000000000 dead000000000001 ffff889100042b40 raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Mikulas Patocka	e151db8ecf	md-raid: destroy the bitmap after destroying the thread When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with kasan, we got failure in write_page. The reason for the failure is that md_bitmap_destroy is called before destroying the thread and the thread may be waiting in the function write_page for the bio to complete. When the thread finishes waiting, it executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which triggers the kasan warning. Note that the commit `48df498daf` that caused this bug claims that it is neede for md-cluster, you should check md-cluster and possibly find another bugfix for it. BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod] Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539 CPU: 10 PID: 5539 Comm: mdX_raid1 Not tainted 5.19.0-rc2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x45/0x57a ? __lock_text_start+0x18/0x18 ? write_page+0x18d/0x680 [md_mod] kasan_report+0xa8/0xe0 ? write_page+0x18d/0x680 [md_mod] kasan_check_range+0x13f/0x180 write_page+0x18d/0x680 [md_mod] ? super_sync+0x4d5/0x560 [dm_raid] ? md_bitmap_file_kick+0xa0/0xa0 [md_mod] ? rs_set_dev_and_array_sectors+0x2e0/0x2e0 [dm_raid] ? mutex_trylock+0x120/0x120 ? preempt_count_add+0x6b/0xc0 ? preempt_count_sub+0xf/0xc0 md_update_sb+0x707/0xe40 [md_mod] md_reap_sync_thread+0x1b2/0x4a0 [md_mod] md_check_recovery+0x533/0x960 [md_mod] raid1d+0xc8/0x2a20 [raid1] ? var_wake_function+0xe0/0xe0 ? psi_group_change+0x411/0x500 ? preempt_count_sub+0xf/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? raid1_end_read_request+0x2a0/0x2a0 [raid1] ? preempt_count_sub+0xf/0xc0 ? _raw_spin_unlock_irqrestore+0x19/0x40 ? del_timer_sync+0xa9/0x100 ? try_to_del_timer_sync+0xc0/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? __list_del_entry_valid+0x68/0xa0 ? finish_wait+0xa3/0x100 md_thread+0x161/0x260 [md_mod] ? unregister_md_personality+0xa0/0xa0 [md_mod] ? _raw_spin_lock_irqsave+0x78/0xc0 ? prepare_to_wait_event+0x2c0/0x2c0 ? unregister_md_personality+0xa0/0xa0 [md_mod] kthread+0x148/0x180 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 5522: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x80/0xa0 md_bitmap_create+0xa8/0xe80 [md_mod] md_run+0x777/0x1300 [md_mod] raid_ctr+0x249c/0x4a30 [dm_raid] dm_table_add_target+0x2b0/0x620 [dm_mod] table_load+0x1c8/0x400 [dm_mod] ctl_ioctl+0x29e/0x560 [dm_mod] dm_compat_ctl_ioctl+0x7/0x20 [dm_mod] __do_compat_sys_ioctl+0xfa/0x160 do_syscall_64+0x90/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 5680: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x40 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0xf7/0x140 kfree+0x80/0x240 md_bitmap_free+0x1c3/0x280 [md_mod] __md_stop+0x21/0x120 [md_mod] md_stop+0x9/0x40 [md_mod] raid_dtr+0x1b/0x40 [dm_raid] dm_table_destroy+0x98/0x1e0 [dm_mod] __dm_destroy+0x199/0x360 [dm_mod] dev_remove+0x10c/0x160 [dm_mod] ctl_ioctl+0x29e/0x560 [dm_mod] dm_compat_ctl_ioctl+0x7/0x20 [dm_mod] __do_compat_sys_ioctl+0xfa/0x160 do_syscall_64+0x90/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `48df498daf` ("md: move bitmap_destroy to the beginning of __md_stop") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Christoph Hellwig	34cb92c0a5	md: return the allocated devices from md_alloc Two callers of md_alloc want to use the newly allocated devices, so return it instead of letting them find it cumbersomely after the allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Christoph Hellwig	a110876828	md: open code md_probe in autorun_devices autorun_devices should not be limited to the controls for the legacy probe on open, so just call md_alloc directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Yang Li	c0250d16b2	md: remove unneeded semicolon Eliminate the following coccicheck warning: ./drivers/md/md.c:8208:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Stephen Rothwell	2198c51a08	md: fix build failure for !MODULE After merging the block tree, today's linux-next build (x86_64 allmodconfig) failed like this: drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function] 717 \| static struct mddev *mddev_find(dev_t unit) \| ^~~~~~~~~~ cc1: all warnings being treated as errors Caused by commit 4500d5c17910 ("md: simplify md_open") Make mddev_find() available only for non-modular builds. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220721131132.070be166@canb.auug.org.au Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Jackie Liu	a20d636bee	raid5: fix duplicate checks for rdev->saved_raid_disk 'first' will always be greater than or equal to 0, it is unnecessary to repeat the 0 check, clean it up. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Christoph Hellwig	5b26804bb0	md: simplify md_open Now that devices are on the all_mddevs list until the gendisk is freed, there can't be any duplicates. Remove the global list lookup and just grab a reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:46 -06:00
Christoph Hellwig	12a6caf273	md: only delete entries from all_mddevs when the disk is freed This ensures device names don't get prematurely reused. Instead add a deleted flag to skip already deleted devices in mddev_get and other places that only want to see live mddevs. Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:44 -06:00
Christoph Hellwig	16648bac86	md: stop using for_each_mddev in md_exit Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a reference when we drop the lock and delete the now unused for_each_mddev macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:44 -06:00
Christoph Hellwig	f265143422	md: stop using for_each_mddev in md_notify_reboot Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a reference when we drop the lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:44 -06:00
Christoph Hellwig	b0e706a1ba	md: stop using for_each_mddev in md_do_sync Just do a plain list_for_each that only grabs a mddev reference in the case where the thread sleeps and restarts the list iteration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Christoph Hellwig	2652a1bd2e	md: factor out the rdev overlaps check from rdev_size_store This splits the code into nicely readable chunks and also avoids the refcount inc/dec manipulations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Christoph Hellwig	33b614e334	md: rename md_free to md_kobj_release The md_free name is rather misleading, so pick a better one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Christoph Hellwig	e8c59ac419	md: implement ->free_disk Ensure that all private data is only freed once all accesses are done. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Christoph Hellwig	c57094a6e1	md: fix error handling in md_alloc Error handling in md_alloc is a mess. Untangle it to just free the mddev directly before add_disk is called and thus the gendisk is globally visible. After that clear the hold flag and let the mddev_put take care of cleaning up the mddev through the usual mechanisms. Fixes: `5e55e2f5fc` ("[PATCH] md: convert compile time warnings into runtime warnings") Fixes: `9be68dd7ac` ("md: add error handling support for add_disk()") Fixes: `7ad1069166` ("md: properly unwind when failing to add the kobject in md_alloc") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Christoph Hellwig	ca39f75024	md: fix mddev->kobj lifetime Once a kobject is initialized, the containing object should not be directly freed. So delay initialization until it is added. Also remove the kobject_del call as the last put will remove the kobject as well. The explicitly delete isn't needed here, and dropping it will simplify further fixes. With this md_free now does not need to check that ->gendisk is non-NULL as it is always set by the time that kobject_init is called on mddev->kobj. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Logan Gunthorpe	ee1aa06ba3	md/raid5: Convert prepare_to_wait() to wait_woken() api raid5_get_active_stripe() can sleep in various situations and it is called by make_stripe_request() while inside the prepare_to_wait()/finish_wait() section. Nested waits like this are not supported. This was noticed while making other changes that add different sleeps to raid5_get_active_stripe() that caused a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP. No ill effects have been noticed with the code as is, but theoretically a nested and here could cause a dead lock so it should be fixed. To fix this, convert the prepare_to_wait() call to use wake_woken() which supports nested sleeps. Link: https://lwn.net/Articles/628628/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:43 -06:00
Logan Gunthorpe	b9f91d80de	md/raid5: Fix sectors_to_do bitmap overflow in raid5_make_request() For unaligned IO that have nearly maximum sectors, the number of stripes will end up being one greater than the size of the bitmap. When this happens, the last stripe in the IO will not be processed as it should be, resulting in data corruption. However, this is not normally seen when the backing block devices have 4K physical block sizes since the block layer will split the request before that happens. To fix this increase the bitmap size by one bit and ensure the full number of stripes are checked when calling find_first_bit(). Reported-by: David Sloan <David.Sloan@eideticom.com> Fixes: `7e55c60acf` ("md/raid5: Pivot raid5_make_request()") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:41 -06:00
Coly Li	640c46a21f	bcache: remove EXPERIMENTAL for Kconfig option 'Asynchronous device registration' The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option is for 2+ years, it is used when registration takes too much time for massive amount of cached data, to avoid udev task timeout during boot time. Many users and products enable this Kconfig option for quite long time (e.g. SUSE Linux) and it works as expected and no issue reported. It is time to remove the "EXPERIMENTAL" tag from this Kconfig item. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220719042724.8498-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:22:41 -06:00
Zhang Jiaming	9e26728b5f	md: Fix spelling mistake in comments There are 2 spelling mistakes in comments. Fix it. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:44 -06:00
Logan Gunthorpe	9ad1a74ff0	md/raid5: Increase restriction on max segments per request The block layer defaults the maximum segments to 128, which means requests tend to get split around the 512KB depending on how many pages can be merged. There's no such restriction in the raid5 code so increase the limit to USHRT_MAX so that larger requests can be sent as one. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:43 -06:00
Logan Gunthorpe	df1b620a3e	md/raid5: Improve debug prints Add a debug print for raid5_make_request() so that each request is printed and add the logical sector number to the debug print in __add_stripe_bio(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	7e55c60acf	md/raid5: Pivot raid5_make_request() raid5_make_request() loops through every page in the request, finds the appropriate stripe and adds the bio for that page in the disk. This causes a great deal of contention on the hash_lock and extra work seeing each stripe must be found once for every data disk. The number of times a stripe must be found can be reduced by pivoting raid5_make_request() so that it loops through every stripe and then loops through every disk in that stripe to see if the bio must be added. This reduces the number of times the hash lock must be taken by a factor equal to the number of data disks. To accomplish this, the logical sectors that have already been added must be tracked. Tracking them is done with a bitmap: the bits for all pages are set at the start of the request and each bit is cleared once the bio is added to a stripe. Finding the next sector to be done is then just a call to find_first_bit() so that sectors that have been done can simply be skipped. One minor downside is that the maximum sectors for a request must be limited so that the bitmap can be appropriately sized on the stack. This limit is arbitrarily chosen to be 256 stripe pages which works out to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually restrict the maximum request further seeing the default block queue settings are used which restricts the number of segments to 128 (which results in request sizes that are approximately 512KB). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	486f605586	md/raid5: Check all disks in a stripe_head for reshape progress When testing if a previous stripe has had reshape expand past it, use the earliest or latest logical sector in all the disks for that stripe head. This will allow adding multiple disks at a time in a subesquent patch. To do this cleaner, refactor the check into a helper function called stripe_ahead_of_reshape(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	4ad1d9849f	md/raid5: Refactor add_stripe_bio() Factor out two helper functions from add_stripe_bio(): one to check for overlap (stripe_bio_overlaps()), and one to actually add the bio to the stripe (__add_stripe_bio()). The latter function will always succeed. This will be useful in the next patch so that overlap can be checked for multiple disks before adding any Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	3312e6c887	md/raid5: Keep a reference to last stripe_head for batch When batching, every stripe head has to find the previous stripe head to add to the batch list. This involves taking the hash lock which is highly contended during IO. Instead of finding the previous stripe_head each time, store a reference to the previous stripe_head in a pointer so that it doesn't require taking the contended lock another time. The reference to the previous stripe must be released before scheduling and waiting for work to get done. Otherwise, it can hold up raid5_activate_delayed() and deadlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	0a2d1694de	md/raid5: Refactor for loop in raid5_make_request() into while loop The for loop with retry label can be more cleanly expressed as a while loop by moving the logical_sector increment into the success path. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	4f35456076	md/raid5: Move read_seqcount_begin() into make_stripe_request() Now that prepare_to_wait() isn't in the way, move read_sequcount_begin() into make_stripe_request(). No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	1cdb5b4170	md/raid5: Drop the do_prepare flag in raid5_make_request() prepare_to_wait() can be reasonably called after schedule instead of setting a flag and preparing in the next loop iteration. This means that prepare_to_wait() will be called before read_seqcount_begin(), but there shouldn't be any reason that the order matters here. On the first iteration of the loop prepare_to_wait() is already called first. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	f4aec6a097	md/raid5: Factor out helper from raid5_make_request() loop Factor out the inner loop of raid5_make_request() into it's own helper called make_stripe_request(). The helper returns a number of statuses: SUCCESS, RETRY, SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to understand and allows the SCHEDULE_AND_RETRY path to be made common. A context structure is added to contain do_flush. It will be used more in subsequent patches for state that needs to be kept outside the loop. No functional changes intended. This will be cleaned up further in subsequent patches to untangle the gen_lock and do_prepare logic further. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:42 -06:00
Logan Gunthorpe	1baa1126e0	md/raid5: Move common stripe get code into new find_get_stripe() helper Both uses of find_stripe() require a fairly complicated dance to increment the reference count. Move this into a common find_get_stripe() helper. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	8757fef675	md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() stripe_add_to_batch_list() is better done in the loop in make_request instead of inside add_stripe_bio(). This is clearer and allows for storing the batch_head state outside the loop in a subsequent patch. The call to add_stripe_bio() in retry_aligned_read() is for read and batching only applies to write. So it's impossible for batching to happen at that call site. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	27fb701046	md/raid5: Refactor raid5_make_request loop Break immediately if raid5_get_active_stripe() returns NULL and deindent the rest of the loop. Annotate this check with an unlikely(). This makes the code easier to read and reduces the indentation level. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	a8bb304ca5	md/raid5: Factor out ahead_of_reshape() function There are a few uses of an ugly ternary operator in raid5_make_request() to check if a sector is a head of a reshape sector. Factor this out into a simple helper called ahead_of_reshape(). No functional changes intended. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	6e3f50d30a	md/raid5: Make logic blocking check consistent with logic that blocks The check in raid5_make_request differs very slightly from the logic that causes it to block lower down. This likely does not cause a bug as the check is fuzzy anyway (as reshape may move on between the first check and the subsequent check). However, make it consistent so it can be cleaned up in a subsequent patch. The condition which causes the schedule is: !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress : logical_sector >= conf->reshape_progress) && (mddev->reshape_backwards ? logical_sector < conf->reshape_safe : logical_sector >= conf->reshape_safe) The condition that causes the early bailout is made to match this. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Guoqing Jiang	9dfbdafda3	md: unlock mddev before reap sync_thread in action_store Since the bug which commit `8b48ec23cc` ("md: don't unregister sync_thread with reconfig_mutex held") fixed is related with action_store path, other callers which reap sync_thread didn't need to be changed. Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous bug with belows. 1. unlock mddev before md_reap_sync_thread in action_store. 2. save reshape_position before unlock, then restore it to ensure position not changed accidentally by others. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Chris Webb	05ce7fb946	md: Explicitly create command-line configured devices Boot-time assembly of arrays with md= command-line arguments breaks when CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c calls blkdev_get_by_dev(), assuming this implicitly creates the block device. Fix this by attempting to md_alloc() the array first. As in the probe path, ignore any error as failure is caught by blkdev_get_by_dev() anyway. Signed-off-by: Chris Webb <chris@arachsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	9973f0fa7d	md: Notify sysfs sync_completed in md_reap_sync_thread() The mdadm test 07layouts randomly produces a kernel hung task deadlock. The deadlock is caused by the suspend_lo/suspend_hi files being set by the mdadm background process during reshape and not being cleared because the process hangs. (Leaving aside the issue of the fragility of freezing kernel tasks by buggy userspace processes...) When the background mdadm process hangs it, is waiting (without a timeout) on a change to the sync_completed file signalling that the reshape has completed. The process is woken up a couple times when the reshape finishes but it is woken up before MD_RECOVERY_RUNNING is cleared so sync_completed_show() reports 0 instead of "none". To fix this, notify the sysfs file in md_reap_sync_thread() after MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes it to continue and write to suspend_lo/suspend_hi to allow IO to continue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	b368856aab	md: Ensure resync is reported after it starts The 07layouts test in mdadm fails on some systems. The failure presents itself as the backup file not being removed before the next layout is grown into: mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup: File exists This is because the background mdadm process, which is responsible for cleaning up this backup file gets into an infinite loop waiting for the reshape to start. mdadm checks the mdstat file if a reshape is going and, if it is not, it waits for an event on the file or times out in 5 seconds. On faster machines, the reshape may complete before the 5 seconds times out, and thus the background mdadm process loops waiting for a reshape to start that has already occurred. mdadm reads the mdstat file to start, but mdstat does not report that the reshape has begun, even though it has indeed begun. So the mdstat_wait() call (in mdadm) which polls on the mdstat file won't ever return until timing out. The reason mdstat reports the reshape has started is due to an issue in status_resync(). recovery_active is subtracted from curr_resync which will result in a value of zero for the first chunk of reshaped data, and the resulting read will report no reshape in progress. To fix this, if "resync - recovery_active" is an overloaded value, force the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	eac58d08d4	md: Use enum for overloaded magic numbers used by mddev->curr_resync Comments in the code document special values used for mddev->curr_resync. Make this clearer by using an enum to label these values. The only functional change is a couple places use the wrong comparison operator that implied 3 is another special value. They are all fixed to imply that 3 or greater is an active resync. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:40 -06:00
Logan Gunthorpe	6f28c5c312	md/raid5-cache: Annotate pslot with __rcu notation radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the slot returned and looked up to be marked with __rcu. Otherwise sparse warnings are generated: drivers/md/raid5-cache.c:2939:23: warning: incorrect type in assignment (different address spaces) drivers/md/raid5-cache.c:2939:23: expected void pslot drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:32 -06:00
Logan Gunthorpe	b13015af94	md/raid5-cache: Clear conf->log after finishing work A NULL pointer dereferlence on conf->log is seen randomly with the mdadm test 21raid5cache. Kasan reporst: BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140 Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086 Call Trace: dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load8+0x69/0x90 r5l_reclaimable_space+0xf5/0x140 r5l_do_reclaim+0xf4/0x5e0 r5l_reclaim_thread+0x69/0x3b0 md_thread+0x1a2/0x2c0 kthread+0x177/0x1b0 ret_from_fork+0x22/0x30 This is caused by conf->log being cleared in r5l_exit_log() before stopping the reclaim thread. To fix this, clear conf->log after the reclaim_thread is unregistered and after flushing disable_writeback_work. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	7769085c8d	md/raid5-cache: Drop RCU usage of conf->log The only place that uses RCU to access conf->log is in r5l_log_disk_error(). This function is mostly used in the IO path and once with mddev_lock() held in raid5_change_consistency_policy(). It is known that the IO will be suspended before the log is freed and r5l_log_exit() is called with the mddev_lock() held. This should mean that conf->log can not be freed while the function is being called, so the RCU protection is not necessary. Drop the rcu_read_lock() as well as the synchronize_rcu() and rcu_assign_pointer() usage. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	78ede6a06f	md/raid5-cache: Take mddev_lock in r5c_journal_mode_show() The mddev->lock spinlock doesn't protect against the removal of conf->log in r5l_exit_log() so conf->log may be freed before it is used. To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	c629f345b4	md/raid5: suspend the array for calls to log_exit() The raid5-cache code relies on there being no IO in flight when log_exit() is called. There are two places where this is not guaranteed so add mddev_suspend() and mddev_resume() calls to these sites. The site in raid5_change_consistency_policy() is in the error path, and another similar call site already has suspend/resume calls just below it; so it should be equally safe to make that change here. There is one remaining site in raid5_remove_disk() that we call log_exit() without suspending the array. Unfortunately, as the comment stated, we cannot call mddev_suspend from raid5d. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	e0fccdafc2	md/raid5-ppl: Drop unused argument from ppl_handle_flush_request() ppl_handle_flush_request() takes an struct r5log argument but doesn't use it. It has no buisiness taking this argument as it is only used by raid5-cache and has no way to derference it anyway. Remove the argument. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Logan Gunthorpe	ed0c6a5fbe	md/raid5-log: Drop extern decorators for function prototypes extern is not necessary and recommended against when defining prototype functions in headers. checkpatch.pl complains about these. So remove them. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-08-02 17:14:31 -06:00
Linus Torvalds	6991a564f5	hardening updates for v5.20-rc1 - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi) - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld) - Fix Clang -Wforward warning in LKDTM (Justin Stitt) - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn) - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke) - Clean up warnings and overflow and KASAN tests (Kees Cook) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoEN4WHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJr19D/0fUvHaOui3+ePqKL1CEN2WOYxK Ed/HA0kM7VZnuakS2OoWbHYKurt9wImBkw0EuryNEP4nCBHy5OIyDOmWF7DjWntG 9agKLW5rRgbKe9STbGZpJ92WWosOcJkgkDVES1/NjWt7ujLiefzcZE85hj2Dt1aQ 6nF2LlkdGdtsa07hP5CR5bynQxAAxg1R1pLiJCgZRYn1SEFYtjcnBjUMrPUFJAi2 TNy6ijeG473Oj6V/JiIY88u41KG1fed22SymNj6aQVIjGpH7atn6/ooG076ydAyt QEibSyQP/CwkSbyiqVFOq4v4a+hKEB5j5F+iKZBrCnFWNvt8D3tizBYgm1NymNEZ VBZdg+UhcoVDwiMNzSaAGvt15Qv0INNkQm9PJoeUGSdXz0Yjf4ghOIeaQc3jm6Of tElawmPXxVwRZfNpf5tyPaZFphAPK5EAl35S5mdWinKbAO7Jpz9xqvoyZz9/kygR Kd4qyRPrl0YM8SBKFuYt5rFaYfw9wqF7ox7cMmwR+pbEHt7UDqDvkX2fBbpCyXza 5nJ9PDyvB5SqonIF57RiImXCLKXR6UMJgQvtDGsf+n4hpxL40Nga4pqWY82aQEhj SRdQlkFYhI/izIq1kMJNa8IoOONlJzV6i87D8iOW32bI3/SrykRUTvV3ohpMri7V UFkXzz8pfqfJ4k4zVA== =FRS3 -----END PGP SIGNATURE----- Merge tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi) - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld) - Fix Clang -Wforward warning in LKDTM (Justin Stitt) - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn) - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke) - Clean up warnings and overflow and KASAN tests (Kees Cook) * tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: dm: verity-loadpin: Drop use of dm_table_get_num_targets() kasan: test: Silence GCC 12 warnings drivers: lkdtm: fix clang -Wformat warning x86: mm: refer to the intended config STRICT_DEVMEM in a comment dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation LoadPin: Enable loading from trusted dm-verity devices dm: Add verity helpers for LoadPin stack: Declare {randomize_,}kstack_offset to fix Sparse warnings lib: overflow: Do not define 64-bit tests on 32-bit MAINTAINERS: Add a general "kernel hardening" section usercopy: use unsigned long instead of uintptr_t	2022-08-02 14:38:59 -07:00
Linus Torvalds	8374cfe647	- Refactor DM core's mempool allocation so that it clearer by not being split acorss files. - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling. - Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued. - Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers. - Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller). - Fix DM writecache target's stats that are reported through DM-specific table info. - Fix use-after-free crash in dm_sm_register_threshold_callback(). - Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster. - Fix various KASAN reported bugs in the DM raid target. - Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev. - Fix some code comment typos and fix some Documentation formatting. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmLoAAUACgkQxSPxCi2d A1rFUAf/RnLkzNPS1QJ1uVbiiW64zQUD2o5U8kAliOZxoXQ45U+CgL22a5VaT2R+ rI9+YSg/VX9YkdJwB6I+y7VRWjUCO3NfYOfQUX5NS8GfcPQQn2Zp+jy3t8VnjjXN 5+8Ylu5tX1+QSQiBB9JvIgiK71rSorT6H+KF6bZypL7te5kj39hDaRbmh40VOMOY QWfs8DmyJft9IHPG7Mku/rFJbdVJph3f31IoqUOoGTOKoDPDUDezhCVHi2vpFcV+ S+iCKkeXmP6eUdCnNzBreYPTcP9OtCvkZEPhFg4lh+jyhjyFq1o0B8G5cRo5jvgr GcN1GUT040sSqNXFquA4UU6RGvaxlg== =gBRJ -----END PGP SIGNATURE----- Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Refactor DM core's mempool allocation so that it clearer by not being split acorss files. - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling. - Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued. - Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers. - Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller). - Fix DM writecache target's stats that are reported through DM-specific table info. - Fix use-after-free crash in dm_sm_register_threshold_callback(). - Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster. - Fix various KASAN reported bugs in the DM raid target. - Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev. - Fix some code comment typos and fix some Documentation formatting. * tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits) dm: fix dm-raid crash if md_handle_request() splits bio dm raid: fix address sanitizer warning in raid_resume dm raid: fix address sanitizer warning in raid_status dm: Start pr_preempt from the same starting path dm: Fix PR release handling for non All Registrants dm: Start pr_reserve from the same starting path dm: Allow dm_call_pr to be used for path searches dm: return early from dm_pr_call() if DM device is suspended dm thin: fix use-after-free crash in dm_sm_register_threshold_callback dm writecache: count number of blocks discarded, not number of discard bios dm writecache: count number of blocks written, not number of write bios dm writecache: count number of blocks read, not number of read bios dm writecache: return void from functions dm kcopyd: use __GFP_HIGHMEM when allocating pages dm writecache: set a default MAX_WRITEBACK_JOBS Documentation: dm writecache: Render status list as list Documentation: dm writecache: add blank line before optional parameters dm snapshot: fix typo in snapshot_map() comment dm raid: remove redundant "the" in parse_raid_params() comment dm cache: fix typo in 2 comment blocks ...	2022-08-02 14:21:25 -07:00
Linus Torvalds	c013d0af81	for-5.20/block-2022-07-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR 4N1HEozvIw== =celk -----END PGP SIGNATURE----- Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...	2022-08-02 13:46:35 -07:00
Matthias Kaehlcke	27603a606f	dm: verity-loadpin: Drop use of dm_table_get_num_targets() Commit `2aec377a29` ("dm table: remove dm_table_get_num_targets() wrapper") in linux-dm/for-next removed the function dm_table_get_num_targets() which is used by verity-loadpin. Access table->num_targets directly instead of using the defunct wrapper. Fixes: `b6c1c5745c` ("dm: Add verity helpers for LoadPin") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220728085412.1.I242d21b378410eb6f9897a3160efb56e5608c59d@changeid	2022-07-28 21:48:12 -07:00
Nathan Huckleberry	b32d45824a	dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag Add an optional flag that ensures dm_bufio_client does not sleep (primary focus is to service dm_bufio_get without sleeping). This allows the dm-bufio cache to be queried from interrupt context. To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock instead of a mutex. Additionally, to avoid deadlocks, special care must be taken so that dm-bufio does not sleep while holding the spinlock. But again: the scope of this no_sleep is initially confined to dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to avoid sleeping because __bufio_new avoids allocation for NF_GET. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:46:14 -04:00
Nathan Huckleberry	0fcb100d50	dm bufio: Add flags argument to dm_bufio_client_create Add a flags argument to dm_bufio_client_create and update all the callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:46:14 -04:00
Mike Snitzer	9dd1cd3220	dm: fix dm-raid crash if md_handle_request() splits bio Commit `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") introduced the optimization to _not_ perform bio_associate_blkg()'s relatively costly work when DM core clones its bio. But in doing so it exposed the possibility for DM's cloned bio to alter DM target behavior (e.g. crash) if a target were to issue IO without first calling bio_set_dev(). The DM raid target can trigger an MD crash due to its need to split the DM bio that is passed to md_handle_request(). The split will recurse to submit_bio_noacct() using a bio with an uninitialized ->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to dereference a NULL blkg_to_tg(bio->bi_blkg). Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that will make alloc_tio() call bio_set_dev() on behalf of the target. dm-raid is the only target that requires this flag. bio_set_dev() initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg, before passing the bio to md_handle_request(). Long-term fix would be to audit and refactor MD code to rely on DM to split its bio, using dm_accept_partial_bio(), but there are MD raid personalities (e.g. raid1 and raid10) whose implementation are tightly coupled to handling the bio splitting inline. Fixes: `ca522482e3` ("dm: pass NULL bdev to bio_alloc_clone") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:36:30 -04:00
Mikulas Patocka	7dad24db59	dm raid: fix address sanitizer warning in raid_resume There is a KASAN warning in raid_resume when running the lvm test lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks is greater than rs->raid_disks, so the loop touches one entry beyond the allocated length. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mikulas Patocka	1fbeea217d	dm raid: fix address sanitizer warning in raid_status There is this warning when using a kernel with the address sanitizer and running this testsuite: https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid ================================================================== BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid] Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319 CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6a/0x9c print_address_description.constprop.0+0x1f/0x1e0 print_report.cold+0x55/0x244 kasan_report+0xc9/0x100 raid_status+0x1747/0x2820 [dm_raid] dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod] table_load+0x35c/0x630 [dm_mod] ctl_ioctl+0x411/0x630 [dm_mod] dm_ctl_ioctl+0xa/0x10 [dm_mod] __x64_sys_ioctl+0x12a/0x1a0 do_syscall_64+0x5b/0x80 The warning is caused by reading conf->max_nr_stripes in raid_status. The code in raid_status reads mddev->private, casts it to struct r5conf and reads the entry max_nr_stripes. However, if we have different raid type than 4/5/6, mddev->private doesn't point to struct r5conf; it may point to struct r0conf, struct r1conf, struct r10conf or struct mpconf. If we cast a pointer to one of these structs to struct r5conf, we will be reading invalid memory and KASAN warns about it. Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	c6adada5b7	dm: Start pr_preempt from the same starting path pr_preempt has a similar issue as reserve where for all the reservation types except the All Registrants ones the preempt can create a reservation. And a follow up reservation or release needs to go down the same path the preempt did. This has the pr_preempt work like reserve and release where we always start from the first path in the first group. This commit has been tested with windows failover clustering's validation test and libiscsi's PGR tests to check for regressions. They both don't have tests to verify this case, so I tested it manually. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	08a3c338e0	dm: Fix PR release handling for non All Registrants This commit fixes a bug where we are leaving the reservation in place even though pr_release has run and returned success. If we have a Write Exclusive, Exclusive Access, or Write/Exclusive Registrants only reservation, the release must be sent down the path that is the reservation holder. The problem is multipath_prepare_ioctl most likely selected path N for the reservation, then later when we do the release multipath_prepare_ioctl will select a completely different path. The device will then return success becuase the nvme and scsi specs say to return success if there is no reservation or if the release is sent down from a path that is not the holder. We then think we have released the reservation. This commit has us loop over each path and send a release so we can make sure the release is executed on the correct path. It has been tested with windows failover clustering's validation test which checks this case, and it has been tested manually (the libiscsi PGR tests don't have a test case for this yet, but I will be adding one). Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	7015108759	dm: Start pr_reserve from the same starting path When an app does a pr_reserve it will go to whatever path we happen to be using at the time. This can result in errors when the app does a second pr_reserve call and expects success but gets a failure because the reserve is not done on the holder's path. This commit has us always start trying to do reserves from the first path in the first group. Windows failover clustering will produce the type of pattern above. With this commit, we will now pass its validation test for this case. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Christie	8dd87f3c52	dm: Allow dm_call_pr to be used for path searches The specs state that if you send a reserve down a path that is already the holder success must be returned and if it goes down a path that is not the holder reservation conflict must be returned. Windows failover clustering will send a second reservation and expects that a device returns success. The problem for multipathing is that for an All Registrants reservation, we can send the reserve down any path but for all other reservation types there is one path that is the holder. To handle this we could add PR state to dm but that can get nasty. Look at target_core_pr.c for an example of the type of things we'd have to track. It will also get more complicated because other initiators can change the state so we will have to add in async event/sense handling. This commit, and the 3 commits that follow, tries to keep dm simple and keep just doing passthrough. This commit modifies dm_call_pr to be able to find the first usable path that can execute our pr_op then return. When dm_pr_reserve is converted to dm_call_pr in the next commit for the normal case we will use the same path for every reserve. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Mike Snitzer	e120a5f1e7	dm: return early from dm_pr_call() if DM device is suspended Otherwise PR ops may be issued while the broader DM device is being reconfigured, etc. Fixes: `9c72bad1f3` ("dm: call PR reserve/unreserve on each underlying device") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-28 17:29:56 -04:00
Linus Torvalds	d945404f74	block-5.19-2022-07-21 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLaEtsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjuZD/4joqo0PVz+O5MUadOmFxQ+Js2RCzxzDYwN rfTlzsS2ybcoG2ehs3TPGU2c3Q+vODcgFOLJtrKPsyxEZM49uuu7+romi9/jxW4P Pju8DmA0037elEGmml4i3EiMoIGyz1ScXFpR2Ii+LLTD6y7hNxHYxC/6PgWWaO59 gIydFz9PenJmz+orlpRI08vkvifdaCCdFfqgnP+KtvJsee2FPUdg2xpA021a1Tzj 7u1qPS1JnylQP1VQypqXnZTiHzmzmkQL4GNQIo/pje/RBstTFb3HOPXEoeVvQKrl WjNntc80LZvYCMBzcksDkGAT8YLxXOckKvDiaYGObIoEoh/K6E7CBoIo4Smu3mwr NqoaN3jZXnfTtZ40VEnxI6HnIbXG8Jc96P0/7Us3lL/W/wP0fD4fqNlumbkeMW2d ualU91J6+YTGrYYWClSTSNyWQF/MNr1XRL9LQiO4GDs0fkIL1WoxaY/+AnUp0Jcc PF8tm83852PcP+YJOOYgXD04gBmg2JCOrdarE87x+fIKp2Z7V4HPeR/hH5iOu9F3 CbXaBTICJlo7jmVUMKMh5yXvTyx4TqDvpWGbpQweXG4t8c77TPMI2cIHVdj1yqgj L015BORXfsWnI0yX4IA32RqcoNwZHrrb930OxGDvUlxTrjfodlaLypYJ1wDKwRMi H8LTxE6igA== =zviX -----END PGP SIGNATURE----- Merge tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-block Pull block fix from Jens Axboe: "Just a single fix for missing error propagation for an allocation failure in raid5" * tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-block: md/raid5: missing error code in setup_conf()	2022-07-22 12:41:14 -07:00
Dan Carpenter	5f7ef4875f	md/raid5: missing error code in setup_conf() Return -ENOMEM if the allocation fails. Don't return success. Fixes: `8fbcba6b99` ("md/raid5: Cleanup setup_conf() error returns") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-07-19 10:58:33 -07:00
Shiyang Ruan	8012b86608	dax: introduce holder for dax_device Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() \|* fsdax case \|------------ \|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() \| dax_holder_notify_failure() => \| dax_device->holder_ops->notify_failure() => \| - xfs_dax_notify_failure() \| \|* xfs_dax_notify_failure() \| \|-------------------------- \| \| xfs_rmap_query_range() \| \| xfs_dax_failure_fn() \| \| * corrupted on metadata \| \| try to recover data, call xfs_force_shutdown() \| \| * corrupted on file data \| \| try to recover data, call mf_dax_kill_procs() \|* normal case \|------------- \|mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.wiliams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:30 -07:00
Luo Meng	3534e5a5ed	dm thin: fix use-after-free crash in dm_sm_register_threshold_callback Fault inject on pool metadata device reports: BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80 Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950 CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_address_description.constprop.0.cold+0xeb/0x3f4 kasan_report.cold+0xe6/0x147 dm_pool_register_metadata_threshold+0x40/0x80 pool_ctr+0xa0a/0x1150 dm_table_add_target+0x2c8/0x640 table_load+0x1fd/0x430 ctl_ioctl+0x2c4/0x5a0 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb3/0xd0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This can be easily reproduced using: echo offline > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10 dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0" If a metadata commit fails, the transaction will be aborted and the metadata space maps will be destroyed. If a DM table reload then happens for this failed thin-pool, a use-after-free will occur in dm_sm_register_threshold_callback (called from dm_pool_register_metadata_threshold). Fix this by in dm_pool_register_metadata_threshold() by returning the -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr() with a new error message: "Error registering metadata threshold". Fixes: `ac8c3f3df6` ("dm thin: generate event when metadata threshold passed") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-15 18:09:14 -04:00
Mikulas Patocka	2ee73ef60d	dm writecache: count number of blocks discarded, not number of discard bios Change dm-writecache, so that it counts the number of blocks discarded instead of the number of discard bios. Make it consistent with the read and write statistics counters that were changed to count the number of blocks instead of bios. Fixes: `e3a35d0340` ("dm writecache: add event counters") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:54:46 -04:00
Mikulas Patocka	b2676e1482	dm writecache: count number of blocks written, not number of write bios Change dm-writecache, so that it counts the number of blocks written instead of the number of write bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:53:25 -04:00
Mikulas Patocka	2c6e755b49	dm writecache: count number of blocks read, not number of read bios Change dm-writecache, so that it counts the number of blocks read instead of the number of read bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:32 -04:00
Mikulas Patocka	9bc0c92e4b	dm writecache: return void from functions The functions writecache_map_remap_origin and writecache_bio_copy_ssd only return a single value, thus they can be made to return void. This helps simplify the following IO accounting changes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Mikulas Patocka	949d49ec30	dm kcopyd: use __GFP_HIGHMEM when allocating pages dm-kcopyd doesn't access the allocated pages directly, it only passes them to dm-io which adds them to a bio list - thus, we can allocate the pages from high memory. This will reduce pressure on the low memory when there are a large number of kcopyd jobs in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Mikulas Patocka	ca7dc242e3	dm writecache: set a default MAX_WRITEBACK_JOBS dm-writecache has the capability to limit the number of writeback jobs in progress. However, this feature was off by default. As such there were some out-of-memory crashes observed when lowering the low watermark while the cache is full. This commit enables writeback limit by default. It is set to 256MiB or 1/16 of total system memory, whichever is smaller. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:31 -04:00
Bart Van Assche	1420c4a549	fs/buffer: Combine two submit_bh() and ll_rw_block() arguments Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Bart Van Assche	a9010741ce	md/raid5: Use the enum req_op and blk_opf_t types Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-37-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00
Bart Van Assche	cb1802ff82	md/raid10: Use the new blk_opf_t type Improve static type checking by using the new blk_opf_t type for variables that represent a request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-36-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-14 12:14:32 -06:00

... 2 3 4 5 6 ...

7488 Commits