Commit Graph

1185650 Commits

Author SHA1 Message Date
Jan Kara abcc0cbd49 bcache: Alloc holder object before async registration
Allocate holder object (cache or cached_dev) before offloading the
rest of the startup to async work. This will allow us to open the block
block device with proper holder.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230622164658.12861-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 16:10:04 -06:00
Jens Axboe c36591f682 Merge tag 'md-next-20230623' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block-late
Pull MD fixes from Song.

* tag 'md-next-20230623' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  raid10: avoid spin_lock from fastpath from raid10_unplug()
  md: fix 'delete_mutex' deadlock
  md: use mddev->external to select holder in export_rdev()
  md/raid1-10: fix casting from randomized structure in raid1_submit_write()
  md/raid10: fix the condition to call bio_end_io_acct()
2023-06-23 11:59:05 -06:00
Yu Kuai a8d5fdd4d2 raid10: avoid spin_lock from fastpath from raid10_unplug()
Commit 0c0be98bbe ("md/raid10: prevent unnecessary calls to wake_up()
in fast path") missed one place, for example, with:

	fio -direct=1 -rw=write/randwrite -iodepth=1 ...

Plug and unplug are called for each io, then wake_up() from raid10_unplug()
will cause lock contention as well.

Avoid this contention by using wake_up_barrier() instead of wake_up(),
where spin_lock is not held if waitqueue is empty.

Fio test script:

[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]

Test result with ramdisk raid10(By Ali):

	Before this patch	With this patch
READ	IOPS=2033k		IOPS=3642k
WRITE	IOPS=871k		IOPS=1561K

By the way, in this scenario, blk_plug_cb() will be allocated and freed
for each io, this seems need to be optimized as well.

Reported-and-tested-by: Ali Gholami Rudi <aligrudi@gmail.com>
Closes: https://lore.kernel.org/all/20231606122233@laper.mirepesht/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
2023-06-23 09:41:50 -07:00
Yu Kuai 4934b6401a md: fix 'delete_mutex' deadlock
Commit 3ce94ce5d0 ("md: fix duplicate filename for rdev") introduce a
new lock 'delete_mutex', and trigger a new deadlock:

t1: remove rdev			t2: sysfs writer

rdev_attr_store			rdev_attr_store
 mddev_lock
 state_store
 md_kick_rdev_from_array
  lock delete_mutex
  list_add mddev->deleting
  unlock delete_mutex
 mddev_unlock
				 mddev_lock
				 ...
  lock delete_mutex
  kobject_del
  // wait for sysfs writers to be done
				 mddev_unlock
				 lock delete_mutex
				 // wait for delete_mutex, deadlock

'delete_mutex' is used to protect the list 'mddev->deleting', turns out
that this list can be protected by 'reconfig_mutex' directly, and this
lock can be removed.

Fix this problem by removing the lock, and use 'reconfig_mutex' to
protect the list. mddev_unlock() will move this list to a local list to
be handled after 'reconfig_mutex' is dropped.

Fixes: 3ce94ce5d0 ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com
2023-06-23 09:41:47 -07:00
Song Liu a1d7671910 md: use mddev->external to select holder in export_rdev()
mdadm test "10ddf-create-fail-rebuild" triggers warnings like the following

[  215.526357] ------------[ cut here ]------------
[  215.527243] WARNING: CPU: 18 PID: 1264 at block/bdev.c:617 blkdev_put+0x269/0x350
[  215.528334] Modules linked in:
[  215.528806] CPU: 18 PID: 1264 Comm: mdmon Not tainted 6.4.0-rc2+ #768
[  215.529863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[  215.531464] RIP: 0010:blkdev_put+0x269/0x350
[  215.532167] Code: ff ff 49 8d 7d 10 e8 56 bf b8 ff 4d 8b 65 10 49 8d bc
24 58 05 00 00 e8 05 be b8 ff 41 83 ac 24 58 05 00 00 01 e9 44 ff ff ff
<0f> 0b e9 52 fe ff ff 0f 0b e9 6b fe ff ff1
[  215.534780] RSP: 0018:ffffc900040bfbf0 EFLAGS: 00010283
[  215.535635] RAX: ffff888174001000 RBX: ffff88810b1c3b00 RCX: ffffffff819a4061
[  215.536645] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88810b1c3ba0
[  215.537657] RBP: ffff88810dbde800 R08: fffffbfff0fca983 R09: fffffbfff0fca983
[  215.538674] R10: ffffc900040bfbf0 R11: fffffbfff0fca982 R12: ffff88810b1c3b38
[  215.539687] R13: ffff88810b1c3b10 R14: ffff88810dbdecb8 R15: ffff88810b1c3b00
[  215.540833] FS:  00007f2aabdff700(0000) GS:ffff888dfb400000(0000) knlGS:0000000000000000
[  215.541961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  215.542775] CR2: 00007fa19a85d934 CR3: 000000010c076006 CR4: 0000000000370ee0
[  215.543814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  215.544840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  215.545885] Call Trace:
[  215.546257]  <TASK>
[  215.546608]  export_rdev.isra.63+0x71/0xe0
[  215.547338]  mddev_unlock+0x1b1/0x2d0
[  215.547898]  array_state_store+0x28d/0x450
[  215.548519]  md_attr_store+0xd7/0x150
[  215.549059]  ? __pfx_sysfs_kf_write+0x10/0x10
[  215.549702]  kernfs_fop_write_iter+0x1b9/0x260
[  215.550351]  vfs_write+0x491/0x760
[  215.550863]  ? __pfx_vfs_write+0x10/0x10
[  215.551445]  ? __fget_files+0x156/0x230
[  215.552053]  ksys_write+0xc0/0x160
[  215.552570]  ? __pfx_ksys_write+0x10/0x10
[  215.553141]  ? ktime_get_coarse_real_ts64+0xec/0x100
[  215.553878]  do_syscall_64+0x3a/0x90
[  215.554403]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  215.555125] RIP: 0033:0x7f2aade11847
[  215.555696] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec
10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05
<48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 448
[  215.558398] RSP: 002b:00007f2aabdfeba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[  215.559516] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2aade11847
[  215.560515] RDX: 0000000000000005 RSI: 0000000000438b8b RDI: 0000000000000010
[  215.561512] RBP: 0000000000438b8b R08: 0000000000000000 R09: 00007f2aaecf0060
[  215.562511] R10: 000000000e3ba40b R11: 0000000000000293 R12: 0000000000000005
[  215.563647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000c70750
[  215.564693]  </TASK>
[  215.565029] irq event stamp: 15979
[  215.565584] hardirqs last  enabled at (15991): [<ffffffff811a7432>] __up_console_sem+0x52/0x60
[  215.566806] hardirqs last disabled at (16000): [<ffffffff811a7417>] __up_console_sem+0x37/0x60
[  215.568022] softirqs last  enabled at (15716): [<ffffffff8277a2db>] __do_softirq+0x3eb/0x531
[  215.569239] softirqs last disabled at (15711): [<ffffffff810d8f45>] irq_exit_rcu+0x115/0x160
[  215.570434] ---[ end trace 0000000000000000 ]---

This means export_rdev() calls blkdev_put with a different holder than the
one used by blkdev_get_by_dev(). This is because mddev->major_version == -2
is not a good check for external metadata. Fix this by using
mddev->external instead.

Also, do not clear mddev->external in md_clean(), as the flag might be used
later in export_rdev().

Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230617052405.305871-1-song@kernel.org
2023-06-23 09:39:00 -07:00
Yu Kuai b5a99602b7 md/raid1-10: fix casting from randomized structure in raid1_submit_write()
Following build error triggered while build with clang version 17.0.0
with W=1(this can't be reporduced with gcc 13.1.0):

drivers/md/raid1-10.c:117:25: error: casting from randomized structure
pointer type 'struct block_device *' to 'struct md_rdev *'
     117 |         struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
         |                                ^

Fix this by casting 'bio->bi_bdev' to 'void *', as it used to be.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306142042.fmjfmTF8-lkp@intel.com/
Fixes: 8295efbe68 ("md/raid1-10: factor out a helper to submit normal write")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230616012136.3047071-1-yukuai1@huaweicloud.com
2023-06-23 09:33:16 -07:00
Li Nan 125bfc7cd7 md/raid10: fix the condition to call bio_end_io_acct()
/sys/block/[device]/queue/iostats is used to control whether to count io
stat. Write 0 to it will clear queue_flags QUEUE_FLAG_IO_STAT which means
iostats is disabled. If we disable iostats and later endable it, the io
issued during this period will be counted incorrectly, inflight will be
decreased to -1.

  //T1 set iostats
  echo 0 > /sys/block/md0/queue/iostats
   clear QUEUE_FLAG_IO_STAT

			//T2 issue io
			if (QUEUE_FLAG_IO_STAT) -> false
			 bio_start_io_acct
			  inflight++

  echo 1 > /sys/block/md0/queue/iostats
   set QUEUE_FLAG_IO_STAT

					//T3 io end
					if (QUEUE_FLAG_IO_STAT) -> true
					 bio_end_io_acct
					  inflight--	-> -1

Also, if iostats is enabled while issuing io but disabled while io end,
inflight will never be decreased.

Fix it by checking start_time when io end. If start_time is not 0, call
bio_end_io_acct().

Fixes: 528bc2cf2f ("md/raid10: enable io accounting")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230609094320.2397604-1-linan666@huaweicloud.com
2023-06-23 09:33:16 -07:00
Yu Kuai fcaa174a9c scsi/sg: don't grab scsi host module reference
In order to prevent request_queue to be freed before cleaning up
blktrace debugfs entries, commit db59133e92 ("scsi: sg: fix blktrace
debugfs entries leakage") use scsi_device_get(), however,
scsi_device_get() will also grab scsi module reference and scsi module
can't be removed.

It's reported that blktests can't unload scsi_debug after block/001:

blktests (master) # ./check block
block/001 (stress device hotplugging) [failed]
     +++ /root/blktests/results/nodev/block/001.out.bad 2023-06-19
      Running block/001
      Stressing sd
     +modprobe: FATAL: Module scsi_debug is in use.

Fix this problem by grabbing request_queue reference directly, so that
scsi host module can still be unloaded while request_queue will be
pinged by sg device.

Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Link: https://lore.kernel.org/all/1760da91-876d-fc9c-ab51-999a6f66ad50@nvidia.com/
Fixes: db59133e92 ("scsi: sg: fix blktrace debugfs entries leakage")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230621160111.1433521-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:28:18 -06:00
Jan Kara a42fb5a75c ext4: Fix warning in blkdev_put()
ext4_blkdev_remove() passes a wrong holder pointer to blkdev_put() which
triggers a warning there. Fix it.

Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230622165107.13687-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:14:41 -06:00
Christoph Hellwig 648fa60fa7 block: don't return -EINVAL for not found names in devt_from_devname
When we didn't find a device and didn't guess it might be a partition,
it might still show up later, so don't disable rootwait for it by
returning -EINVAL.

Fixes: 079caa35f7 ("init: clear root_wait on all invalid root= strings")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230622150644.600327-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-22 09:09:33 -06:00
Jordy Zomer 8270cb10c0 cdrom: Fix spectre-v1 gadget
This patch fixes a spectre-v1 gadget in cdrom.
The gadget could be triggered by speculatively
bypassing the cdi->capacity check.

Signed-off-by: Jordy Zomer <jordyzomer@google.com>
Link: https://lore.kernel.org/all/20230612110040.849318-2-jordyzomer@google.com
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/all/ZI1+1OG9Ut1MqsUC@equinox
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20230617113828.1230-2-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-22 07:48:08 -06:00
Bart Van Assche 017fb83ee0 block: Improve kernel-doc headers
Fix the documentation of the devt_from_partuuid() return value.

Fix the following two recently introduced kernel-doc warnings:

block/bdev.c:570: warning: Function parameter or member 'hops' not described in 'bd_finish_claiming'
block/early-lookup.c:46: warning: Function parameter or member 'devt' not described in 'devt_from_partuuid'

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 0718afd47f ("block: introduce holder ops")
Fixes: cf056a4312 ("init: improve the name_to_dev_t interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230621165054.743815-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 13:18:15 -06:00
Ming Lei 2293cae703 blk-mq: don't insert passthrough request into sw queue
In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue()
may just check if scheduler queue has request to dispatch, see
__blk_mq_sched_dispatch_requests(). Then IO hang may be caused because
all passthorugh requests may stay in sw queue.

And any passthrough request should have been inserted to hctx->dispatch
always.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: d97217e7f0 ("blk-mq: don't queue plugged passthrough requests into scheduler")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:48:34 -06:00
Ivan Orlov 72ef02b8df bsg: make bsg_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the bsg_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-scsi@vger.kernel.org
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-8-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:46:17 -06:00
Ivan Orlov 2eefd399d2 ublk: make ublk_chr_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the ublk_chr_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-7-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:46:13 -06:00
Ivan Orlov 65d7a37d4e aoe: make aoe_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the aoe_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Justin Sanders <justin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-6-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:45:19 -06:00
Ivan Orlov 137380c0ec block/rnbd: make all 'class' structures const
Now that the driver core allows for struct class to be in read-only
memory, making all 'class' structures to be declared at build time
placing them into read-only memory, instead of having to be dynamically
allocated at load time.

Cc: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230620180129.645646-5-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:45:19 -06:00
Christoph Hellwig 56e71bdf32 block: fix the exclusive open mask in disk_scan_partitions
FMODE_EXEC has nothing to do with exclusive opens, and even is of
the wrong type.  We need to check for BLK_OPEN_EXCL here.

Fixes: 985958b858 ("block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230621124914.185992-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:37:52 -06:00
Michael Schmitz b6f3f28f60 block: add overflow checks for Amiga partition support
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.

Use u64 as type for sector address and size to allow using disks up to
2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD
format allows to specify disk sizes up to 2^128 bytes (though native
OS limitations reduce this somewhat, to max 2^68 bytes), so check for
u64 overflow carefully to protect against overflowing sector_t.

Bail out if sector addresses overflow 32 bits on kernels without LBD
support.

This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted (now resubmitted as patch 1 in this series).
This patch adds additional error checking and warning messages.

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/20230620201725.7020-4-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 14:28:17 -06:00
Michael Schmitz 95a55437dc block: change all __u32 annotations to __be32 in affs_hardblocks.h
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.

Use u64 as type for sector address and size to allow using disks up to
2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD
format allows to specify disk sizes up to 2^128 bytes (though native
OS limitations reduce this somewhat, to max 2^68 bytes), so check for
u64 overflow carefully to protect against overflowing sector_t.

This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted (now resubmitted as patch 1 of this series).

Patch 3 (this series) adds additional error checking and warning
messages. One of the error checks now makes use of the previously
unused rdb_CylBlocks field, which causes a 'sparse' warning
(cast to restricted __be32).

Annotate all 32 bit fields in affs_hardblocks.h as __be32, as the
on-disk format of RDB and partition blocks is always big endian.

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20230620201725.7020-3-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 14:28:17 -06:00
Michael Schmitz fc3d092c6b block: fix signed int overflow in Amiga partition support
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.

Use sector_t as type for sector address and size to allow using disks
up to 2 TB without LBD support, and disks larger than 2 TB with LBD.

This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted. This patch differs from Joanne's patch only in its use of
sector_t instead of unsigned int. No checking for overflows is done
(see patch 3 of this series for that).

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620201725.7020-2-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 14:28:17 -06:00
Min Li 6d4e80db4e block: add capacity validation in bdev_add_partition()
In the function bdev_add_partition(),there is no check that the start
and end sectors exceed the size of the disk before calling add_partition.
When we call the block's ioctl interface directly to add a partition,
and the capacity of the disk is set to 0 by driver,the command will
continue to execute.

Signed-off-by: Min Li <min15.li@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230619091214.31615-1-min15.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:50:51 -06:00
Jingbo Xu 9a72a02456 block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
Allow of unprivileged Persistent Reservation operations on devices
if the write permission check on the device node has passed.

brw-rw---- 1 root disk 259, 0 Jun 13 07:09 /dev/nvme0n1

In the example above, the "disk" group of nvme0n1 is also allowed to
make reservations on the device even without CAP_SYS_ADMIN.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613084008.93795-3-jefflexu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:49:23 -06:00
Jingbo Xu 1262962166 block: disallow Persistent Reservation on partitions
Refuse Persistent Reservation operations on partitions as reservation
on partitions doesn't make sense.

Besides, introduce blkdev_pr_allowed() helper, where more policies could
be placed here later.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:49:23 -06:00
Yu Kuai c576c4bf9e reiserfs: fix blkdev_put() warning from release_journal_dev()
In journal_init_dev(), if super bdev is used as 'j_dev_bd', then
blkdev_get_by_dev() is called with NULL holder, otherwise, holder will
be journal. However, later in release_journal_dev(), blkdev_put() is
called with journal unconditionally, cause following warning:

WARNING: CPU: 1 PID: 5034 at block/bdev.c:617 bd_end_claim block/bdev.c:617 [inline]
WARNING: CPU: 1 PID: 5034 at block/bdev.c:617 blkdev_put+0x562/0x8a0 block/bdev.c:901
RIP: 0010:blkdev_put+0x562/0x8a0 block/bdev.c:901
Call Trace:
 <TASK>
 release_journal_dev fs/reiserfs/journal.c:2592 [inline]
 free_journal_ram+0x421/0x5c0 fs/reiserfs/journal.c:1896
 do_journal_release fs/reiserfs/journal.c:1960 [inline]
 journal_release+0x276/0x630 fs/reiserfs/journal.c:1971
 reiserfs_put_super+0xe4/0x5c0 fs/reiserfs/super.c:616
 generic_shutdown_super+0x158/0x480 fs/super.c:499
 kill_block_super+0x64/0xb0 fs/super.c:1422
 deactivate_locked_super+0x98/0x160 fs/super.c:330
 deactivate_super+0xb1/0xd0 fs/super.c:361
 cleanup_mnt+0x2ae/0x3d0 fs/namespace.c:1247
 task_work_run+0x16f/0x270 kernel/task_work.c:179
 exit_task_work include/linux/task_work.h:38 [inline]
 do_exit+0xadc/0x2a30 kernel/exit.c:874
 do_group_exit+0xd4/0x2a0 kernel/exit.c:1024
 __do_sys_exit_group kernel/exit.c:1035 [inline]
 __se_sys_exit_group kernel/exit.c:1033 [inline]
 __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1033
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix this problem by passing in NULL holder in this case.

Reported-by: syzbot+04625c80899f4555de39@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=04625c80899f4555de39
Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620111322.1014775-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:19:15 -06:00
Yu Kuai 985958b858 block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
After commit 2736e8eeb0 ("block: use the holder as indication for
exclusive opens"), blkdev_get_by_dev() will warn if holder is NULL and
mode contains 'FMODE_EXCL'.

holder from blkdev_get_by_dev() from disk_scan_partitions() is always NULL,
hence it should not use 'FMODE_EXCL', which is broben by the commit. For
consequence, WARN_ON_ONCE() will be triggered from blkdev_get_by_dev()
if user scan partitions with device opened exclusively.

Fix this problem by removing 'FMODE_EXCL' from disk_scan_partitions(),
as it used to be.

Reported-by: syzbot+00cd27751f78817f167b@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=00cd27751f78817f167b
Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230618140402.7556-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:17:46 -06:00
Christoph Hellwig e89e001f24 block: document the holder argument to blkdev_get_by_path
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620043536.707249-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:38 -06:00
Demi Marie Obenour b90ecc0379 block: increment diskseq on all media change events
Currently, associating a loop device with a different file descriptor
does not increment its diskseq.  This allows the following race
condition:

1. Program X opens a loop device
2. Program X gets the diskseq of the loop device.
3. Program X associates a file with the loop device.
4. Program X passes the loop device major, minor, and diskseq to
   something.
5. Program X exits.
6. Program Y detaches the file from the loop device.
7. Program Y attaches a different file to the loop device.
8. The opener finally gets around to opening the loop device and checks
   that the diskseq is what it expects it to be.  Even though the
   diskseq is the expected value, the result is that the opener is
   accessing the wrong file.

From discussions with Christoph Hellwig, it appears that
disk_force_media_change() was supposed to call inc_diskseq(), but in
fact it does not.  Adding a Fixes: tag to indicate this.  Christoph's
Reported-by is because he stated that disk_force_media_change()
calls inc_diskseq(), which is what led me to discover that it should but
does not.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: e6138dc12d ("block: add a helper to raise a media changed event")
Cc: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:24 -06:00
Christoph Hellwig 9a7933f3ac swim: fix a missing FMODE_ -> BLK_OPEN_ conversion in floppy_open
Fix a missing conversion to the new BLK_OPEN constant in swim.

Fixes: 05bdb99653 ("block: replace fmode_t with a block-specific type for block open flags")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620043051.707196-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:04 -06:00
Ming Lei 245165658e blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none
After grabbing q->sysfs_lock, q->elevator may become NULL because of
elevator switch.

Fix the NULL dereference on q->elevator by checking it with lock.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:12:25 -06:00
Christoph Hellwig 84bd06c632 iov_iter: remove iov_iter_get_pages and iov_iter_get_pages_alloc
Now that the direct I/O helpers have switched to use
iov_iter_extract_pages, these helpers are unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:09 -06:00
Christoph Hellwig e4cc64657b block: remove BIO_PAGE_REFFED
Now that all block direct I/O helpers use page pinning, this flag is
unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:09 -06:00
Christoph Hellwig 2e82f6c3bf splice: simplify a conditional in copy_splice_read
Check for -EFAULT instead of wrapping the check in an ret < 0 block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
Christoph Hellwig 0b24be4691 splice: don't call file_accessed in copy_splice_read
copy_splice_read calls into ->read_iter to read the data, which already
calls file_accessed.

Fixes: 33b3b04154 ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
Jens Axboe 236f255296 nvme updates for Linux 6.5
- Various cleanups all around (Irvin, Chaitanya, Christophe)
  - Better struct packing (Christophe JAILLET)
  - Reduce controller error logs for optional commands (Keith)
  - Support for >=64KiB block sizes (Daniel Gomez)
  - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmSMgAcACgkQPe3zGtjz
 Rgn0jhAAjhZrLdxsKd+mp4a9ywrg+P04sjYfxwfEu8A+J+OMDlgQXFrNzKGS5Riy
 WRF674sJih2KSjfuG8m0J1piRGfKph2+YjAsiwl2mxBuwZi8ckUU7QiOKYa4HudE
 DHOW9x7+mB+liGF4LvhWVBgjm0lWST7SQjGHwABToorDQ/10H6lyQR6CJHpyOwwQ
 LYVu19aU2mAnlMuExW1OpiE+mLD1QLmElAorYc+32hKCmeY7fOgheJk9gx4sx0lm
 bSrlinCLOJX3V4Ny7OIpYOH3KVFGBmxRQtrmnTyG6oSa3JCp8yTe5ytc9zhGCnli
 cNobu681XB6api7JMaPKmJk/vNfSxFlw9nndaX9u3PGLCv7RumUkMZm4eXtYmaBN
 FVLX3Jj0Ru4ULsbaA5WmHCo56bb1EoTA5bTuWNfVylILD71TfhzaeTkMwJ8izanV
 VlS0cZAiaEvlyrFznCirEpkNFmAgTp/XaeiV8lPbXMbvYtouoHHb+ex8uxxiSPBq
 lM8gKL28L3ZrGZZ29bIKqU3WJjt4d3bckHSytKA2keLxRbixV7tmlwaYkQF9PSCW
 bbgJ5knjJixYnN8jm8QEdtBhMOYpULK3sOeReFmJD73o5NHq+UkrdxzoQQhaa3ai
 a81wZZhDk5R1EJZehia7bQR1USXWXtjFoWsbc54bh9EG8cR/spk=
 =wMci
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme into for-6.5/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.5

 - Various cleanups all around (Irvin, Chaitanya, Christophe)
 - Better struct packing (Christophe JAILLET)
 - Reduce controller error logs for optional commands (Keith)
 - Support for >=64KiB block sizes (Daniel Gomez)
 - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)"

* tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme: (27 commits)
  nvme: forward port sysfs delete fix
  nvme: skip optional id ctrl csi if it failed
  nvme-core: use nvme_ns_head_multipath instead of ns->head->disk
  nvmet-fcloop: Do not wait on completion when unregister fails
  nvme-fabrics: open code __nvmf_host_find()
  nvme-fabrics: error out to unlock the mutex
  nvme: Increase block size variable size to 32-bit
  nvme-fcloop: no need to return from void function
  nvmet-auth: remove unnecessary break after goto
  nvmet-auth: remove some dead code
  nvme-core: remove redundant check from nvme_init_ns_head
  nvme: move sysfs code to a dedicated sysfs.c file
  nvme-fabrics: prevent overriding of existing host
  nvme-fabrics: check hostid using uuid_equal
  nvme-fabrics: unify common code in admin and io queue connect
  nvmet: reorder fields in 'struct nvmefc_fcp_req'
  nvmet: reorder fields in 'struct nvme_dhchap_queue_context'
  nvmet: reorder fields in 'struct nvmf_ctrl_options'
  nvme: reorder fields in 'struct nvme_ctrl'
  nvmet: reorder fields in 'struct nvmet_sq'
  ...
2023-06-16 09:57:40 -06:00
Keith Busch 1c606f7f05 nvme: forward port sysfs delete fix
We had a late fix that modified nvme_sysfs_delete() after the staging
branch for the next merge window relocated the function to a new file.
Port commit 2eb94dd56a ("nvme: do not let the user delete a ctrl
before a complete") to the latest to avoid a potentially confusing merge
conflict.

Cc: Maurizio Lombardi <mlombard@redhat.com>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-16 08:15:57 -07:00
Mingzhe Zou f0854489fc bcache: fixup btree_cache_wait list damage
We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

crash> struct list_head 0xffff9c801bc01210
struct list_head {
  next = 0xffffae586d8afe68,
  prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
  next = 0x0,
  prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68

[230469.019492] Call Trace:
[230469.032041]  prepare_to_wait+0x8a/0xb0
[230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788]  mca_alloc+0x2ae/0x450 [escache]
[230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382]  ? finish_wait+0x80/0x80
[230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259]  kthread+0x112/0x130
[230469.138448]  ? kthread_flush_work_fn+0x10/0x10
[230469.149477]  ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.

We must call finish_wait() to remove op from btree_cache_wait before free
it's memory address. Otherwise, the list will be damaged. Also should call
bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up
other waiters.

Fixes: 8e7102273f ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc5 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:55 -06:00
Zheng Wang 80fca8a10b bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a potential NULL pointer dereference in
caller function like a calling chain :
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:00 -06:00
Zheng Wang 028ddcac47 bcache: Remove unnecessary NULL point check in node allocations
Due to the previous fix of __bch_btree_node_alloc, the return value will
never be a NULL pointer. So IS_ERR is enough to handle the failure
situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check.

Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:43 -06:00
Andrea Tomassetti ccb8c3bd6d bcache: Remove dead references to cache_readaheads
The cache_readaheads stat counter is not used anymore and should be
removed.

Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
Thomas Weißschuh b98dd0b0a5 bcache: make kobj_type structures constant
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
ye xingchen a301b2deb6 bcache: Convert to use sysfs_emit()/sysfs_emit_at() APIs
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
Yu Kuai dd7de3704a block: fix blktrace debugfs entries leakage
Commit 99d055b4fd ("block: remove per-disk debugfs files in
blk_unregister_queue") moves blk_trace_shutdown() from
blk_release_queue() to blk_unregister_queue(), this is safe if blktrace
is created through sysfs, however, there is a regression in corner
case.

blktrace can still be enabled after del_gendisk() through ioctl if
the disk is opened before del_gendisk(), and if blktrace is not shutdown
through ioctl before closing the disk, debugfs entries will be leaked.

Fix this problem by shutdown blktrace in disk_release(), this is safe
because blk_trace_remove() is reentrant.

Fixes: 99d055b4fd ("block: remove per-disk debugfs files in blk_unregister_queue")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 20:24:03 -06:00
Yu Kuai db59133e92 scsi: sg: fix blktrace debugfs entries leakage
sg_ioctl() support to enable blktrace, which will create debugfs entries
"/sys/kernel/debug/block/sgx/", however, there is no guarantee that user
will remove these entries through ioctl, and deleting sg device doesn't
cleanup these blktrace entries.

This problem can be fixed by cleanup blktrace while releasing
request_queue, however, it's not a good idea to do this special handling
in common layer just for sg device.

Fix this problem by shutdown bltkrace in sg_device_destroy(), where the
device is deleted and all the users close the device, also grab a
scsi_device reference from sg_add_device() to prevent scsi_device to be
freed before sg_device_destroy();

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230610022003.2557284-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 20:24:03 -06:00
Yu Kuai cbe7cff4a7 blktrace: use inline function for blk_trace_remove() while blktrace is disabled
If config is disabled, call blk_trace_remove() directly will trigger
build warning, hence use inline function instead, prepare to fix
blktrace debugfs entries leakage.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 20:24:03 -06:00
Pankaj Raghav 6dd4423f3f brd: use cond_resched instead of cond_resched_rcu
The body of the loop is run without RCU lock held. Use the regular
cond_resched() instead of cond_resched_rcu().

Fixes: 786bb02458 ("brd: use XArray instead of radix-tree to index backing pages")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 11:13:07 -06:00
Ed Tsai 30654614f3 blk-mq: check on cpu id when there is only one ctx mapping
commit f168420c62 ("blk-mq: don't redirect completion for hctx withs
only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx,
there will be no remote request.

But for ufs, the submission and completion queues could be asymmetric.
(e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and
ctx won't complete request on the submission cpu. In this situation,
this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result,
check on cpu id when there is only one ctx mapping.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com
[axboe: fixed up indentation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 11:11:25 -06:00
Jens Axboe 6070131176 Merge tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block
Pull MD updates from Song:

"The major changes are:

 1. Protect md_thread with rcu, by Yu Kuai;
 2. Various non-urgent raid5 and raid1/10 fixes, by Yu Kuai;
 3. Non-urgent raid10 fixes, by Li Nan."

* tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits)
  md/raid1-10: limit the number of plugged bio
  md/raid1-10: don't handle pluged bio by daemon thread
  md/md-bitmap: add a new helper to unplug bitmap asynchrously
  md/raid1-10: submit write io directly if bitmap is not enabled
  md/raid1-10: factor out a helper to submit normal write
  md/raid1-10: factor out a helper to add bio to plug
  md/raid10: prevent soft lockup while flush writes
  md/raid10: fix io loss while replacement replace rdev
  md/raid10: Do not add spare disk when recovery fails
  md/raid10: clean up md_add_new_disk()
  md/raid10: prioritize adding disk to 'removed' mirror
  md/raid10: improve code of mrdev in raid10_sync_request
  md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
  md/raid5: don't start reshape when recovery or replace is in progress
  md: protect md_thread with rcu
  md/bitmap: factor out a helper to set timeout
  md/bitmap: always wake up md_thread in timeout_store
  dm-raid: remove useless checking in raid_message()
  md: factor out a helper to wake up md_thread directly
  md: fix duplicate filename for rdev
  ...
2023-06-14 06:58:43 -06:00
David Howells d44c404207 block: Fix dio_cleanup() to advance the head index
Fix dio_bio_cleanup() to advance the head index into the list of pages past
the pages it has released, as __blockdev_direct_IO() will call it twice if
do_direct_IO() fails.

The issue was causing:

        WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio

This can be triggered by setting up a clean pair of UDF filesystems on
loopback devices and running the generic/451 xfstest with them as the
scratch and test partitions.  Something like the following:

    fallocate /mnt2/udf_scratch -l 1G
    fallocate /mnt2/udf_test -l 1G
    mknod /dev/lo0 b 7 0
    mknod /dev/lo1 b 7 1
    losetup lo0 /mnt2/udf_scratch
    losetup lo1 /mnt2/udf_test
    mkfs -t udf /dev/lo0
    mkfs -t udf /dev/lo1
    cd xfstests
    ./check generic/451

with xfstests configured by putting the following into local.config:

    export FSTYP=udf
    export DISABLE_UDF_TEST=1
    export TEST_DEV=/dev/lo1
    export TEST_DIR=/xfstest.test
    export SCRATCH_DEV=/dev/lo0
    export SCRATCH_MNT=/xfstest.scratch

Fixes: 1ccf164ec8 ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 06:58:18 -06:00
Yu Kuai 460af1f9d9 md/raid1-10: limit the number of plugged bio
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:

Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test

Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:

[root@fedora ~]# cat /sys/block/md0/inflight
       0  4474191

Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00