linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Shin'ichiro Kawasaki	b53df2e744	block: Fix partition support for host aware zoned block devices Commit `b72053072c` ("block: allow partitions on host aware zone devices") introduced the helper function disk_has_partitions() to check if a given disk has valid partitions. However, since this function result directly depends on the disk partition table length rather than the actual existence of valid partitions in the table, it returns true even after all partitions are removed from the disk. For host aware zoned block devices, this results in zone management support to be kept disabled even after removing all partitions. Fix this by changing disk_has_partitions() to walk through the partition table entries and return true if and only if a valid non-zero size partition is found. Fixes: `b72053072c` ("block: allow partitions on host aware zone devices") Cc: stable@vger.kernel.org # 5.5 Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:54:39 -06:00
Ming Lei	cc3200eac4	blk-mq: insert flush request to the front of dispatch queue commit `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly") may change to add flush request to the tail of dispatch by applying the 'add_head' parameter of blk_mq_sched_insert_request. Turns out this way causes performance regression on NCQ controller because flush is non-NCQ command, which can't be queued when there is any in-flight NCQ command. When adding flush rq to the front of hctx->dispatch, it is easier to introduce extra time to flush rq's latency compared with adding to the tail of dispatch queue because of S_SCHED_RESTART, then chance of flush merge is increased, and less flush requests may be issued to controller. So always insert flush request to the front of dispatch queue just like before applying commit `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly"). Cc: Damien Le Moal <Damien.LeMoal@wdc.com> Cc: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `01e99aeca3` ("blk-mq: insert passthrough request into hctx->dispatch directly") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-12 07:26:12 -06:00
Tejun Heo	dcd6589b11	blk-iocost: fix incorrect vtime comparison in iocg_is_idle() vtimes may wrap and time_before/after64() should be used to determine whether a given vtime is before or after another. iocg_is_idle() was incorrectly using plain "<" comparison do determine whether done_vtime is before vtime. Here, the only thing we're interested in is whether done_vtime matches vtime which indicates that there's nothing in flight. Let's test for inequality instead. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-10 11:37:00 -06:00
Carlo Nonato	14afc59361	block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group() The bfq_find_set_group() function takes as input a blkcg (which represents a cgroup) and retrieves the corresponding bfq_group, then it updates the bfq internal group hierarchy (see comments inside the function for why this is needed) and finally it returns the bfq_group. In the hierarchy update cycle, the pointer holding the correct bfq_group that has to be returned is mistakenly used to traverse the hierarchy bottom to top, meaning that in each iteration it gets overwritten with the parent of the current group. Since the update cycle stops at root's children (depth = 2), the overwrite becomes a problem only if the blkcg describes a cgroup at a hierarchy level deeper than that (depth > 2). In this case the root's child that happens to be also an ancestor of the correct bfq_group is returned. The main consequence is that processes contained in a cgroup at depth greater than 2 are wrongly placed in the group described above by BFQ. This commits fixes this problem by using a different bfq_group pointer in the update cycle in order to avoid the overwrite of the variable holding the original group reference. Reported-by: Kwon Je Oh <kwonje.oh2@gmail.com> Signed-off-by: Carlo Nonato <carlo.nonato95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-06 07:00:58 -07:00
Daniel Wagner	e959e5405f	block: Remove used kblockd_schedule_work_on() Commit `ee63cfa7fc` ("block: add kblockd_schedule_work_on()") introduced the helper in 2016. Remove it because since then no caller was added. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-02 07:17:31 -07:00
John Garry	cae740a04b	blk-mq: Remove some unused function arguments The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(), blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove it. Overall obj code size shows a minor reduction, before: text data bss dec hex filename 27306 1312 0 28618 6fca block/blk-mq.o 4303 272 0 4575 11df block/blk-mq-tag.o after: 27282 1312 0 28594 6fb2 block/blk-mq.o 4311 272 0 4583 11e7 block/blk-mq-tag.o Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> -- This minor patch had been carried as part of the blk-mq shared tags RFC, I'd rather not carry it anymore as it required rebasing, so now or never.. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-26 10:34:41 -07:00
Ming Lei	01e99aeca3	blk-mq: insert passthrough request into hctx->dispatch directly For some reason, device may be in one situation which can't handle FS request, so STS_RESOURCE is always returned and the FS request will be added to hctx->dispatch. However passthrough request may be required at that time for fixing the problem. If passthrough request is added to scheduler queue, there isn't any chance for blk-mq to dispatch it given we prioritize requests in hctx->dispatch. Then the FS IO request may never be completed, and IO hang is caused. So passthrough request has to be added to hctx->dispatch directly for fixing the IO hang. Fix this issue by inserting passthrough request into hctx->dispatch directly together withing adding FS request to the tail of hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert(). Then it becomes consistent with original legacy IO request path, in which passthrough request is always added to q->queue_head. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-24 18:50:48 -07:00
Linus Torvalds	ed535f2c9e	block-5.6-2020-02-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9 sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M 1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6 K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2 5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ /xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b 5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM SitalHw0KA== =05d8 -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later arrivals, but all fixes at this point: - bcache fix series (Coly) - Series of BFQ fixes (Paolo) - NVMe pull request from Keith with a few minor NVMe fixes - Various little tweaks" * tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits) nvmet: update AEN list and array at one place nvmet: Fix controller use after free nvmet: Fix error print message at nvmet_install_queue function brd: check and limit max_part par nvme-pci: remove nvmeq->tags nvmet: fix dsm failure when payload does not match sgl descriptor nvmet: Pass lockdep expression to RCU lists block, bfq: clarify the goal of bfq_split_bfqq() block, bfq: get a ref to a group when adding it to a service tree block, bfq: remove ifdefs from around gets/puts of bfq groups block, bfq: extend incomplete name of field on_st block, bfq: get extra ref to prevent a queue from being freed during a group move block, bfq: do not insert oom queue into position tree block, bfq: do not plug I/O for bfq_queues with no proc refs bcache: check return value of prio_read() bcache: fix incorrect data type usage in btree_flush_write() bcache: add readahead cache policy options via sysfs interface bcache: explicity type cast in bset_bkey_last() bcache: fix memory corruption in bch_cache_accounting_clear() xen/blkfront: limit allocated memory size to actual use case ...	2020-02-06 06:15:23 +00:00
Paolo Valente	c92bddee77	block, bfq: clarify the goal of bfq_split_bfqq() The exact, general goal of the function bfq_split_bfqq() is not that apparent. Add a comment to make it clear. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	db37a34c56	block, bfq: get a ref to a group when adding it to a service tree BFQ schedules generic entities, which may represent either bfq_queues or groups of bfq_queues. When an entity is inserted into a service tree, a reference must be taken, to make sure that the entity does not disappear while still referred in the tree. Unfortunately, such a reference is mistakenly taken only if the entity represents a bfq_queue. This commit takes a reference also in case the entity represents a group. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Chris Evich <cevich@redhat.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	4d8340d0d4	block, bfq: remove ifdefs from around gets/puts of bfq groups ifdefs around gets and puts of bfq groups reduce readability, remove them. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	33a16a9804	block, bfq: extend incomplete name of field on_st The flag on_st in the bfq_entity data structure is true if the entity is on a service tree or is in service. Yet the name of the field, confusingly, does not mention the second, very important case. Extend the name to mention the second case too. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	ecedd3d7e1	block, bfq: get extra ref to prevent a queue from being freed during a group move In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group may happen to be deactivated in the scheduling data structures of the source group (and then activated in the destination group). If Q is referred only by the data structures in the source group when the deactivation happens, then Q is freed upon the deactivation. This commit addresses this issue by getting an extra reference before the possible deactivation, and releasing this extra reference after Q has been moved. Tested-by: Chris Evich <cevich@redhat.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	32c59e3a9a	block, bfq: do not insert oom queue into position tree BFQ maintains an ordered list, implemented with an RB tree, of head-request positions of non-empty bfq_queues. This position tree, inherited from CFQ, is used to find bfq_queues that contain I/O close to each other. BFQ merges these bfq_queues into a single shared queue, if this boosts throughput on the device at hand. There is however a special-purpose bfq_queue that does not participate in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be wrongly added to the position tree. So bfqq_find_close() could return the oom bfq_queue, which is a source of further troubles in an out-of-memory situation. This commit prevents the oom bfq_queue from being inserted into the position tree. Tested-by: Patrick Dung <patdung100@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:15 -07:00
Paolo Valente	f718b09327	block, bfq: do not plug I/O for bfq_queues with no proc refs Commit `478de3380c` ("block, bfq: deschedule empty bfq_queues not referred by any process") fixed commit `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging") by descheduling an empty bfq_queue when it remains with not process reference. Yet, this still left a case uncovered: an empty bfq_queue with not process reference that remains in service. This happens for an in-service sync bfq_queue that is deemed to deserve I/O-dispatch plugging when it remains empty. Yet no new requests will arrive for such a bfq_queue if no process sends requests to it any longer. Even worse, the bfq_queue may happen to be prematurely freed while still in service (because there may remain no reference to it any longer). This commit solves this problem by preventing I/O dispatch from being plugged for the in-service bfq_queue, if the latter has no process reference (the bfq_queue is then prevented from remaining in service). Fixes: `3726112ec7` ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Patrick Dung <patdung100@gmail.com> Tested-by: Patrick Dung <patdung100@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-03 06:58:14 -07:00
Linus Torvalds	33c84e89ab	SCSI misc on 20200129 This series is slightly unusual because it includes Arnd's compat ioctl tree here: `1c46a2cf2d` Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue Excluding Arnd's changes, this is mostly an update of the usual drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There are a couple of core and base updates around error propagation and atomicity in the attribute container base we use for the SCSI transport classes. The rest is minor changes and updates. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXjHQJyYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZZ8AQC02N+v iUnTl1YxGPjIWBbnHuUxN2Qbb9D3C6gAT1LkigEArlk163K3A1XEQHF/VNCdAz/f 01XYTd3p1VHuegIBHlk= =Cn52 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series is slightly unusual because it includes Arnd's compat ioctl tree here: `1c46a2cf2d` Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue Excluding Arnd's changes, this is mostly an update of the usual drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas. There are a couple of core and base updates around error propagation and atomicity in the attribute container base we use for the SCSI transport classes. The rest is minor changes and updates" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits) scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity scsi: hisi_sas: Modify the file permissions of trigger_dump to write only scsi: hisi_sas: Replace magic number when handle channel interrupt scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock scsi: hisi_sas: use threaded irq to process CQ interrupts scsi: ufs: Use UFS device indicated maximum LU number scsi: ufs: Add max_lu_supported in struct ufs_dev_info scsi: ufs: Delete is_init_prefetch from struct ufs_hba scsi: ufs: Inline two functions into their callers scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init() scsi: ufs: Split ufshcd_probe_hba() based on its called flow scsi: ufs: Delete struct ufs_dev_desc scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails scsi: ufs-mediatek: enable low-power mode for hibern8 state scsi: ufs: export some functions for vendor usage scsi: ufs-mediatek: add dbg_register_dump implementation scsi: qla2xxx: Fix a NULL pointer dereference in an error path scsi: qla1280: Make checking for 64bit support consistent scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1 ...	2020-01-29 18:16:16 -08:00
Linus Torvalds	48b4b4ff1e	for-5.6/block-2020-01-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOqAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppYBD/wLczY7hyjF2loc71MC9HloUq3BVbATktM3 OF6wRbyxbeiOj/7Px0lE0M67tQbnEoIP26gS03fd6e7HE19//gmzGuB3Z2R2CJ5q XKkTamqz0pcPcX5FdDO5JFQZf27/1Qs3g7Nkr7FjVcR2XQ8PFv5B/FLMhse4frJI k92Sj0V1OwdNtMXozKqno/7xPwL/kQKWoF6aFDgO27xLfsFmi8Wbgf/CslOOTHIN vAUaz3Cue6V17M5y98wD4nwpjG7Ve+aY1i6oFPBE7Az9TA0xoiBA/tNPKW7iS10C GEP1aoI6lpgkxAzvyR29K1ayjzV11hEIig3rNIWxNfmCSGaawttWXAPEi7jU5u2D ZXbzUJxKnfeg8yrAj0CTcKLA9i4v1cZXPCUXqMO2+wHEWgmxq2IWuWjSl/V4fn3Y zgTPBngDM4Gx3fAqvD8SVfCW7xwI4VRP+da58WCFOjwnOgYSouxS7RnCtm+yPUbk Es6m2XBb+3ycaJPT58LcXPrnTJWZeRincs3MfFJeTXRn5T7IzlBjKdIvQiQSHQXo caZzWHEJW827+wfQFNreXpk5KPi+D6boeziYe96UcII8L5qVw3N0X5hOpr6IRhkX hn2CUb/CmY6bl8PJJPVc4ygqgiavyvynJu+A0uJvFSjvXX6jjXNEsSJ6bz8aBxdm 4rmgPFTlqA== =yJZi -----END PGP SIGNATURE----- Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This may be the most quiet round we've had in years. I'm not complaining. Really not a lot to detail here, outside of spelling and documentation improvements/fixes, we have: - Allow t10-pi to be modular (Herbert) - Remove dead code in bfq (Alex) - Mark zone management requests with REQ_SYNC (Chaitanya) - BFQ division improvement (Wen) - Small series improving plugging (Pavel)" * tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block: partitions/ldm: fix spelling mistake "to" -> "too" block, bfq: improve arithmetic division in bfq_delta() block/bfq: remove unused bfq_class_rt which never used block: mark zone-mgmt bios with REQ_SYNC blk-mq: Document functions for sending request block: Allow t10-pi to be modular blk-mq: optimise blk_mq_flush_plug_list() list: introduce list_for_each_continue() blk-mq: optimise rq sort function	2020-01-27 12:38:25 -08:00
Christoph Hellwig	b72053072c	block: allow partitions on host aware zone devices Host-aware SMR drives can be used with the commands to explicitly manage zone state, but they can also be used as normal disks. In the former case it makes perfect sense to allow partitions on them, in the latter it does not, just like for host managed devices. Add a check to add_partition to allow partitions on host aware devices, but give up any zone management capabilities in that case, which also catches the previously missed case of adding a partition vs just scanning it. Because sd can rescan the attribute at runtime it needs to check if a disk has partitions, for which a new helper is added to genhd.h. Fixes: `5eac3eb30c` ("block: Remove partition support for zoned block devices") Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-26 09:59:08 -07:00
Colin Ian King	5336da37a5	partitions/ldm: fix spelling mistake "to" -> "too" There is a spelling mistake in a ldm_error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-23 11:41:45 -07:00
Wen Yang	554d21efb0	block, bfq: improve arithmetic division in bfq_delta() do_div() does a 64-by-32 division. Use div64_ul() instead of it if the divisor is unsigned long, to avoid truncation to 32-bit. And as a nice side effect also cleans up the function a bit. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@fb.com> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 10:34:11 -07:00
Alex Shi	b7f22d993f	block/bfq: remove unused bfq_class_rt which never used This macro is never used after introduced from commit `aee69d78de` ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Better to remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-22 10:31:20 -07:00
Mikulas Patocka	ad6bf88a6c	block: fix an integer overflow in logical block size Logical block size has type unsigned short. That means that it can be at most 32768. However, there are architectures that can run with 64k pages (for example arm64) and on these architectures, it may be possible to create block devices with 64k block size. For exmaple (run this on an architecture with 64k pages): Mount will fail with this error because it tries to read the superblock using 2-sector access: device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536 EXT4-fs (dm-0): unable to read superblock This patch changes the logical block size from unsigned short to unsigned int to avoid the overflow. Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-15 21:43:09 -07:00
Ming Lei	4a2f704eb2	block: fix get_max_segment_size() overflow on 32bit arch Commit `429120f3df` starts to take account of segment's start dma address when computing max segment size, and data type of 'unsigned long' is used to do that. However, the segment mask may be 0xffffffff, so the figured out segment size may be overflowed in case of zero physical address on 32bit arch. Fix the issue by returning queue_max_segment_size() directly when that happens. Fixes: `429120f3df` ("block: fix splitting segments on boundary masks") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-14 13:37:40 -07:00
Ming Lei	83c9c54716	fs: move guard_bio_eod() after bio_set_op_attrs Commit `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers. So bio_trunacate() may retrieve wrong 'op', and zering pages may not be done for READ bio. Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve correct op info. Meantime remove the 'op' parameter from guard_bio_eod() because it isn't used any more. Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixes: `85a8ce62c2` ("block: add bio_truncate to fix guard_bio_eod") Signed-off-by: Ming Lei <ming.lei@redhat.com> Fold in kerneldoc and bio_op() change. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-09 08:16:12 -07:00
Chaitanya Kulkarni	8e42d239cb	block: mark zone-mgmt bios with REQ_SYNC In the current implementation, final zone-mgmt request is issued with submit_bio_wait() which marks the bio REQ_SYNC. This is needed since immediate action is expected for zone-mgmt requests as these are blocking operations. This also bypasses the scheduler in the blk_mq_make_request() and dispatches the request directly into the hw ctx. This patch marks all the chained bios REQ_SYNC so that we can have above-mentioned behavior for non-final bios also. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-09 07:59:12 -07:00
André Almeida	105663f73e	blk-mq: Document functions for sending request Add or improve documentation for function regarding creating and sending IO requests to the hardware. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-06 21:00:27 -07:00
Herbert Xu	a754bd5f18	block: Allow t10-pi to be modular Currently t10-pi can only be built into the block layer which via crc-t10dif pulls in a whole chunk of the Crypto API. In fact all users of t10-pi work as modules and there is no reason for it to always be built-in. This patch adds a new hidden option for t10-pi that is selected automatically based on BLK_DEV_INTEGRITY and whether the users of t10-pi are built-in or not. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-06 20:59:04 -07:00
Arnd Bergmann	9b81648cb5	compat_ioctl: simplify up block/ioctl.c Having separate implementations of blkdev_ioctl() often leads to these getting out of sync, despite the comment at the top. Since most of the ioctl commands are compatible, and we try very hard not to add any new incompatible ones, move all the common bits into a shared function and leave only the ones that are historically different in separate functions for native/compat mode. To deal with the compat_ptr() conversion, pass both the integer argument and the pointer argument into the new blkdev_common_ioctl() and make sure to always use the correct one of these. blkdev_ioctl() is now only kept as a separate exported interfact for drivers/char/raw.c, which lacks a compat_ioctl variant. We should probably either move raw.c to staging if there are no more users, or export blkdev_compat_ioctl() as well. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	5fb889f587	compat_ioctl: block: simplify compat_blkpg_ioctl() There is no need to go through a compat_alloc_user_space() copy any more, just wrap the function in a small helper that works the same way for native and compat mode. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	bdc1ddad3e	compat_ioctl: block: move blkdev_compat_ioctl() into ioctl.c Having both in the same file allows a number of simplifications to the compat path, and makes it more likely that changes to the native path get applied to the compat version as well. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	1df23c6fe5	compat_ioctl: move HDIO ioctl handling into drivers/ide Most of the HDIO ioctls are only used by the obsolete drivers/ide subsystem, these can be handled by changing ide_cmd_ioctl() to be aware of compat mode and doing the correct transformations in place and using it as both native and compat handlers for all drivers. The SCSI drivers implementing the same commands are already doing this in the drivers, so the compat_blkdev_driver_ioctl() function is no longer needed now. The BLKSECTSET and HDIO_GETGEO_BIG ioctls are not implemented in any driver any more and no longer need any conversion. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	64cbfa9655	compat_ioctl: move cdrom commands into cdrom.c There is no need for the special cases for the cdrom ioctls any more now, so make sure that each cdrom driver has a .compat_ioctl() callback and calls cdrom_compat_ioctl() directly there. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:42:52 +01:00
Arnd Bergmann	fe0da4e5e8	compat_ioctl: bsg: add handler bsg_ioctl() calls into scsi_cmd_ioctl() for a couple of generic commands and relies on fs/compat_ioctl.c to handle it correctly in compat mode. Adding a private compat_ioctl() handler avoids that round-trip and lets us get rid of the generic emulation once this is done. Note that bsg implements an SG_IO command that is different from the other drivers and does not need emulation. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:21 +01:00
Arnd Bergmann	8f8f562038	compat_ioctl: move CDROMREADADIO to cdrom.c Again, there is only one file that needs this, so move the conversion handler into the native implementation. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:08 +01:00
Arnd Bergmann	f3ee6e63a9	compat_ioctl: move CDROM_SEND_PACKET handling into scsi There is only one implementation of this ioctl, so move the handling out of the common block layer code into the place where it's actually needed. It also gets called indirectly through pktcdvd, which needs to be aware of this change. As I noticed, the old implementation of the compat handler failed to convert the structure on the way out, so the updated fields never got written back to user space. This is either not important, or it has never worked and should be fixed now. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:33:05 +01:00
Arnd Bergmann	ee6a129dff	compat_ioctl: block: add blkdev_compat_ptr_ioctl A lot of block drivers need only a trivial .compat_ioctl callback. Add a helper function that can be set as the callback pointer to only convert the argument using the compat_ptr() conversion and otherwise assume all input and output data is compatible, or handled using in_compat_syscall() checks. This mirrors the compat_ptr_ioctl() helper function used in character devices. Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:32:59 +01:00
Arnd Bergmann	78ed001d9e	compat: scsi: sg: fix v3 compat read/write interface In the v5.4 merge window, a cleanup patch from Al Viro conflicted with my rework of the compat handling for sg.c read(). Linus Torvalds did a correct merge but pointed out that the resulting code is still unsatisfactory. I later noticed that the sg_new_read() function still gets the compat mode wrong, when the 'count' argument is large enough to pass a compat_sg_io_hdr object, but not a nativ sg_io_hdr. To address both of these, move the definition of compat_sg_io_hdr into a scsi/sg.h to make it visible to sg.c and rewrite the logic for reading req_pack_id as well as the size check to a simpler version that gets the expected results. Fixes: `c35a5cfb41` ("scsi: sg: sg_read(): simplify reading ->pack_id of userland sg_io_hdr_t") Fixes: `98aaaec4a1` ("compat_ioctl: reimplement SG_IO handling") Reviewed-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2020-01-03 09:32:54 +01:00
Ming Lei	429120f3df	block: fix splitting segments on boundary masks We ran into a problem with a mpt3sas based controller, where we would see random (and hard to reproduce) file corruption). The issue seemed specific to this controller, but wasn't specific to the file system. After a lot of debugging, we find out that it's caused by segments spanning a 4G memory boundary. This shouldn't happen, as the default setting for segment boundary masks is 4G. Turns out there are two issues in get_max_segment_size(): 1) The default segment boundary mask is bypassed 2) The segment start address isn't taken into account when checking segment boundary limit Fix these two issues by removing the bypass of the segment boundary check even if the mask is set to the default value, and taking into account the actual start address of the request when checking if a segment needs splitting. Cc: stable@vger.kernel.org # v5.1+ Reviewed-by: Chris Mason <clm@fb.com> Tested-by: Chris Mason <clm@fb.com> Fixes: `dcebd75592` ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Dropped const on the page pointer, ppc page_to_phys() doesn't mark the page as const... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-30 08:51:18 -07:00
Ming Lei	85a8ce62c2	block: add bio_truncate to fix guard_bio_eod Some filesystem, such as vfat, may send bio which crosses device boundary, and the worse thing is that the IO request starting within device boundaries can contain more than one segment past EOD. Commit `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") tries to fix this issue by returning -EIO for this situation. However, this way lets fs user code lose chance to handle -EIO, then sync_inodes_sb() may hang for ever. Also the current truncating on last segment is dangerous by updating the last bvec, given bvec table becomes not immutable any more, and fs bio users may not retrieve the truncated pages via bio_for_each_segment_all() in its .end_io callback. Fixes this issue by supporting multi-segment truncating. And the approach is simpler: - just update bio size since block layer can make correct bvec with the updated bio size. Then bvec table becomes really immutable. - zero all truncated segments for read bio Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: linux-fsdevel@vger.kernel.org Fixed-by: `dce30ca9e3` ("fs: fix guard_bio_eod to check for real EOD errors") Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-28 09:44:56 -07:00
Arnd Bergmann	b2c0fcd287	compat_ioctl: block: handle Persistent Reservations These were added to blkdev_ioctl() in linux-5.5 but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.4+ Fixes: `bbd3e06436` ("block: add an API for Persistent Reservations") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fold in followup patch from Arnd with missing pr.h header include. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:56 -07:00
Arnd Bergmann	4b43f31d65	compat_ioctl: block: handle add zone open, close and finish ioctl These were added to blkdev_ioctl() in linux-5.5 but not blkdev_compat_ioctl, so add them now. Fixes: `e876df1fe0` ("block: add zone open, close and finish ioctl support") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:41 -07:00
Arnd Bergmann	21d3734091	compat_ioctl: block: handle BLKGETZONESZ/BLKGETNRZONES These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.20+ Fixes: `72cd87576d` ("block: Introduce BLKGETZONESZ ioctl") Fixes: `65e4e3eee8` ("block: Introduce BLKGETNRZONES ioctl") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:41 -07:00
Arnd Bergmann	673bdf8ce0	compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONE These were added to blkdev_ioctl() but not blkdev_compat_ioctl, so add them now. Cc: <stable@vger.kernel.org> # v4.10+ Fixes: `3ed05a987e` ("blk-zoned: implement ioctls") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-21 07:26:40 -07:00
Yang Yingliang	3b7995a98a	block: fix memleak when __blk_rq_map_user_iov() is failed When I doing fuzzy test, get the memleak report: BUG: memory leak unreferenced object 0xffff88837af80000 (size 4096): comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ............... backtrace: [<000000001c894df8>] bio_alloc_bioset+0x393/0x590 [<000000008b139a3c>] bio_copy_user_iov+0x300/0xcd0 [<00000000a998bd8c>] blk_rq_map_user_iov+0x2f1/0x5f0 [<000000005ceb7f05>] blk_rq_map_user+0xf2/0x160 [<000000006454da92>] sg_common_write.isra.21+0x1094/0x1870 [<00000000064bb208>] sg_write.part.25+0x5d9/0x950 [<000000004fc670f6>] sg_write+0x5f/0x8c [<00000000b0d05c7b>] __vfs_write+0x7c/0x100 [<000000008e177714>] vfs_write+0x1c3/0x500 [<0000000087d23f34>] ksys_write+0xf9/0x200 [<000000002c8dbc9d>] do_syscall_64+0x9f/0x4f0 [<00000000678d8e9a>] entry_SYSCALL_64_after_hwframe+0x49/0xbe If __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(), the bio(s) which is allocated before this failing will leak. The refcount of the bio(s) is init to 1 and increased to 2 by calling bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so the bio cannot be freed. Fix it by calling blk_rq_unmap_user(). Reviewed-by: Bob Liu <bob.liu@oracle.com> Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Bart Van Assche	b3c6a59975	block: Fix a lockdep complaint triggered by request queue flushing Avoid that running test nvme/012 from the blktests suite triggers the following false positive lockdep complaint: ============================================ WARNING: possible recursive locking detected 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted -------------------------------------------- ksoftirqd/1/16 is trying to acquire lock: 000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 but task is already holding lock: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&fq->mq_flush_lock)->rlock); lock(&(&fq->mq_flush_lock)->rlock); * DEADLOCK * May be due to missing lock nesting notation 1 lock held by ksoftirqd/1/16: #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0 stack backtrace: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: dump_stack+0x67/0x90 __lock_acquire.cold.45+0x2b4/0x313 lock_acquire+0x98/0x160 _raw_spin_lock_irqsave+0x3b/0x80 flush_end_io+0x4e/0x1d0 blk_mq_complete_request+0x76/0x110 nvmet_req_complete+0x15/0x110 [nvmet] nvmet_bio_done+0x27/0x50 [nvmet] blk_update_request+0xd7/0x2d0 blk_mq_end_request+0x1a/0x100 blk_flush_complete_seq+0xe5/0x350 flush_end_io+0x12f/0x1d0 blk_done_softirq+0x9f/0xd0 __do_softirq+0xca/0x440 run_ksoftirqd+0x24/0x50 smpboot_thread_fn+0x113/0x1e0 kthread+0x121/0x140 ret_from_fork+0x3a/0x50 Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Bart Van Assche	c44a4edb20	block: Fix the type of 'sts' in bsg_queue_rq() This patch fixes the following sparse warnings: block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types) block/bsg-lib.c:269:19: expected int sts block/bsg-lib.c:269:19: got restricted blk_status_t [usertype] block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types) block/bsg-lib.c:286:16: expected restricted blk_status_t block/bsg-lib.c:286:16: got int [assigned] sts Cc: Martin Wilck <mwilck@suse.com> Fixes: `d46fe2cb2d` ("block: drop device references in bsg_queue_rq()") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-20 11:52:01 -07:00
Pavel Begunkov	95ed0c5b12	blk-mq: optimise blk_mq_flush_plug_list() Instead of using list_del_init() in a loop, that generates a lot of unnecessary memory read/writes, iterate from the first request of a batch and cut out a sublist with list_cut_before(). Apart from removing the list node initialisation part, this is more register-friendly, and the assembly uses the stack less intensively. list_empty() at the beginning is done with hope, that the compiler can optimise out the same check in the following list_splice_init(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-19 06:08:50 -07:00
Pavel Begunkov	7d30a62102	blk-mq: optimise rq sort function Check "!=" in multi-layer comparisons. The same memory usage, fewer instructions, and 2 from 4 jumps are replaced with SETcc. Note, that list_sort() doesn't differ 0 and <0. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-19 06:08:50 -07:00
Roman Penyaev	c58c1f8343	block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT Non-mq devs do not honor REQ_NOWAIT so give a chance to the caller to repeat request gracefully on -EAGAIN error. The problem is well reproduced using io_uring: mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt # Preallocate a file dd if=/dev/zero of=/mnt/file bs=1M count=1 # Start fio with io_uring and get -EIO fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-17 09:01:43 -07:00
Tejun Heo	d7bd15a138	iocost: over-budget forced IOs should schedule async delay When over-budget IOs are force-issued through root cgroup, iocg_kick_delay() adjusts the async delay accordingly but doesn't actually schedule async throttle for the issuing task. This bug is pretty well masked because sooner or later the offending threads are gonna get directly throttled on regular IOs or have async delay scheduled by mem_cgroup_throttle_swaprate(). However, it can affect control quality on filesystem metadata heavy operations. Let's fix it by invoking blkcg_schedule_throttle() when iocg_kick_delay() says async delay is needed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `7caa47151a` ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-12-16 16:10:17 -07:00

1 2 3 4 5 ...

4751 Commits