Commit Graph

1136696 Commits

Author SHA1 Message Date
Kemeng Shi 37754595e9 blk-cgroup: Fix typo in comment
Replace assocating with associating.
Replace intiailized with initialized.

Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221206093307.378249-1-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-08 09:17:21 -07:00
Jens Axboe 8f415307c3 nvme updates for Linux 6.2

Merge tag 'nvme-6.2-2022-12-07' of git://git.infradead.org/nvme into for-6.2/block

Pull NVMe updates from Christoph:

"nvme updates for Linux 6.2

 - fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
 - use the common tagset helpers in nvme-pci driver (Christoph Hellwig)
 - cleanup the nvme-pci removal path (Christoph Hellwig)
 - use kstrtobool() instead of strtobool (Christophe JAILLET)
 - allow unprivileged passthrough of Identify Controller (Joel Granados)
 - support io stats on the mpath device (Sagi Grimberg)
 - minor nvmet cleanup (Sagi Grimberg)"

* tag 'nvme-6.2-2022-12-07' of git://git.infradead.org/nvme: (22 commits)
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  nvme-apple: fix controller shutdown in apple_nvme_disable
  nvme-fc: move common code into helper
  nvme-fc: avoid null pointer dereference
  nvme: allow unprivileged passthrough of Identify Controller
  nvme-multipath: support io stats on the mpath device
  nvme: introduce nvme_start_request
  ...
2022-12-07 12:39:37 -07:00
Christoph Hellwig c34b7ac650 block: remove bio_set_op_attrs
This macro is obsolete, so replace the last few uses with open coded
bi_opf assignments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 09:43:12 -07:00
Sagi Grimberg 19b00e0069 nvmet: don't open-code NVME_NS_ATTR_RO enumeration
It is already there, just go ahead and use it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-07 15:03:09 +01:00
Christoph Hellwig 0da7feaa59 nvme-pci: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:31 +01:00
Christoph Hellwig 93b24f579c nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
Add the apple shared tag workaround to nvme_alloc_io_tag_set to prepare
for using that helper in the PCIe driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:27 +01:00
Christoph Hellwig b794d1c2ad nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
The reserved_tags are only needed for fabrics controllers.  Right now only
fabrics drivers call this helper, so this is harmless, but we'll use it
in the PCIe driver soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:24 +01:00
Christoph Hellwig db45e1a5dd nvme: consolidate setting the tagset flags
All nvme transports should be using the same flags for their tagsets,
with the exception for the blocking flag that should only be set for
transports that can block in ->queue_rq.

Add a NVME_F_BLOCKING flag to nvme_ctrl_ops to control the blocking
behavior and lift setting the flags into nvme_alloc_{admin,io}_tag_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:20 +01:00
Christoph Hellwig dcef77274a nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
Don't look at ctrl->ops as only RDMA and TCP actually support multiple
maps.

Fixes: 6dfba1c09c ("nvme-fc: use the tagset alloc/free helpers")
Fixes: ceee1953f9 ("nvme-loop: use the tagset alloc/free helpers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-12-07 15:02:15 +01:00
Christoph Hellwig db1c7d7797 block: bio_copy_data_iter
With the pktcdvd removal, bio_copy_data_iter is unused now.  Fold the
logic into bio_copy_data and remove the separate lower level function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221206144407.722049-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-06 10:18:27 -07:00
Christoph Hellwig 68e81eba67 nvme-pci: split out a nvme_pci_ctrl_is_dead helper
Clean up nvme_dev_disable by splitting the logic to detect if a
controller is dead into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:38:19 +01:00
Christoph Hellwig 8cb9f10b71 nvme-pci: return early on ctrl state mismatch in nvme_reset_work
The only way nvme_reset_work could be called when not in resetting state
is if a reset and remove happen near the same time.  This should not
happen, but if it did we don't want the reset work to disable the
controller because the remove is already doing that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig 7d879c90ae nvme-pci: rename nvme_disable_io_queues
This function really deletes the I/O queues, so rename it to match
the functionality.  Also move the main wrapper right next to the
actual underlying implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig 10981f23a1 nvme-pci: cleanup nvme_suspend_queue
Remove the unused return value, pass a dev + qid instead of the queue
as that is better for the callers as well as the function itself, and
remove the entirely pointless kerneldoc comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig c80767f770 nvme-pci: remove nvme_pci_disable
nvme_pci_disable has a single caller, fold it into that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig 47d42d229a nvme-pci: remove nvme_disable_admin_queue
nvme_disable_admin_queue has only a single caller, and just calls two
other functions, so remove it to clean up the remove path a little more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2022-12-06 14:36:54 +01:00
Christoph Hellwig 285b6e9b57 nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
Many of the callers decide which one to use based on a bool argument and
there is at least some code to be shared, so merge these two.  Also
move a comment specific to a single callsite to that callsite.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hector Martin <marcan@marcan.st>
2022-12-06 14:36:54 +01:00
Christoph Hellwig e6d275de2e nvme: use nvme_wait_ready in nvme_shutdown_ctrl
Refactor the code to wait for CSTS state changes so that it can be reused
by nvme_shutdown_ctrl.  This reduces the delay between each iteration
that checks CSTS from 100ms in the shutdown code to the 1 to 2ms range
done during enable, matching the changes from commit 3e98c2443f that
were only applied to the enable/disable path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
2022-12-06 14:36:53 +01:00
Christoph Hellwig c76b8308e4 nvme-apple: fix controller shutdown in apple_nvme_disable
nvme_shutdown_ctrl already shuts the controller down, there is no
need to also call nvme_disable_ctrl for the shutdown case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hector Martin <marcan@marcan.st>
2022-12-06 14:36:51 +01:00
Chaitanya Kulkarni b296958557 nvme-fc: move common code into helper
Add a helper to move the duplicate code for error message
from nvme_fc_rcv_ls_req() to nvme_fc_rcv_ls_req_err_msg().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:24 +01:00
Chaitanya Kulkarni 6c90294d72 nvme-fc: avoid null pointer dereference
Before using the dynamically allocated variable lsop in
nvme_fc_rcv_ls_req(), add a check for NULL and error out early.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:24 +01:00
Joel Granados ea43fceea4 nvme: allow unprivileged passthrough of Identify Controller
Add unprivileged passthrough of the I/O Command Set Independent and I/O
Command Set Specific Identify Controller sub-command.

This will allow access to attributes (e.g. MDTS and WZSL) that are needed
to effectively form passthrough I/O to the /dev/ng* character devices.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:27:03 +01:00
Sagi Grimberg d4d957b53d nvme-multipath: support io stats on the mpath device
Our mpath stack device is just a shim that selects a bottom namespace
and submits the bio to it without any fancy splitting. This also means
that we don't clone the bio or have any context to the bio beyond
submission. However it really sucks that we don't see the mpath device
io stats.

Given that the mpath device can't do that without adding some context
to it, we let the bottom device do it on its behalf (somewhat similar
to the approach taken in nvme_trace_bio_complete).

When the IO starts, we account the request for multipath IO stats using
REQ_NVME_MPATH_IO_STATS nvme_request flag to avoid queue io stats disable
in the middle of the request.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2022-12-06 09:17:01 +01:00
Sagi Grimberg 6887fc6495 nvme: introduce nvme_start_request
In preparation for nvme-multipath IO stats accounting, we want the
accounting to happen in a centralized place. The request completion
is already centralized, but we need a common helper to request I/O
start.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2022-12-06 09:16:57 +01:00
Christophe JAILLET 99722c8aa8 nvme: use kstrtobool() instead of strtobool()
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file (<linux/kstrtox.h>)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-12-06 09:16:56 +01:00
Christoph Hellwig ba0718a6d6 nvme: don't call blk_mq_{,un}quiesce_tagset when ctrl->tagset is NULL
The NVMe drivers support a mode where no tagset is allocated for the I/O
queues and only the admin queue is usable.  In that case ctrl->tagset is
NULL and we must not call the block per-tagset quiesce helpers that
dereference it.

Fixes: 98d81f0df7 ("nvme: use blk_mq_[un]quiesce_tagset")
Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
Reported-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
2022-12-06 09:16:56 +01:00
Kemeng Shi eea3e8b74a blk-throttle: Use more suitable time_after check for update of slice_start
There is no need to update tg->slice_start[rw] to start when they are
already equal. So remove the "eq" part of the check before updating slice_start.
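
The difference between the two checks can be sketched in userspace as below. This is only an illustration: time_after_ul() and time_after_eq_ul() are local copies of the wraparound-safe comparison idiom behind the kernel's time_after()/time_after_eq(), and update_slice_start() is a hypothetical helper, not the function touched by this commit.

```c
#include <stdbool.h>

/* Userspace copies of the wraparound-safe jiffies comparison idiom.
 * Like the kernel macros, they rely on the usual two's-complement
 * conversion from unsigned long to long. */
static inline bool time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;	/* true when a is strictly later than b */
}

static inline bool time_after_eq_ul(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;	/* true when a is later than or equal to b */
}

/* Hypothetical helper: returns true only when slice_start was written. */
static inline bool update_slice_start(unsigned long *slice_start,
				      unsigned long start)
{
	if (!time_after_ul(start, *slice_start))
		return false;		/* equal or earlier: nothing to update */
	*slice_start = start;
	return true;
}
```

With the strict comparison, a call where start already equals the stored value performs no store, which is the redundant write the commit removes.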

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-10-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:45:31 -07:00
Kemeng Shi 9c9f209d9d blk-throttle: remove repeat check of elapsed time
There is no need to check elapsed time from last upgrade for each node in
hierarchy. Move this check before the traversal, as throtl_can_upgrade does,
to remove the repeated check.

Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:45:11 -07:00
Kemeng Shi e3031d4c7d blk-throttle: remove incorrect comment for tg_last_low_overflow_time
Function tg_last_low_overflow_time is called with an intermediate node as
follows:
throtl_hierarchy_can_downgrade
  throtl_tg_can_downgrade
    tg_last_low_overflow_time

throtl_hierarchy_can_upgrade
  throtl_tg_can_upgrade
    tg_last_low_overflow_time

throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade will traverse
from leaf node to sub-root node and pass traversed intermediate node
to tg_last_low_overflow_time.

No such limit could be found from context and implementation of
tg_last_low_overflow_time, so remove this limit in comment.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:45:05 -07:00
Kemeng Shi 009df34171 blk-throttle: fix typo in comment of throtl_adjusted_limit
lapsed time -> elapsed time

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:59 -07:00
Kemeng Shi a4d508e333 blk-throttle: simplify low limit reached check in throtl_tg_can_upgrade
Commit c79892c557 ("blk-throttle: add upgrade logic for LIMIT_LOW
state") added upgrade logic for low limit and mentioned that
1. "To determine if a cgroup exceeds its limitation, we check if the cgroup
has pending request. Since cgroup is throttled according to the limit,
pending request means the cgroup reaches the limit."
2. "If a cgroup has limit set for both read and write, we consider the
combination of them for upgrade. The reason is read IO and write IO can
interfere with each other. If we do the upgrade based in one direction IO,
the other direction IO could be severely harmed."
Besides, we also determine that cgroup reaches low limit if low limit is 0,
see comment in throtl_tg_can_upgrade.

Collecting the information above, the design of the upgrade check is as follows:
1.The low limit is reached if limit is zero or io is already queued.
2.Cgroup will pass upgrade check if low limits of READ and WRITE are both
reached.

Simplify the check code described above to remove the repeated check and improve
readability. There is no functional change.

The detailed equivalence proof is as follows:
All replaced conditions that return true are as follows:
condition 1
  (!read_limit && !write_limit)
condition 2
  read_limit && sq->nr_queued[READ] &&
  (!write_limit || sq->nr_queued[WRITE])
condition 3
  write_limit && sq->nr_queued[WRITE] &&
  (!read_limit || sq->nr_queued[READ])

Transforming condition 2 as follows:
  (read_limit && sq->nr_queued[READ]) &&
  (!write_limit || sq->nr_queued[WRITE])
is equivalent to
  (read_limit && sq->nr_queued[READ]) &&
  (!write_limit || (write_limit && sq->nr_queued[WRITE]))
is equivalent to
condition 2.1
  (read_limit && sq->nr_queued[READ] &&
  !write_limit) ||
condition 2.2
  (read_limit && sq->nr_queued[READ] &&
  (write_limit && sq->nr_queued[WRITE]))

Transforming condition 3 as follows:
  write_limit && sq->nr_queued[WRITE] &&
  (!read_limit || sq->nr_queued[READ])
is equivalent to
  (write_limit && sq->nr_queued[WRITE]) &&
  (!read_limit || (read_limit && sq->nr_queued[READ]))
is equivalent to
condition 3.1
  ((write_limit && sq->nr_queued[WRITE]) &&
  !read_limit) ||
condition 3.2
  ((write_limit && sq->nr_queued[WRITE]) &&
  (read_limit && sq->nr_queued[READ]))

Condition 3.2 is the same as condition 2.2, so all the conditions that
return true are as follows:
  (!read_limit && !write_limit) (1)
  (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1)
  ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1)
  ((write_limit && sq->nr_queued[WRITE]) &&
  (read_limit && sq->nr_queued[READ])) (2.2)

As we can extract conditions "(a1 || a2) && (b1 || b2)" to:
a1 && b1
a1 && b2
a2 && b1
a2 && b2

Considering that:
a1 = !read_limit
a2 = read_limit && sq->nr_queued[READ]
b1 = !write_limit
b2 = write_limit && sq->nr_queued[WRITE]

We can pack replaced conditions to
  (!read_limit || (read_limit && sq->nr_queued[READ])) &&
  (!write_limit || (write_limit && sq->nr_queued[WRITE]))
which is equivalent to
  (!read_limit || sq->nr_queued[READ]) &&
  (!write_limit || sq->nr_queued[WRITE])
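
The equivalence argued above can also be checked exhaustively. The sketch below is a userspace illustration under simplifying assumptions: the limits are modelled as plain booleans and the queue counts as 0 or 1, and the names can_upgrade_old(), can_upgrade_new() and count_mismatches() are hypothetical, not kernel functions.

```c
#include <stdbool.h>

/* Old shape: the three "return true" conditions listed in the proof. */
static inline bool can_upgrade_old(bool read_limit, bool write_limit,
				   unsigned int nr_read, unsigned int nr_write)
{
	if (!read_limit && !write_limit)			/* condition 1 */
		return true;
	if (read_limit && nr_read && (!write_limit || nr_write))	/* condition 2 */
		return true;
	if (write_limit && nr_write && (!read_limit || nr_read))	/* condition 3 */
		return true;
	return false;
}

/* New shape: the packed expression the proof arrives at. */
static inline bool can_upgrade_new(bool read_limit, bool write_limit,
				   unsigned int nr_read, unsigned int nr_write)
{
	return (!read_limit || nr_read) && (!write_limit || nr_write);
}

/* Walk all 16 input combinations and count disagreements. */
static inline int count_mismatches(void)
{
	int mismatches = 0;

	for (int rl = 0; rl < 2; rl++)
		for (int wl = 0; wl < 2; wl++)
			for (unsigned int nr = 0; nr < 2; nr++)
				for (unsigned int nw = 0; nw < 2; nw++)
					if (can_upgrade_old(rl, wl, nr, nw) !=
					    can_upgrade_new(rl, wl, nr, nw))
						mismatches++;
	return mismatches;
}
```

Only the zero or non-zero state of each queue count matters to either expression, so testing the values 0 and 1 covers every case.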

Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:48 -07:00
Kemeng Shi 183daeb11d blk-throttle: correct calculation of wait time in tg_may_dispatch
In C, when executing "if (expression1 && expression2)" and
expression1 evaluates to false, expression2 is not evaluated.
For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) &&
tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is
limited, tg_within_bps_limit will return false and
tg_within_iops_limit will not be called. So even bps and iops are
both limited, iops_wait will not be calculated and is always zero.
So wait time of iops is always ignored.

Fix this by always calling tg_within_bps_limit and tg_within_iops_limit
to get wait time for both bps and iops.

Observed that:
1. Wait time in tg_within_iops_limit/tg_within_bps_limit need always
be stored as wait argument is always passed.
2. The stored wait time is zero if iops/bps is within the limit; otherwise a
non-zero value is stored.
Simplify tg_within_iops_limit/tg_within_bps_limit by removing the wait argument
and returning the wait time directly. The caller tg_may_dispatch checks whether
the wait time is zero to tell whether iops/bps is within its limit.
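
The short-circuit problem and the reworked shape can be sketched in userspace as follows. This is an illustrative model only: within_limit(), max_wait_short_circuit() and max_wait_fixed() are hypothetical stand-ins, and each limit check is reduced to "the wait this bio would need" rather than the real bps/iops arithmetic.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the old-style limit check: stores the wait
 * through the pointer and returns true when no wait is needed. */
static inline bool within_limit(unsigned long needed_wait, unsigned long *wait)
{
	*wait = needed_wait;
	return needed_wait == 0;
}

/* Old shape: && stops at the first false operand, so iops_wait keeps its
 * initial value of zero whenever the bps check fails. */
static inline unsigned long max_wait_short_circuit(unsigned long bps_needed,
						   unsigned long iops_needed)
{
	unsigned long bps_wait = 0, iops_wait = 0;

	if (within_limit(bps_needed, &bps_wait) &&
	    within_limit(iops_needed, &iops_wait))
		return 0;
	return bps_wait > iops_wait ? bps_wait : iops_wait;
}

/* New shape: both waits are always obtained, then the larger one is used. */
static inline unsigned long max_wait_fixed(unsigned long bps_needed,
					   unsigned long iops_needed)
{
	unsigned long bps_wait = bps_needed;	/* models the bps check's return value */
	unsigned long iops_wait = iops_needed;	/* models the iops check's return value */

	return bps_wait > iops_wait ? bps_wait : iops_wait;
}
```

When both limits are exceeded and the iops wait is the longer one, the first variant reports only the bps wait, while the second reports the larger of the two.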

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:42 -07:00
Kemeng Shi eb18479182 blk-throttle: ignore cgroup without io queued in blk_throtl_cancel_bios
Ignore cgroup without io queued in blk_throtl_cancel_bios for two
reasons:
1. Save CPU cycles spent trying to dispatch a cgroup that has no io queued.
2. Avoid an inconsistent state in which a cgroup is inserted into the service
queue without THROTL_TG_PENDING set, as tg_update_disptime will unconditionally
re-insert the cgroup into the service queue. If we are on the default hierarchy,
IO dispatched from a child in tg_dispatch_one_bio will trigger inserting the
cgroup into the service queue without erasing it first and ruin the tree.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:34 -07:00
Kemeng Shi 84aca0a7e0 blk-throttle: Fix that bps of child could exceed bps limited in parent
Consider situation as following (on the default hierarchy):
 HDD
  |
root (bps limit: 4k)
  |
child (bps limit :8k)
  |
fio bs=8k
Rate of fio is supposed to be 4k, but result is 8k. Reason is as
following:
Size of single IO from fio is larger than bytes allowed in one
throtl_slice in child, so IOs are always queued in child group first.
When queued IOs in child are dispatched to parent group, BIO_BPS_THROTTLED
is set and these IOs will not be limited by tg_within_bps_limit anymore.
Fix this by only setting BIO_BPS_THROTTLED when the bio has traversed the
entire tree.

This patch has no influence on setups that are not on the default
hierarchy, as each group is a single root group without a parent.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:27 -07:00
Kemeng Shi f56019aef3 blk-throttle: correct stale comment in throtl_pd_init
On the default hierarchy (cgroup2), the throttle interface files don't
exist in the root cgroup, so the ability to limit the whole system
by configuring the root group no longer exists. In general, cgroup
doesn't wanna be in the business of restricting resources at the
system level, so correct the stale comment that we can limit whole
system to we can only limit subtree.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221205115709.251489-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05 13:44:19 -07:00
Jens Axboe b147645148 Floppy patch for 6.2

Merge tag 'floppy-for-6.2' of https://github.com/evdenis/linux-floppy into for-6.2/block

Pull floppy fix from Denis:

"Floppy patch for 6.2

 The patch from Yuan Can fixes a memory leak in floppy init code.

 Signed-off-by: Denis Efremov <efremov@linux.com>"

* tag 'floppy-for-6.2' of https://github.com/evdenis/linux-floppy:
  floppy: Fix memory leak in do_floppy_init()
2022-12-04 08:54:19 -07:00
Yuan Can f8ace2e304 floppy: Fix memory leak in do_floppy_init()
A memory leak was reported when floppy_alloc_disk() failed in
do_floppy_init().

unreferenced object 0xffff888115ed25a0 (size 8):
  comm "modprobe", pid 727, jiffies 4295051278 (age 25.529s)
  hex dump (first 8 bytes):
    00 ac 67 5b 81 88 ff ff                          ..g[....
  backtrace:
    [<000000007f457abb>] __kmalloc_node+0x4c/0xc0
    [<00000000a87bfa9e>] blk_mq_realloc_tag_set_tags.part.0+0x6f/0x180
    [<000000006f02e8b1>] blk_mq_alloc_tag_set+0x573/0x1130
    [<0000000066007fd7>] 0xffffffffc06b8b08
    [<0000000081f5ac40>] do_one_initcall+0xd0/0x4f0
    [<00000000e26d04ee>] do_init_module+0x1a4/0x680
    [<000000001bb22407>] load_module+0x6249/0x7110
    [<00000000ad31ac4d>] __do_sys_finit_module+0x140/0x200
    [<000000007bddca46>] do_syscall_64+0x35/0x80
    [<00000000b5afec39>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
unreferenced object 0xffff88810fc30540 (size 32):
  comm "modprobe", pid 727, jiffies 4295051278 (age 25.529s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000007f457abb>] __kmalloc_node+0x4c/0xc0
    [<000000006b91eab4>] blk_mq_alloc_tag_set+0x393/0x1130
    [<0000000066007fd7>] 0xffffffffc06b8b08
    [<0000000081f5ac40>] do_one_initcall+0xd0/0x4f0
    [<00000000e26d04ee>] do_init_module+0x1a4/0x680
    [<000000001bb22407>] load_module+0x6249/0x7110
    [<00000000ad31ac4d>] __do_sys_finit_module+0x140/0x200
    [<000000007bddca46>] do_syscall_64+0x35/0x80
    [<00000000b5afec39>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

If floppy_alloc_disk() fails, the disks of the current drive will not be set,
thus the latest allocated set->tag cannot be freed in the error handling
path. A simple call graph is shown below:

 floppy_module_init()
   floppy_init()
     do_floppy_init()
       for (drive = 0; drive < N_DRIVE; drive++)
         blk_mq_alloc_tag_set()
           blk_mq_alloc_tag_set_tags()
             blk_mq_realloc_tag_set_tags() # set->tag allocated
         floppy_alloc_disk()
           blk_mq_alloc_disk() # error occurred, disks failed to be allocated

       ->out_put_disk:
       for (drive = 0; drive < N_DRIVE; drive++)
         if (!disks[drive][0]) # the last disks entry is not set, so the loop breaks
           break;
         blk_mq_free_tag_set() # the latest allocated set->tag leaked

Fix this problem by freeing the set->tag of the current drive before jumping
to the error handling path.
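
The unwind logic in the call graph can be modelled in userspace as below. This is a simplified sketch, not the driver code: leaked_tag_sets() is a hypothetical function, the boolean arrays stand in for the per-drive tag sets and disks, and N_DRIVE is fixed at 4 for the illustration.

```c
#include <stdbool.h>

#define N_DRIVE 4	/* illustrative value for this model */

/* Returns how many tag sets remain allocated after init fails at drive
 * fail_at. free_before_jump selects the fixed behaviour. A fail_at outside
 * 0..N_DRIVE-1 models a successful init and returns 0. */
static inline int leaked_tag_sets(int fail_at, bool free_before_jump)
{
	bool tag_set_live[N_DRIVE] = { false };
	bool disk_set[N_DRIVE] = { false };
	int drive, leaked = 0;

	for (drive = 0; drive < N_DRIVE; drive++) {
		tag_set_live[drive] = true;	/* models blk_mq_alloc_tag_set() */
		if (drive == fail_at) {		/* models floppy_alloc_disk() failing */
			if (free_before_jump)
				tag_set_live[drive] = false;	/* the fix */
			goto out_put_disk;
		}
		disk_set[drive] = true;
	}
	return 0;	/* no failure, nothing to unwind in this model */

out_put_disk:
	for (drive = 0; drive < N_DRIVE; drive++) {
		if (!disk_set[drive])
			break;			/* stops at the drive that failed */
		tag_set_live[drive] = false;	/* models blk_mq_free_tag_set() */
	}
	for (drive = 0; drive < N_DRIVE; drive++)
		leaked += tag_set_live[drive];
	return leaked;
}
```

Because the cleanup loop keys off the disks entry, the tag set of the failing drive is never reached unless it is released before the jump.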

Cc: stable@vger.kernel.org
Fixes: 302cfee150 ("floppy: use a separate gendisk for each media format")
Signed-off-by: Yuan Can <yuancan@huawei.com>
[efremov: added stable list, changed title]
Signed-off-by: Denis Efremov <efremov@linux.com>
2022-12-04 18:03:41 +04:00
Greg Kroah-Hartman 85d6ce58e4 block: remove devnode callback from struct block_device_operations
With the removal of the pktcdvd driver, there are no in-kernel users of
the devnode callback in struct block_device_operations, so it can be
safely removed.  If it is needed for new block drivers in the future, it
can be brought back.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221203140747.1942969-1-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-03 09:19:48 -07:00
Jens Axboe 368c7f1f8a Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
Pull MD fixes from Song:

"This contains code cleanup by Christoph."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fold unbind_rdev_from_array into md_kick_rdev_from_array
  md: mark md_kick_rdev_from_array static
  md: remove lock_bdev / unlock_bdev
2022-12-02 12:36:37 -07:00
Greg Kroah-Hartman f40eb99897 pktcdvd: remove driver.
Way back in 2016 in commit 5a8b187c61 ("pktcdvd: mark as unmaintained
and deprecated") this driver was marked as "will be removed soon".  5
years seems long enough to have it stick around after that, so finally
remove the thing now.

Reported-by: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Maier <balagi@justmail.de>
Cc: Peter Osterlund <petero2@telia.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221202182758.1339039-1-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-02 12:35:30 -07:00
Christoph Hellwig b5c1acf012 md: fold unbind_rdev_from_array into md_kick_rdev_from_array
unbind_rdev_from_array is only called from md_kick_rdev_from_array, so
merge it into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2022-12-02 11:21:01 -08:00
Christoph Hellwig d57d9d6965 md: mark md_kick_rdev_from_array static
md_kick_rdev_from_array is only used in md.c, so unexport it and mark
the symbol static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2022-12-02 11:21:01 -08:00
Christoph Hellwig fb541ca4c3 md: remove lock_bdev / unlock_bdev
These wrappers for blkdev_get / blkdev_put just horribly confuse the
code with their odd naming.  Remove them and improve the error unwinding
in md_import_device with the now folded code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2022-12-02 11:21:01 -08:00
Yang Li 1d6df9d352 blk-cgroup: Fix some kernel-doc comments
Change the description of @gendisk to @disk in blkcg_schedule_throttle()
to clear the warnings below:

block/blk-cgroup.c:1850: warning: Function parameter or member 'disk' not described in 'blkcg_schedule_throttle'
block/blk-cgroup.c:1850: warning: Excess function parameter 'gendisk' description in 'blkcg_schedule_throttle'

Fixes: de185b56e8 ("blk-cgroup: pass a gendisk to blkcg_schedule_throttle")
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3338
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221202011713.14834-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 18:22:12 -07:00
Shin'ichiro Kawasaki d3a5738849 null_blk: support read-only and offline zone conditions
In zoned mode, zones with write pointers can have the conditions
"read-only" or "offline". In the read-only condition, zones cannot be
written. In the offline condition, zones can be neither written nor
read. These conditions are intended for zones with media failures, so
it is difficult to set them on zones of real devices.

To test handling of zones in these conditions, add a feature to null_blk
to set up zones in the read-only or offline condition. Add new
configuration attributes "zone_readonly" and "zone_offline". Write a
sector to an attribute file to select the target zone and set its
condition. For example:

   echo 0 > nullb1/zone_readonly
   echo 524288 > nullb1/zone_offline

When the specified zone is already in the read-only or offline
condition, the normal empty condition is restored to it. These condition
changes can be made only after the null_blk device is powered on, since
the status area of each zone is not allocated before power-on.
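
As a rough illustration of how the sector values in the example relate
to zones, here is a minimal shell sketch. It assumes 512-byte sectors
and a zone size of 256 MiB; under those assumptions, 524288 is the first
sector of zone 1. The helper names zone_start_sector and
set_zone_condition are hypothetical and not part of null_blk. The device
directory is passed in as an argument, so nothing here depends on a
particular configfs mount point.

```shell
#!/bin/sh

# Hypothetical helper: print the first 512-byte sector of a zone.
#   $1 = zone index (0-based)
#   $2 = zone size in MiB (assumed value, check your device setup)
zone_start_sector() {
    # 1 MiB holds 2048 sectors of 512 bytes.
    echo $(( $1 * $2 * 2048 ))
}

# Hypothetical helper: write a sector number to a zone condition attribute.
#   $1 = device directory (for example, the nullb1 directory)
#   $2 = attribute name: zone_readonly or zone_offline
#   $3 = any sector inside the target zone
set_zone_condition() {
    echo "$3" > "$1/$2"
}
```

With these helpers, the second command from the example could be
written as:

   set_zone_condition nullb1 zone_offline "$(zone_start_sector 1 256)"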

Also improve zone condition checks to inhibit all commands for zones in
the offline condition. In the same manner, inhibit write and zone
management commands for zones in the read-only condition.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20221201061036.2342206-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 14:49:48 -07:00
Christoph Böhmwalder 677b367275 drbd: add context parameter to expect() macro
Originally-from: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20221201110349.1282687-6-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 11:33:49 -07:00
Christoph Böhmwalder e3fa02d7d4 drbd: introduce drbd_ratelimit()
Use call site specific ratelimit instead of one single static global.
Also ratelimit ASSERTION messages generated by expect().

Originally-from: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20221201110349.1282687-5-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 11:33:49 -07:00
Christoph Böhmwalder aa03469597 drbd: introduce dynamic debug
Incorporate as many out-of-tree changes as possible without changing the
genl API.

Over the years, we restructured this several times, and also changed the
log format.

One breaking change is that DRBD 9 gained "implicit options", like a
connection name. This cannot be replayed here without changing the API,
so save it for later.

Originally-from: Andreas Gruenbacher <agruen@linbit.com>
Originally-from: Philipp Reisner <philipp.reisner@linbit.com>
Originally-from: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20221201110349.1282687-4-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 11:33:49 -07:00
Christoph Böhmwalder 136160c173 drbd: split polymorph printk to its own file
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20221201110349.1282687-3-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 11:33:49 -07:00
Christoph Böhmwalder c3f8974198 drbd: unify how failed assertions are logged
Unify how failed assertions from D_ASSERT() and expect() are logged.

Originally-from: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20221201110349.1282687-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01 11:33:49 -07:00