Commit Graph

1044642 Commits

Author SHA1 Message Date
Hannes Reinecke 2953b30b1d nvmet: register discovery subsystem as 'current'
Register the discovery subsystem as the 'current' discovery subsystem,
and add a new discovery log page entry for it.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 08:06:04 +02:00
Hannes Reinecke 598e75934c nvmet: switch check for subsystem type
Invert the check for discovery subsystem type to allow for additional
discovery subsystem types.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 08:03:30 +02:00
Hannes Reinecke 785d584c30 nvme: add new discovery log page entry definitions
TP8014 adds a new SUBTYPE value and a new field EFLAGS for the
discovery log page entry.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 08:03:19 +02:00
Michael Schmitz d28e4dff08 block: ataflop: more blk-mq refactoring fixes
As it turns out, my earlier patch in commit 86d46fdaa1 (block:
ataflop: fix breakage introduced at blk-mq refactoring) was
incomplete. This patch fixes any remaining issues found during
more testing and code review.

Requests exceeding 4 k are handled in 4k segments but
__blk_mq_end_request() is never called on these (still
sectors outstanding on the request). With redo_fd_request()
removed, there is no provision to kick off processing of the
next segment, causing requests exceeding 4k to hang. (By
setting /sys/block/fd0/queue/max_sectors_k <= 4 as workaround,
this behaviour can be avoided).

Instead of reintroducing redo_fd_request(), requeue the remainder
of the request by calling blk_mq_requeue_request() on incomplete
requests (i.e. when blk_update_request() still returns true), and
rely on the block layer to queue the residual as new request.

Both error handling and formatting needs to release the
ST-DMA lock, so call finish_fdc() on these (this was previously
handled by redo_fd_request()). finish_fdc() may be called
legitimately without the ST-DMA lock held - make sure we only
release the lock if we actually held it. In a similar way,
early exit due to errors in ataflop_queue_rq() must release
the lock.

After minor errors, fd_error sets up to recalibrate the drive
but never re-runs the current operation (another task handled by
redo_fd_request() before). Call do_fd_action() to get the next
steps (seek, retry read/write) underway.

Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Fixes: 6ec3938cff (ataflop: convert to blk-mq)
CC: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20211024002013.9332-1-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 07:54:32 -06:00
Christoph Hellwig 47e9624616 block: remove support for cryptoloop and the xor transfer
Support for cyrptoloop has been officially marked broken and deprecated
in favor of dm-crypt (which supports the same broken algorithms if
needed) in Linux 2.6.4 (released in March 2004), and support for it has
been entirely removed from losetup in util-linux 2.23 (released in April
2013).  The XOR transfer has never been more than a toy to demonstrate
the transfer in the bad old times of crypto export restrictions.
Remove them as they have some nasty interactions with loop device life
times due to the iteration over all loop devices in
loop_unregister_transfer.

Suggested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019075639.2333969-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:34:58 -06:00
Luis Chamberlain 83b863f4a3 mtd: add add_disk() error handling
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-10-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain 2e9e31bea0 rnbd: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-9-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain 66638f163a um/drivers/ubd_kern: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

ubd_disk_register() never returned an error, so just fix
that now and let the caller handle the error condition.

Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-8-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain 21fd880d3d m68k/emu/nfblock: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-7-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain 293a7c5288 xen-blkfront: add error handling support for add_disk()
We never checked for errors on device_add_disk() as this function
returned void. Now that this is fixed, use the shiny new error
handling. The function xlvbd_alloc_gendisk() typically does the
unwinding on error on allocating the disk and creating the tag,
but since all that error handling was stuffed inside
xlvbd_alloc_gendisk() we must repeat the tag free'ing as well.

We set the info->rq to NULL to ensure blkif_free() doesn't crash
on blk_mq_stop_hw_queues() on device_add_disk() error as the queue
will be long gone by then.

Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-6-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain 2961c3bbca bcache: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

This driver doesn't do any unwinding with blk_cleanup_disk()
even on errors after add_disk() and so we follow that
tradition.

Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-5-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Luis Chamberlain e7089f65dd dm: add add_disk() error handling
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

There are two calls to dm_setup_md_queue() which can fail then,
one on dm_early_create() and we can easily see that the error path
there calls dm_destroy in the error path. The other use case is on
the ioctl table_load case. If that fails userspace needs to call
the DM_DEV_REMOVE_CMD to cleanup the state - similar to any other
failure.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211015233028.2167651-4-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 09:00:56 -06:00
Ye Guojin ff06ed7e81 block: aoe: fixup coccinelle warnings
coccicheck complains about the use of snprintf() in sysfs show
functions:
WARNING  use scnprintf or sprintf

Use sysfs_emit instead of scnprintf or sprintf makes more sense.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn>
Link: https://lore.kernel.org/r/20211021064931.1047687-1-ye.guojin@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:54:15 -06:00
Jens Axboe cbab6ae0d0 nvme updates for Linux 5.16
- fix a multipath partition scanning deadlock (Hannes Reinecke)
  - generate uevent once a multipath namespace is operational again
    (Hannes Reinecke)
  - support unique discovery controller NQNs (Hannes Reinecke)
  - fix use-after-free when a port is removed (Israel Rukshin)
  - clear shadow doorbell memory on resets (Keith Busch)
  - use struct_size (Len Baker)
  - add error handling support for add_disk (Luis Chamberlain)
  - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
  - use a few more symbolic names (Max Gurtovoy)
  - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
  - add support for ->map_queues on FC (Saurav Kashyap)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmFxYs4LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNzGBAAqGhOE7aTrrvsTkx/lc0oZrcS/WxT5zMj1KC7+C8O
 FT4rFDvLGa4J8PBz+l/u/Dmysw6T70HlDt13WqEy+8l4ckOolAWwoLIqmqaLJM6l
 7LA8S0kXlaJr2Wyj1RHn3YatjPhBhBtSxcSI+VwvuMobibUPtTzUEaUYY80+DyGI
 bWkY1+CzgSXZwhwe72Nf7I5rvkhEvS+pTLsHP70h+AsMlDljUBCNgD9SkvNRciic
 FFJ90NXXGnmvl0mZiZJ4sfb55r8tqGBvphw+vAkv/Gl9aOyVKmD+9nTAHiFXknPT
 LAlTidebE09cRVZERg8oooUwvfmFNTRQg/nD+4q9camWgmDqiQtyLrSFvME+ieL0
 Cd3zOR7KCRTMhfK5AhdKiXGZ3zu7RznBZ9zNciqZEONob3BxbSs7NagariCVXGvQ
 KxIA4EE/3nrPmiosXp1/VMVceCJBGJw8wh8TyNX1tkffZR4G+jNihUhT1k2TQlyE
 KqX9ibN/J0yWWQ/EWqI8r32ox6hIxKjwbtJLgA+wqe3RqF8DjEg6frmvl7c9h4rs
 aI62XgdF+mMFtDQaYkXtTP63oYiWLQeX8Hkv3Vig2r42U36vlYlhUpIU2Ee1FQZ4
 e55pnVCxLQsQBAvVn5vuKd1ivNRynR1NuSeF3NrAtWK33kiziSVTFYFxJiJG8+4Y
 1Os=
 =D1Jt
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme into for-5.16/drivers

Pull NVMe updates from Christoph:

"nvme updates for Linux 5.16

 - fix a multipath partition scanning deadlock (Hannes Reinecke)
 - generate uevent once a multipath namespace is operational again
   (Hannes Reinecke)
 - support unique discovery controller NQNs (Hannes Reinecke)
 - fix use-after-free when a port is removed (Israel Rukshin)
 - clear shadow doorbell memory on resets (Keith Busch)
 - use struct_size (Len Baker)
 - add error handling support for add_disk (Luis Chamberlain)
 - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
 - use a few more symbolic names (Max Gurtovoy)
 - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
 - add support for ->map_queues on FC (Saurav Kashyap)"

* tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme: (23 commits)
  nvmet: use struct_size over open coded arithmetic
  nvme: drop scan_lock and always kick requeue list when removing namespaces
  nvme-pci: clear shadow doorbell memory on resets
  nvme-rdma: fix error code in nvme_rdma_setup_ctrl
  nvme-multipath: add error handling support for add_disk()
  nvmet: use macro definitions for setting cmic value
  nvmet: use macro definition for setting nmic value
  nvme: display correct subsystem NQN
  nvme: Add connect option 'discovery'
  nvme: expose subsystem type in sysfs attribute 'subsystype'
  nvmet: set 'CNTRLTYPE' in the identify controller data
  nvmet: add nvmet_is_disc_subsys() helper
  nvme: add CNTRLTYPE definitions for 'identify controller'
  nvmet: make discovery NQN configurable
  nvmet-rdma: implement get_max_queue_size controller op
  nvmet: add get_max_queue_size op for controllers
  nvme-rdma: limit the maximal queue size for RDMA controllers
  nvmet-tcp: fix use-after-free when a port is removed
  nvmet-rdma: fix use-after-free when a port is removed
  nvmet: fix use-after-free when a port is removed
  ...
2021-10-21 08:25:54 -06:00
Len Baker 117d5b6d00 nvmet: use struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

In this case this is not actually dynamic size: all the operands
involved in the calculation are constant values. However it is better to
refactor this anyway, just to keep the open-coded math idiom out of
code.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kmalloc() function.

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:23:30 +02:00
Hannes Reinecke 2b81a5f015 nvme: drop scan_lock and always kick requeue list when removing namespaces
When reading the partition table on initial scan hits an I/O error the
I/O will hang with the scan_mutex held:

[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420

and trying to delete the controller will deadlock as it tries to grab
the scan mutex:

[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]

As we're now properly ordering the namespace list there is no need to
hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore.
And we always need to kick the requeue list as the path will be marked
as unusable and I/O will be requeued _without_ a current path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:23:30 +02:00
Keith Busch 58847f12fe nvme-pci: clear shadow doorbell memory on resets
The host memory doorbell and event buffers need to be initialized on
each reset so the driver doesn't observe stale values from the previous
instantiation.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: John Levon <john.levon@nutanix.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:23:29 +02:00
Max Gurtovoy 0974812200 nvme-rdma: fix error code in nvme_rdma_setup_ctrl
In case that icdoff is not zero or mandatory keyed sgls are not
supported by the NVMe/RDMA target, we'll go to error flow but we'll
return 0 to the caller. Fix it by returning an appropriate error code.

Fixes: c66e2998c8 ("nvme-rdma: centralize controller setup sequence")
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:03 +02:00
Luis Chamberlain 11384580e3 nvme-multipath: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Since we now can tell for sure when a disk was added, move
setting the bit NVME_NSHEAD_DISK_LIVE only when we did
add the disk successfully.

Nothing to do here as the cleanup is done elsewhere. We take
care and use test_and_set_bit() because it is protects against
two nvme paths simultaneously calling device_add_disk() on the
same namespace head.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:03 +02:00
Max Gurtovoy d56ae18f06 nvmet: use macro definitions for setting cmic value
This makes the code more readable.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:03 +02:00
Max Gurtovoy 571b5444d1 nvmet: use macro definition for setting nmic value
This makes the code more readable.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke e5ea42faa7 nvme: display correct subsystem NQN
With discovery controllers supporting unique subsystem NQNs the
actual subsystem NQN might be different from that one passed in
via the connect args. So add a helper to display the resulting
subsystem NQN.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke 20e8b689c9 nvme: Add connect option 'discovery'
Add a connect option 'discovery' to specify that the connection
should be made to a discovery controller, not a normal I/O controller.
With discovery controllers supporting unique subsystem NQNs we
cannot easily distinguish by the subsystem NQN if this should be
a discovery connection, but we need this information to blank out
options not supported by discovery controllers.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke 954ae16681 nvme: expose subsystem type in sysfs attribute 'subsystype'
With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke d3aef70124 nvmet: set 'CNTRLTYPE' in the identify controller data
Set the correct 'CNTRLTYPE' field in the identify controller data.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke a294711ed5 nvmet: add nvmet_is_disc_subsys() helper
Add a helper function to determine if a given subsystem is a discovery
subsystem.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke e15a8a9755 nvme: add CNTRLTYPE definitions for 'identify controller'
Update the 'identify controller' structure to define the newly added
CNTRLTYPE field.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Hannes Reinecke 626851e922 nvmet: make discovery NQN configurable
TPAR8013 allows for unique discovery NQNs, so make the discovery
controller NQN configurable by exposing a subsys attribute
'discovery_nqn'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Max Gurtovoy c7d792f9b8 nvmet-rdma: implement get_max_queue_size controller op
Limit the maximal queue size for RDMA controllers. Today, the target
reports a limit of 1024 and this limit isn't valid for some of the RDMA
based controllers. For now, limit RDMA transport to 128 entries (the
max queue depth configured for Linux NVMe/RDMA host).

Future general solution should use RDMA/core API to calculate this size
according to device capabilities and number of WRs needed per NVMe IO
request.

Reported-by: Mark Ruijter <mruijter@primelogic.nl>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Max Gurtovoy 6d1555cc41 nvmet: add get_max_queue_size op for controllers
Some transports, such as RDMA, would like to set the queue size
according to device/port/ctrl characteristics. Add a new nvmet transport
op that is called during ctrl initialization. This will not effect
transports that don't implement this option.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Max Gurtovoy 44c3c6257e nvme-rdma: limit the maximal queue size for RDMA controllers
Corrent limit of 1024 isn't valid for some of the RDMA based ctrls. In
case the target expose a cap of larger amount of entries (e.g. 1024),
the initiator may fail to create a QP with this size. Thus limit to a
value that works for all RDMA adapters.

Future general solution should use RDMA/core API to calculate this size
according to device capabilities and number of WRs needed per NVMe IO
request.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:01 +02:00
Israel Rukshin 2351ead99c nvmet-tcp: fix use-after-free when a port is removed
When removing a port, all its controllers are being removed, but there
are queues on the port that doesn't belong to any controller (during
connection time). This causes a use-after-free bug for any command
that dereferences req->port (like in nvmet_alloc_ctrl). Those queues
should be destroyed before freeing the port via configfs. Destroy
the remaining queues after the accept_work was cancelled guarantees
that no new queue will be created.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Israel Rukshin fcf73a804c nvmet-rdma: fix use-after-free when a port is removed
When removing a port, all its controllers are being removed, but there
are queues on the port that doesn't belong to any controller (during
connection time). This causes a use-after-free bug for any command
that dereferences req->port (like in nvmet_alloc_ctrl). Those queues
should be destroyed before freeing the port via configfs. Destroy the
remaining queues after the RDMA-CM was destroyed guarantees that no
new queue will be created.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Israel Rukshin e3e19dcc4c nvmet: fix use-after-free when a port is removed
When a port is removed through configfs, any connected controllers
are starting teardown flow asynchronously and can still send commands.
This causes a use-after-free bug for any command that dereferences
req->port (like in nvmet_parse_io_cmd).

To fix this, wait for all the teardown scheduled works to complete
(like release_work at rdma/tcp drivers). This ensures there are no
active controllers when the port is eventually removed.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Saurav Kashyap 2b2af50ae8 qla2xxx: add ->map_queues support for nvme
Implement ->map queues and use the block layer blk_mq_pci_map_queues
helper for mapping queues to CPUs.

With this mapping minimum 10%+ increase in performance is noticed.

Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Saurav Kashyap 01d838164b nvme-fc: add support for ->map_queues
NVMe FC don't have support for map queues, unlike the PCI, RDMA and TCP
transports.  Add a ->map_queues callout for the LLDDs to provide such
functionality.

Signed-off-by: Saurav Kashyap <skashyap@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Hannes Reinecke f6f09c15a7 nvme: generate uevent once a multipath namespace is operational again
When fast_io_fail_tmo is set I/O will be aborted while recovery is
still ongoing. This causes MD to set the namespace to failed, and
no futher I/O will be submitted to that namespace.

However, once the recovery succeeds and the namespace becomes
operational again the NVMe subsystem doesn't send a notification,
so MD cannot automatically reinstate operation and requires
manual interaction.

This patch will send a KOBJ_CHANGE uevent per multipathed namespace
once the underlying controller transitions to LIVE, allowing an automatic
MD reassembly with these udev rules:

/etc/udev/rules.d/65-md-auto-re-add.rules:
SUBSYSTEM!="block", GOTO="md_end"

ACTION!="change", GOTO="md_end"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end"
PROGRAM="/sbin/md_raid_auto_readd.sh $devnode"
LABEL="md_end"

/sbin/md_raid_auto_readd.sh:

MDADM=/sbin/mdadm
DEVNAME=$1

export $(${MDADM} --examine --export ${DEVNAME})

if [ -z "${MD_UUID}" ]; then
    exit 1
fi

UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
MD_DEVNAME=${UUID_LINK##*/}
export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
if [ -z "${MD_METADATA}" ] ; then
    exit 1
fi
if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
    echo "${MD_DEVNAME}: array not degraded, nothing to do"
    exit 0
fi
MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
if [ ${MD_STATE} != "clean" ] ; then
    echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
    exit 1
fi
MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
if [ ${!MD_VARNAME} = "spare" ] ; then
    ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
fi

Changes to v2:
- Add udev rules example to description
Changes to v1:
- use disk_uevent() as suggested by hch

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:00 +02:00
Christoph Hellwig 39fa7a9555 bcache: remove bch_crc64_update
bch_crc64_update is an entirely pointless wrapper around crc64_be.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-9-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Christoph Hellwig 00387bd21d bcache: use bvec_kmap_local in bch_data_verify
Using local kmaps slightly reduces the chances to stray writes, and
the bvec interface cleans up the code a little bit.

Also switch from page_address to bvec_kmap_local for cbv to be on the
safe side and to avoid pointlessly poking into bvec internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-8-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Christoph Hellwig 0f5cd7815f bcache: remove the backing_dev_name field from struct cached_dev
Just use the %pg format specifier to print the name directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Christoph Hellwig 7e84c21507 bcache: remove the cache_dev_name field from struct cache
Just use the %pg format specifier to print the name directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Lin Feng 0259d4498b bcache: move calc_cached_dev_sectors to proper place on backing device detach
Calculation of cache_set's cached sectors is done by travelling
cached_devs list as shown below:

static void calc_cached_dev_sectors(struct cache_set *c)
{
...
        list_for_each_entry(dc, &c->cached_devs, list)
                sectors += bdev_sectors(dc->bdev);

        c->cached_dev_sectors = sectors;
}

But cached_dev won't be unlinked from c->cached_devs list until we call
following list_move(&dc->list, &uncached_devices),
so previous fix in 'commit 46010141da
("bcache: recal cached_dev_sectors on detach")' is wrong, now we move
it to its right place.

Signed-off-by: Lin Feng <linf@wangsu.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Chao Yu d55f7cb2e5 bcache: fix error info in register_bcache()
In register_bcache(), there are several cases we didn't set
correct error info (return value and/or error message):
- if kzalloc() fails, it needs to return ENOMEM and print
"cannot allocate memory";
- if register_cache() fails, it's better to propagate its
return value rather than using default EINVAL.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Coly Li 0a2b3e3635 bcache: reserve never used bits from bkey.high
There sre 3 bits in member high of struct bkey are never used, and no
plan to support them in future,
- HEADER_SIZE, start at bit 58, length 2 bits
- KEY_PINNED,  start at bit 55, length 1 bit

No any kernel code, or user space tool references or accesses the three
bits. Therefore it is possible and feasible to reserve the valuable bits
from bkey.high. They can be used in future for other purpose.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Ding Senjie a307e2abfc md: bcache: Fix spelling of 'acquire'
acqurie -> acquire

Signed-off-by: Ding Senjie <dingsenjie@yulong.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20211020143812.6403-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:40:54 -06:00
Stefan Haberland a8e5d491df s390/dasd: fix possibly missed path verification
__dasd_device_check_path_events() calls the discipline path event handler.
This handler can leave the 'to be verified pathmask' populated for an
additional verification.

There is a race window where the worker has finished before
dasd_path_clear_all_verify() is called which resets the tbvpm.

Due to this there could be outstanding path verifications missed.

Fix by clearing the pathmasks before calling the handler and add them
again in case of an error.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20211020115124.1735254-8-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:10:42 -06:00
Stefan Haberland 9dffede011 s390/dasd: fix missing path conf_data after failed allocation
dasd_eckd_path_available_action() does a memory allocation to store
the per path configuration data permanently.
In the unlikely case that this allocation fails there is no conf_data
stored for the corresponding path.

This is OK since this is not necessary for an operational path but some
features like control unit initiated reconfiguration (CUIR) do not work.

To fix this add the path to the 'to be verified pathmask' again and
schedule the handler again.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20211020115124.1735254-7-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:10:42 -06:00
Stefan Haberland 542e30ce8e s390/dasd: summarize dasd configuration data in a separate structure
Summarize the dasd configuration data in a separate structure so that
functions that need temporary config data do not need to allocate the
whole eckd_private structure.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20211020115124.1735254-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:10:42 -06:00
Stefan Haberland 74e2f21102 s390/dasd: move dasd_eckd_read_fc_security
dasd_eckd_read_conf is called multiple times during device setup but the
fc_security feature needs to be read only once. So move it into the calling
function.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20211020115124.1735254-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:10:42 -06:00
Stefan Haberland 23596961b4 s390/dasd: split up dasd_eckd_read_conf
Move the cabling check out of dasd_eckd_read_conf and split it up into
separate functions to improve readability and re-use functions.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20211020115124.1735254-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:10:41 -06:00