The nvme_put_ctrl() is implemented earlier as an inline function so
this declaration isn't required.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, a namespace io_opt queue limit is set by default to the
physical sector size of the namespace and to the the write optimal
size (NOWS) when the namespace reports optimal IO sizes. This causes
problems with block limits stacking in blk_stack_limits() when a
namespace block device is combined with an HDD which generally do not
report any optimal transfer size (io_opt limit is 0). The code:
/* Optimal I/O a multiple of the physical block size? */
if (t->io_opt & (t->physical_block_size - 1)) {
t->io_opt = 0;
t->misaligned = 1;
ret = -1;
}
in blk_stack_limits() results in an error return for this function when
the combined devices have different but compatible physical sector
sizes (e.g. 512B sector SSD with 4KB sector disks).
Fix this by not setting the optimal IO size queue limit if the namespace
does not report an optimal write size value.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The nvme-fc devloss_tmo is computed as the min of either the
ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
value. So correct this by printing the min value instead.
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We can signal the stack that this is not the last page coming and the
stack can build a larger tso segment, so go ahead and use it.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Improve code readability by defining the specification's constants that
the driver is using when decoding identification payloads.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme-multipath already uses the gendisk private data, not need to
also set up the request_queue queuedata and use it in one place only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Today, nvme-tcp automatically schedules a send request
to a workqueue context, which is 1 more than we'd need
in case the socket buffer is wide open.
However, because we have async send activity (as a result
of r2t, or write_space callbacks), we need to synchronize
sends from possibly multiple contexts (ideally all running
on the same cpu though).
Thus, we only try to send directly from queue_rq in cases:
1. the send_list is empty
2. we can send it synchronously (i.e. not from the RX path)
3. we run on the same cpu as the queue->io_cpu to avoid
contention on the send operation.
Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the user runs polled I/O, we shouldn't have to trigger
the workqueue to generate the receive work upon the .data_ready
upcall. This prevents a redundant context switch when the
application is already polling for completions.
Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
data_ready may be invoked from send context or from
softirq, so need bh locking for that.
Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 147b27e4bd ("nvme-pci: allocate device queues storage
space at probe"), nvme_alloc_queue does not alloc the nvme queues
itself anymore.
If the write/poll_queues module parameters are changed at runtime to
values larger than the number of allocated queues in nvme_probe,
nvme_alloc_queue will access unallocated memory.
Add a new nr_allocated_queues member to struct nvme_dev to record how
many queues were alloctated in nvme_probe to avoid using more than the
allocated queues after a reset following a change to the
write/poll_queues module parameters.
Also add nr_write_queues and nr_poll_queues members to allow refreshing
the number of write and poll queues based on a change to the module
parameters when resetting the controller.
Fixes: 147b27e4bd ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
[hch: add nvme_max_io_queues, update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme driver does not have enough tags to wrap the queue, and blk-mq
will no longer call commit_rqs() when there are no new submissions to
notify.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The completion queue entry is not volatile once the phase is confirmed.
Remove the volatile keywords and check the phase using the appropriate
READ_ONCE() accessor, allowing the compiler to optimize the remaining
completion path.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a passthrough command causes the namespace inventory or capabilities
to change, flush the scan work that handles these changes so the driver
synchronizes with the user command's effects before returning the result
to user space.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a common label for putting the nshead if needed and only convert
nvme status codes for the one case where it actually is needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_ARCH_NO_SG_CHAIN is set, op->sgl[0] cannot be dereferenced,
as gcc-10 now points out:
drivers/nvme/host/fc.c: In function 'nvme_fc_init_request':
drivers/nvme/host/fc.c:1774:29: warning: array subscript 0 is outside the bounds of an interior zero-length array 'struct scatterlist[0]' [-Wzero-length-bounds]
1774 | op->op.fcp_req.first_sgl = &op->sgl[0];
| ^~~~~~~~~~~
drivers/nvme/host/fc.c:98:21: note: while referencing 'sgl'
98 | struct scatterlist sgl[NVME_INLINE_SG_CNT];
| ^~~
I don't know if this is a legitimate warning or a false-positive.
If this is just a false alarm, the warning is easily suppressed
by interpreting the array as a pointer.
Fixes: b1ae1a2389 ("nvme-fc: Avoid preallocating big SGL for data")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The stream parameters indicating optimal io settings were just getting
overwritten later. Rearrange the settings so the streams parameters can
be preserved if provided.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The stream parameters are based on the currently formatted logical block
size. Recheck these parameters on namespace revalidation so the
registered constraints will be accurate.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the quirked chunk_sectors setting to the same location as noiob so
one place registers this setting. And since the noiob value is only used
locally, remove the member from struct nvme_ns.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the namespace identifiers have changed, skip updating the disk
information, as that will register parameters from a mismatched
namespace.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queues' backing device info capabilities don't change with each
namespace revalidation. Set it only when each path's request_queue
is initially added to a multipath queue.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reject a new shared namespace if a duplicate unshared namespace exists.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even if a namespace reports it is not capable of sharing, search the
subsystem for a matching namespace head. If found, the driver should
reject that namespace since it's coming from an invalid configuration.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a namespace identification does not match the subsystem's head for
that NSID, release the reference that was taken when the matching head
was initially found.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The driver had been unlinking the namespace head from the subsystem's
list only after the last reference was released, and outside of the
list's subsys->lock protection.
There is no reason to track an empty head, so unlink the entry from the
subsystem's list when the last namespace using that head is removed and
with the mutex lock protecting the list update. The next namespace to
attach reusing the previous NSID will allocate a new head rather than
find the old head with mismatched identifiers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace it with a value derived from the identify data and nsid sizes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The namespace lists are 0-terminated, so we don't really need the NN value
execept for the legacy sequential scan.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Factor out a piece of deeply indented and logicaly separate code
from nvme_scan_ns_list into a new helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the check for the supported CNS value into nvme_scan_ns_list, and
limit the life time of the identify controller allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to check if we can use Identify CNS values > 1, and refine
the Qemu quirk to not apply to reported versions larger than 1.1, as the
Qemu implementation had been fixed by then.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_alloc_ns_head() doesn't use the 'struct nvme_id_ns' parameter.
Remove it, and update caller accordingly.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Various nvme commands use a zeroes based number of dwords field. Create
a helper function to convert byte lengths to this format.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme-fc host transport did not support the reception of a
FC-NVME LS. Reception is necessary to implement full compliance
with FC-NVME-2.
Populate the LS receive handler, and specifically the handling
of a Disconnect Association LS. The response to the LS, if it
matched a controller, must be sent after the aborts for any
I/O on any connection have been sent.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given that both host and target now generate and receive LS's create
a single table definition for LS names. Each tranport half will have
a local version of the table.
As Create Association LS is issued by both sides, and received by
both sides, create common routines to format the LS and to validate
the LS.
Convert the host side transport to use the new common Create
Association LS formatting routine.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the assoc_active boolean flag to a bitop on the flags field.
The bit ops will provide atomicity.
To make this change, the flags field was converted to a long type,
which also affects the FCCTRL_TERMIO flag. Both FCCTRL_TERMIO and
now ASSOC_ACTIVE flags are set/cleared by bit operations.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that when allocations are done, and the lldd options indicate
no private data is needed, that private pointers will be set to NULL
(catches driver error that forgot to set private data size).
Slightly reorg the allocations so that private data follows allocations
for LS request/response buffers. Ensures better alignments for the buffers
as well as the private pointer.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current code uses NVME_FC_MAX_LS_BUFFER_SIZE (2KB) when allocating
buffers for LS requests and responses. This is considerable overkill
for what is actually defined.
Rework code to have unions for all possible requests and responses
and size based on the unions. Remove NVME_FC_MAX_LS_BUFFER_SIZE.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Routines in the target will want to be used in the host as well.
Error definitions should now shared as both sides will process
requests and responses to requests.
Moved common declarations to new fc.h header kept in the host
subdirectory.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current LLDD api has:
nvme-fc: contains api for transport to do LS requests (and aborts of
them). However, there is no interface for reception of LS's and sending
responses for them.
nvmet-fc: contains api for transport to do reception of LS's and sending
of responses for them. However, there is no interface for doing LS
requests.
Revise the api's so that both nvme-fc and nvmet-fc can send LS's, as well
as receiving LS's and sending their responses.
Change name of the rcv_ls_req struct to better reflect generic use as
a context to used to send an ls rsp. Specifically:
nvmefc_tgt_ls_req -> nvmefc_ls_rsp
nvmefc_tgt_ls_req.nvmet_fc_private -> nvmefc_ls_rsp.nvme_fc_private
Change nvmet_fc_rcv_ls_req() calling sequence to provide handle that
can be used by transport in later LS request sequences for an association.
nvme-fc nvmet_fc nvme_fcloop:
Revise to adapt to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Add stubs for new interfaces:
host/fc.c: nvme_fc_rcv_ls_req()
target/fc.c: nvmet_fc_invalidate_host()
lpfc:
Revise to adapt code to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the controller is reconnecting, the host fails I/O and admin
commands as the host cannot reach the controller. ns scanning may
revalidate namespaces during that period and it is wrong to remove
namespaces due to these failures as we may hang (see 205da24343).
One command that may fail is nvme_identify_ns_descs. Since we return
success due to having ns identify descriptor list optional, we continue
to compare ns identifiers in nvme_revalidate_disk, obviously fail and
return -ENODEV to nvme_validate_ns, which will remove the namespace.
Exactly what we don't want to happen.
Fixes: 22802bf742 ("nvme: Namepace identification descriptor list is optional")
Tested-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pre-incrementing ->cq_head can't be done in memory because OOB value
can be observed by another context.
This devalues space savings compared to original code :-\
$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32)
Function old new delta
nvme_poll_irqdisable 464 456 -8
nvme_poll 455 447 -8
nvme_irq 388 380 -8
nvme_dev_disable 955 947 -8
But the code is minimal now: one read for head, one read for q_depth,
one increment, one comparison, single instruction phase bit update and
one write for new head.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: John Garry <john.garry@huawei.com>
Tested-by: John Garry <john.garry@huawei.com>
Fixes: e2a366a4b0 ("nvme-pci: slimmer CQ head update")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When jumping to the out_put_disk label, we will call put_disk(), which will
trigger a call to disk_release(), which calls blk_put_queue().
Later in the cleanup code, we do blk_cleanup_queue(), which will also call
blk_put_queue().
Putting the queue twice is incorrect, and will generate a KASAN splat.
Set the disk->queue pointer to NULL, before calling put_disk(), so that the
first call to blk_put_queue() will not free the queue.
The second call to blk_put_queue() uses another pointer to the same queue,
so this call will still free the queue.
Fixes: 85136c0102 ("lightnvm: simplify geometry enumeration")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6QhDIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsE/EADOQ0xDMOa8EmzRvjuCkiaB9yK2zXiBSAj5
ZBi7ReownfXhCR7nVc8Bv1s2f00PD6CFNURXZmdgyDDrXEd2ojueDoAZNBk59t0e
i2CAF2wLAQ5EfuVaxSHVEOrVEmtu+ue+Ix83JNlnGPd7pf9s7uKc/W4iKGpgpxIo
1CpXmWwm5RwjX4z/Qsiaka2lB7QojjImp1n3C+XI5+pp/bJXiftep1lxH5Y3nSWU
iR4jO81uxDMxhTEZ9z2cb1HarhctKvnihcb39gQYQ/kYYu7hSZnBPZo5zp5Dyb/t
4tGuDsfXCQCbF0smkusUrcyeT19vh9tOsGkiMzJ/ihm7TMyN4fT23h6DUb/7pAON
jnlcB7r5Ezs8jLz9i+mAoq06djd5u54kiuKFog8170sTrtYsncZbyc01wLNAla/V
/6KX1sMbPlbXZ+a3l3i7i/gcCBJ7ci6pV3x2elvM9dKHxyqJmwEGMlFVwt4s26ev
wS+7+dktLAC73889Zyn8LutA4bWy5FmisSPA4PydSUSOZA+7JjlbILcz15jjwlP2
HzYk+TXsd3yJUQRYX5P0FcDaBUTISr/xeUUB+KT1rLv4Lhtso+S/9cvSc8x5mOa9
989gmqNfFAWoj1nKEIKeRwLjk0b6YA9qMv4jOwwiuobsT55aBxpbP80huNoRVj5L
xFIWgBSwzg==
=3woC
-----END PGP SIGNATURE-----
Merge tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here's a set of fixes that should go into this merge window. This
contains:
- NVMe pull request from Christoph with various fixes
- Better discard support for loop (Evan)
- Only call ->commit_rqs() if we have queued IO (Keith)
- blkcg offlining fixes (Tejun)
- fix (and fix the fix) for busy partitions"
* tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block:
block: fix busy device checking in blk_drop_partitions again
block: fix busy device checking in blk_drop_partitions
nvmet-rdma: fix double free of rdma queue
blk-mq: don't commit_rqs() if none were queued
nvme-fc: Revert "add module to ops template to allow module references"
nvme: fix deadlock caused by ANA update wrong locking
nvmet-rdma: fix bonding failover possible NULL deref
loop: Better discard support for block devices
loop: Report EOPNOTSUPP properly
nvmet: fix NULL dereference when removing a referral
nvme: inherit stable pages constraint in the mpath stack device
blkcg: don't offline parent blkcg first
blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it
nvme-tcp: fix possible crash in recv error flow
nvme-tcp: don't poll a non-live queue
nvme-tcp: fix possible crash in write_zeroes processing
nvmet-fc: fix typo in comment
nvme-rdma: Replace comma with a semicolon
nvme-fcloop: fix deallocation of working context
nvme: fix compat address handling in several ioctls
The original patch was to resolve the lldd being able to be unloaded
while being used to talk to the boot device of the system. However, the
end result of the original patch is that any driver unload while a nvme
controller is live via the lldd is now being prohibited. Given the module
reference, the module teardown routine can't be called, thus there's no
way, other than manual actions to terminate the controllers.
Fixes: 863fbae929 ("nvme_fc: add module to ops template to allow module references")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The deadlock combines 4 flows in parallel:
- ns scanning (triggered from reconnect)
- request timeout
- ANA update (triggered from reconnect)
- I/O coming into the mpath device
(1) ns scanning triggers disk revalidation -> update disk info ->
freeze queue -> but blocked, due to (2)
(2) timeout handler reference the g_usage_counter - > but blocks in
the transport .timeout() handler, due to (3)
(3) the transport timeout handler (indirectly) calls nvme_stop_queue() ->
which takes the (down_read) namespaces_rwsem - > but blocks, due to (4)
(4) ANA update takes the (down_write) namespaces_rwsem -> calls
nvme_mpath_set_live() -> which synchronize the ns_head srcu
(see commit 504db087aa) -> but blocks, due to (5)
(5) I/O came into nvme_mpath_make_request -> took srcu_read_lock ->
direct_make_request > blk_queue_enter -> but blocked, due to (1)
==> the request queue is under freeze -> deadlock.
The fix is making ANA update take a read lock as the namespaces list
is not manipulated, it is just the ns and ns->head that are being
updated (which is protected with the ns->head lock).
Fixes: 0d0b660f21 ("nvme: add ANA support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
update changing all our txt files to rst ones. Excluding that, we
have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc,
pm80xx, aacraid), a treewide update for scnprintf and some other minor
updates. The major core update is Hannes moving functions out of the
aacraid driver and into the core.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXoYKiyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSasAP4iGwSB
Y8tFaZgWadu76+wj5MdqTBoXdhnIuFF0rZG3pQEAiIKdsfQlbSFdm75+gUtx5hG/
GOilX/pJczTRJDCGNis=
=g7Sk
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series has a huge amount of churn because it pulls in Mauro's doc
update changing all our txt files to rst ones.
Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc,
zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and
some other minor updates.
The major core change is Hannes moving functions out of the aacraid
driver and into the core"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits)
scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code
scsi: ufs: Do not rely on prefetched data
scsi: dc395x: remove dc395x_bios_param
scsi: libiscsi: Fix error count for active session
scsi: hpsa: correct race condition in offload enabled
scsi: message: fusion: Replace zero-length array with flexible-array member
scsi: qedi: Add PCI shutdown handler support
scsi: qedi: Add MFW error recovery process
scsi: ufs: Enable block layer runtime PM for well-known logical units
scsi: ufs-qcom: Override devfreq parameters
scsi: ufshcd: Let vendor override devfreq parameters
scsi: ufshcd: Update the set frequency to devfreq
scsi: ufs: Resume ufs host before accessing ufs device
scsi: ufs-mediatek: customize the delay for enabling host
scsi: ufs: make HCE polling more compact to improve initialization latency
scsi: ufs: allow custom delay prior to host enabling
scsi: ufs-mediatek: use common delay function
scsi: ufs: introduce common and flexible delay function
scsi: ufs: use an enum for host capabilities
scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc()
...
If the backing device require stable pages, we need to set it on the
stack mpath device as well. This applies to rdma/fc transports when
doing data integrity and tcp transport calculating digests.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If the target misbehaves and sends us unexpected payload we
need to make sure to fail the controller and stop processing
the input stream. We clear the rd_enabled flag and stop
the io_work, but we may still requeue it if we still have pending
sends and then in the next invocation we will process the input
stream as the check is only in the .data_ready upcall.
To fix this we need to make sure not to self-requeue io_work
upon a recv flow error.
This fixes the crash:
nvme nvme2: receive failed: -22
BUG: unable to handle page fault for address: ffffbeb5816c3b48
nvme_ns_head_make_request: 29 callbacks suppressed
block nvme0n5: no usable path - requeuing I/O
block nvme0n5: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n7: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
block nvme0n3: no usable path - requeuing I/O
#PF: supervisor read access inkernel mode
#PF: error_code(0x0000) - not-present page
PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
Oops: 0000 [#1] SMP PTI
CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
FS: 0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tcp_read_sock+0x8c/0x290
? __release_sock+0x9d/0xe0
? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
? finish_task_switch+0x163/0x270
process_one_work+0x1fd/0x3f0
worker_thread+0x34/0x410
kthread+0x121/0x140
? process_one_work+0x3f0/0x3f0
? kthread_park+0xb0/0xb0
ret_from_fork+0x35/0x40
Reported-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>