Recent patches changed calling sequences. nvme_fc_abort_outstanding_ios
used to be called from a timeout or work context. Now it is being called
in an io completion context, which can be an interrupt handler.
Unfortunately, the abort outstanding ios routine attempts to stop nvme
queues and nested routines that may try to sleep, which is in conflict
with the interrupt handler.
Correct replacing the direct call with a work element scheduling, and the
abort outstanding ios routine will be called in the work element.
Fixes: 95ced8a2c7 ("nvme-fc: eliminate terminate_io use by nvme_fc_error_recovery")
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
__nvme_fc_terminate_io() is now called by only 1 place, in reset_work.
Consoldate and move the functionality of terminate_io into reset_work.
In reset_work, rather than calling the create_association directly,
schedule the connect work element to do its thing. After scheduling,
flush the connect work element to continue with semantic of not
returning until connect has been attempted at least once.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_fc_error_recovery() special cases handling when in CONNECTING state
and calls __nvme_fc_terminate_io(). __nvme_fc_terminate_io() itself
special cases CONNECTING state and calls the routine to abort outstanding
ios.
Simplify the sequence by putting the call to abort outstanding I/Os
directly in nvme_fc_error_recovery.
Move the location of __nvme_fc_abort_outstanding_ios(), and
nvme_fc_terminate_exchange() which is called by it, to avoid adding
function prototypes for nvme_fc_error_recovery().
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
err_work was created to handle errors (mainly I/O timeouts) while in
CONNECTING state. The flag for err_work_active is also unneeded.
Remove err_work_active and err_work. The actions to abort I/Os are moved
inline to nvme_error_recovery().
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Whenever there are errors during CONNECTING, the driver recovers by
aborting all outstanding ios and counts on the io completion to fail them
and thus the connection/association they are on. However, the connection
failure depends on a failure state from the core routines. Not all
commands that are issued by the core routine are guaranteed to cause a
failure of the core routine. They may be treated as a failure status and
the status is then ignored.
As such, whenever the transport enters error_recovery while CONNECTING,
it will set a new flag indicating an association failed. The
create_association routine which creates and initializes the controller,
will monitor the state of the flag as well as the core routine error
status and ensure the association fails if there was an error.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We've had several complaints about a 10s reconnect delay (the default)
when there was an error while there is connectivity to a subsystem.
The max_reconnects and reconnect_delay are set in common code prior to
calling the transport to create the controller.
This change checks if the default reconnect delay is being used, and if
so, it adjusts it to a shorter period (2s) for the nvme-fc transport.
It does so by calculating the controller loss tmo window, changing the
value of the reconnect delay, and then recalculating the maximum number
of reconnect attempts allowed.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
On reconnect, the code currently does not freeze the controller before
possibly updating the number hw queues for the controller.
Add the freeze before updating the number of hw queues. Note: the queues
are already started and remain started through the reconnect.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The loop that backs out of hw io queue creation continues through index
0, which corresponds to the admin queue as well.
Fix the loop so it only proceeds through indexes 1..n which correspond to
I/O queues.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, an I/O timeout unconditionally invokes
nvme_fc_error_recovery() which checks for LIVE or CONNECTING state. If
live, the routine resets the controller which initiates a reconnect -
which is valid. If CONNECTING, err_work is scheduled. Err_work then
calls the terminate_io routine, which also checks for CONNECTING and
noops any further action on outstanding I/O. The result is nothing
happened to the timed out io. As such, if the command was dropped on
the wire, it will never timeout / complete, and the connect process
will hang.
Change the behavior of the io timeout routine to unconditionally abort
the I/O. I/O completion handling will note that an io failed due to an
abort and will terminate the connection / association as needed. If the
abort was unable to happen, continue with a call to
nvme_fc_error_recovery(). To ensure something different happens in
nvme_fc_error_recovery() rework it so at it will abort all I/Os on the
association to force a failure.
As I/O aborts now may occur outside of delete_association, counting for
completion must be wary and only count those aborted during
delete_association when TERMIO is set on the controller.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The lldd may have made calls to delete a remote port or local port and
the delete is in progress when the cli then attempts to create a new
controller. Currently, this proceeds without error although it can't be
very successful.
Fix this by validating that both the host port and remote port are
present when a new controller is to be created.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cancel async event work in case async event has been queued up, and
nvme_fc_submit_async_event() runs after event has been freed.
Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme_end_request is a bit misnamed, as it wraps around the
blk_mq_complete_* API. It's semantics also are non-trivial, so give it
a more descriptive name and add a comment explaining the semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On an error exit path, a negative error code should be returned
instead of a positive return value.
Fixes: e399441de9 ("nvme-fabrics: Add host support for FC transport")
Cc: James Smart <jsmart2021@gmail.com>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the FC transport is set max_hw_sectors based on the lldds
max sgl segment count. However, the block queue max segments is
set based on the controller's max_segments count, which the transport
does not set. As such, the lldd is receiving sgl lists that are
exceeding its max segment count.
Set the controller max segment count and derive max_hw_sectors from
the max segment count.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0
is an optimized path to it, path nvme1 happens to be
inaccessible.
2) Before scan_work is complete nvme0 disconnect is initiated
nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
3) scan_work(1) attempts to submit IO,
but nvme_path_is_optimized() observes nvme0 is not LIVE.
Since nvme1 is a possible path IO is requeued and scan_work hangs.
--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
--
4) Delete also hangs in flush_work(ctrl->scan_work)
from nvme_remove_namespaces().
Similiarly a deadlock with ana_work may happen: if ana_work has started
and calls nvme_mpath_set_live and device_add_disk, it will
trigger I/O. When we trigger disconnect I/O will block because
our accessible (optimized) path is disconnecting, but the alternate
path is inaccessible, so I/O blocks. Then disconnect tries to flush
the ana_work and hangs.
[ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[ 605.552087] Call Trace:
[ 605.552683] __schedule+0x2b9/0x6c0
[ 605.553507] schedule+0x42/0xb0
[ 605.554201] io_schedule+0x16/0x40
[ 605.555012] do_read_cache_page+0x438/0x830
[ 605.556925] read_cache_page+0x12/0x20
[ 605.557757] read_dev_sector+0x27/0xc0
[ 605.558587] amiga_partition+0x4d/0x4c5
[ 605.561278] check_partition+0x154/0x244
[ 605.562138] rescan_partitions+0xae/0x280
[ 605.563076] __blkdev_get+0x40f/0x560
[ 605.563830] blkdev_get+0x3d/0x140
[ 605.564500] __device_add_disk+0x388/0x480
[ 605.565316] device_add_disk+0x13/0x20
[ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core]
[ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core]
[ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core]
[ 605.573330] process_one_work+0x1db/0x380
[ 605.574144] worker_thread+0x4d/0x400
[ 605.574896] kthread+0x104/0x140
[ 605.577205] ret_from_fork+0x35/0x40
[ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
[ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.582320] nvme D 0 14044 14043 0x00000000
[ 605.583424] Call Trace:
[ 605.583935] __schedule+0x2b9/0x6c0
[ 605.584625] schedule+0x42/0xb0
[ 605.585290] schedule_timeout+0x203/0x2f0
[ 605.588493] wait_for_completion+0xb1/0x120
[ 605.590066] __flush_work+0x123/0x1d0
[ 605.591758] __cancel_work_timer+0x10e/0x190
[ 605.593542] cancel_work_sync+0x10/0x20
[ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core]
[ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core]
[ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[ 605.598320] dev_attr_store+0x17/0x30
Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
indicate the phase of controller deletion where I/O cannot be allowed
to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
be issued to the bottom device, and only after we flush the ana_work
and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
from re-firing by aborting early if we are not LIVE, so we should be safe
here.
In addition, change the transport drivers to follow the updated state
machine.
Fixes: 0d0b660f21 ("nvme: add ANA support")
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use the new blk_mq_complete_request_remote helper to avoid an indirect
function call in the completion fast path.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Asynchronous event notifications do not have an associated request.
When fcp_io() fails we unconditionally call nvme_cleanup_cmd() which
leads to a crash.
Fixes: 16686f3a6c ("nvme: move common call to nvme_cleanup_cmd to core layer")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The nvme-fc devloss_tmo is computed as the min of either the
ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
value. So correct this by printing the min value instead.
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When CONFIG_ARCH_NO_SG_CHAIN is set, op->sgl[0] cannot be dereferenced,
as gcc-10 now points out:
drivers/nvme/host/fc.c: In function 'nvme_fc_init_request':
drivers/nvme/host/fc.c:1774:29: warning: array subscript 0 is outside the bounds of an interior zero-length array 'struct scatterlist[0]' [-Wzero-length-bounds]
1774 | op->op.fcp_req.first_sgl = &op->sgl[0];
| ^~~~~~~~~~~
drivers/nvme/host/fc.c:98:21: note: while referencing 'sgl'
98 | struct scatterlist sgl[NVME_INLINE_SG_CNT];
| ^~~
I don't know if this is a legitimate warning or a false-positive.
If this is just a false alarm, the warning is easily suppressed
by interpreting the array as a pointer.
Fixes: b1ae1a2389 ("nvme-fc: Avoid preallocating big SGL for data")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme-fc host transport did not support the reception of a
FC-NVME LS. Reception is necessary to implement full compliance
with FC-NVME-2.
Populate the LS receive handler, and specifically the handling
of a Disconnect Association LS. The response to the LS, if it
matched a controller, must be sent after the aborts for any
I/O on any connection have been sent.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given that both host and target now generate and receive LS's create
a single table definition for LS names. Each tranport half will have
a local version of the table.
As Create Association LS is issued by both sides, and received by
both sides, create common routines to format the LS and to validate
the LS.
Convert the host side transport to use the new common Create
Association LS formatting routine.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the assoc_active boolean flag to a bitop on the flags field.
The bit ops will provide atomicity.
To make this change, the flags field was converted to a long type,
which also affects the FCCTRL_TERMIO flag. Both FCCTRL_TERMIO and
now ASSOC_ACTIVE flags are set/cleared by bit operations.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that when allocations are done, and the lldd options indicate
no private data is needed, that private pointers will be set to NULL
(catches driver error that forgot to set private data size).
Slightly reorg the allocations so that private data follows allocations
for LS request/response buffers. Ensures better alignments for the buffers
as well as the private pointer.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Routines in the target will want to be used in the host as well.
Error definitions should now shared as both sides will process
requests and responses to requests.
Moved common declarations to new fc.h header kept in the host
subdirectory.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current LLDD api has:
nvme-fc: contains api for transport to do LS requests (and aborts of
them). However, there is no interface for reception of LS's and sending
responses for them.
nvmet-fc: contains api for transport to do reception of LS's and sending
of responses for them. However, there is no interface for doing LS
requests.
Revise the api's so that both nvme-fc and nvmet-fc can send LS's, as well
as receiving LS's and sending their responses.
Change name of the rcv_ls_req struct to better reflect generic use as
a context to used to send an ls rsp. Specifically:
nvmefc_tgt_ls_req -> nvmefc_ls_rsp
nvmefc_tgt_ls_req.nvmet_fc_private -> nvmefc_ls_rsp.nvme_fc_private
Change nvmet_fc_rcv_ls_req() calling sequence to provide handle that
can be used by transport in later LS request sequences for an association.
nvme-fc nvmet_fc nvme_fcloop:
Revise to adapt to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Add stubs for new interfaces:
host/fc.c: nvme_fc_rcv_ls_req()
target/fc.c: nvmet_fc_invalidate_host()
lpfc:
Revise to adapt code to changed names in api header.
Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The original patch was to resolve the lldd being able to be unloaded
while being used to talk to the boot device of the system. However, the
end result of the original patch is that any driver unload while a nvme
controller is live via the lldd is now being prohibited. Given the module
reference, the module teardown routine can't be called, thus there's no
way, other than manual actions to terminate the controllers.
Fixes: 863fbae929 ("nvme_fc: add module to ops template to allow module references")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Put the ctrl reference count at nvme_uninit_ctrl as opposed to
nvme_init_ctrl which takes it. This decrease the reference count at the
core layer instead of decreasing it on each transport separately.
Also move the call of nvme_uninit_ctrl at PCI driver after calling to
nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference
count after using the dev. This is safe because those functions use
nvme_dev which is freed only later at nvme_pci_free_ctrl.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In case nvme_sysfs_delete() is called by the user before taking the ctrl
reference count, the ctrl may be freed during the creation and cause the
bug. Take the reference as soon as the controller is externally visible,
which is done by cdev_device_add() in nvme_init_ctrl(). Also take the
reference count at the core layer instead of taking it on each transport
separately.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
If an error occurs on one of the ios used for creating an
association, the creating routine has error paths that are
invoked by the command failure and the error paths will free
up the controller resources created to that point.
But... the io was ultimately determined by an asynchronous
completion routine that detected the error and which
unconditionally invokes the error_recovery path which calls
delete_association. Delete association deletes all outstanding
io then tears down the controller resources. So the
create_association thread can be running in parallel with
the error_recovery thread. What was seen was the LLDD received
a call to delete a queue, causing the LLDD to do a free of a
resource, then the transport called the delete queue again
causing the driver to repeat the free call. The second free
routine corrupted the allocator. The transport shouldn't be
making the duplicate call, and the delete queue is just one
of the resources being freed.
To fix, it is realized that the create_association path is
completely serialized with one command at a time. So the
failed io completion will always be seen by the create_association
path and as of the failure, there are no ios to terminate and there
is no reason to be manipulating queue freeze states, etc.
The serialized condition stays true until the controller is
transitioned to the LIVE state. Thus the fix is to change the
error recovery path to check the controller state and only
invoke the teardown path if not already in the CONNECTING state.
Reviewed-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvme-fc: it's possible to have connected active controllers
and as no references are taken on the LLDD, the LLDD can be
unloaded. The controller would enter a reconnect state and as
long as the LLDD resumed within the reconnect timeout, the
controller would resume. But if a namespace on the controller
is the root device, allowing the driver to unload can be problematic.
To reload the driver, it may require new io to the boot device,
and as it's no longer connected we get into a catch-22 that
eventually fails, and the system locks up.
Fix this issue by taking a module reference for every connected
controller (which is what the core layer did to the transport
module). Reference is cleared when the controller is removed.
Acked-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.
Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.
If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.
Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme_cleanup_cmd should be called for each call to nvme_setup_cmd
(symmetrical functions). Move the call for nvme_cleanup_cmd to the common
core layer and call it during nvme_complete_rq for the good flow. For
error flow, each transport will call nvme_cleanup_cmd independently. Also
take care of a special case of path failure, where we call
nvme_complete_rq without doing nvme_setup_cmd.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Code today only clears the association_id if a Disconnect LS is transmit.
Remove ambiguity and unconditionally clear the association_id if the
association has been terminated.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change wording on a couple of messages to clarify what happened.
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the new category field in the FC-NVME CMND_IU based on queue number.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sync sources with revised structure and field names to correspond with
FC-NVME-2 header sync-up.
Tested interoperability with success:
- prior initiator with new target
- prior target with new initiator
- new on new
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NVME_SC_INTERNAL should indicate an internal controller errors
and not host transport errors. These errors will propagate to
upper layers (essentially nvme core) and be interpereted as
transport errors which should not be taken into account for
namespace state or condition.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
We have a fundamental issue that fabric commands use the admin_q.
The reason is, that admin-connect, register reads and writes and
admin commands cannot be guaranteed ordering while we are running
controller resets.
For example, when we reset a controller we perform:
1. disable the controller
2. teardown the admin queue
3. re-establish the admin queue
4. enable the controller
In order to perform (3), we need to unquiesce the admin queue, however
we may have some admin commands that are already pending on the
quiesced admin_q and will immediate execute when we unquiesce it before
we execute (4). The host must not send admin commands to the controller
before enabling the controller.
To fix this, we have the fabric commands (admin connect and property
get/set, but not I/O queue connect) use a separate fabrics_q and make
sure to quiesce the admin_q before we disable the controller, and
unquiesce it only after we enable the controller.
This fixes the error prints from nvmet in a controller reset storm test:
kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0
Which indicate that the host is sending an admin command when the
controller is not enabled.
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
nvme_enable_ctrl reads the cap register right after, so
no need to do that locally in the transport driver. Have
sqsize setting in nvme_init_identify.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When aborting in-flight request for recovering controller, we have
to make sure that queue's complete function is called on completed
request before moving on. Otherwise, for example, the warning of
WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be
triggered on nvme-rdma.
Fix this issue by using blk_mq_tagset_wait_completed_request.
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
1mtSNhm5TQ==
=g8xI
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A later pull request with some followup items. I had some vacation
coming up to the merge window, so certain things items were delayed a
bit. This pull request also contains fixes that came in within the
last few days of the merge window, which I didn't want to push right
before sending you a pull request.
This contains:
- NVMe pull request, mostly fixes, but also a few minor items on the
feature side that were timing constrained (Christoph et al)
- Report zones fixes (Damien)
- Removal of dead code (Damien)
- Turn on cgroup psi memstall (Josef)
- block cgroup MAINTAINERS entry (Konstantin)
- Flush init fix (Josef)
- blk-throttle low iops timing fix (Konstantin)
- nbd resize fixes (Mike)
- nbd 0 blocksize crash fix (Xiubo)
- block integrity error leak fix (Wenwen)
- blk-cgroup writeback and priority inheritance fixes (Tejun)"
* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
MAINTAINERS: add entry for block io cgroup
null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
block: Limit zone array allocation size
sd_zbc: Fix report zones buffer allocation
block: Kill gfp_t argument of blkdev_report_zones()
block: Allow mapping of vmalloc-ed buffers
block/bio-integrity: fix a memory leak bug
nvme: fix NULL deref for fabrics options
nbd: add netlink reconfigure resize support
nbd: fix crash when the blksize is zero
block: Disable write plugging for zoned block devices
block: Fix elevator name declaration
block: Remove unused definitions
nvme: fix regression upon hot device removal and insertion
blk-throttle: fix zero wait time for iops throttled group
block: Fix potential overflow in blk_report_zones()
blkcg: implement REQ_CGROUP_PUNT
blkcg, writeback: Implement wbc_blkcg_css()
blkcg, writeback: Add wbc->no_cgroup_owner
blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
...
This topic branch covers a fundamental change in how our sg lists are
allocated to make mq more efficient by reducing the size of the
preallocated sg list. This necessitates a large number of driver
changes because the previous guarantee that if a driver specified
SG_ALL as the size of its scatter list, it would get a non-chained
list and didn't need to bother with scatterlist iterators is now
broken and every driver *must* use scatterlist iterators.
This was broken out as a separate topic because we need to convert all
the drivers before pulling the trigger and unconverted drivers kept
being found, necessitating a rebase.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXSTzzCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZB+AP9I8j/s
wWfg0Z3WNuf4D5I3rH4x1J3cQTqPJed+RjwgcQEA1gZvtOTg1ZEn/CYMVnaB92x0
t6MZSchIaFXeqfD+E7U=
=cv8o
-----END PGP SIGNATURE-----
Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI scatter-gather list updates from James Bottomley:
"This topic branch covers a fundamental change in how our sg lists are
allocated to make mq more efficient by reducing the size of the
preallocated sg list.
This necessitates a large number of driver changes because the
previous guarantee that if a driver specified SG_ALL as the size of
its scatter list, it would get a non-chained list and didn't need to
bother with scatterlist iterators is now broken and every driver
*must* use scatterlist iterators.
This was broken out as a separate topic because we need to convert all
the drivers before pulling the trigger and unconverted drivers kept
being found, necessitating a rebase"
* tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN
scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation
scsi: core: avoid preallocating big SGL for data
scsi: core: avoid preallocating big SGL for protection information
scsi: lib/sg_pool.c: improve APIs for allocating sg pool
scsi: esp: use sg helper to iterate over scatterlist
scsi: NCR5380: use sg helper to iterate over scatterlist
scsi: wd33c93: use sg helper to iterate over scatterlist
scsi: ppa: use sg helper to iterate over scatterlist
scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist
scsi: imm: use sg helper to iterate over scatterlist
scsi: aha152x: use sg helper to iterate over scatterlist
scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist
scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist
scsi: usb: image: microtek: use sg helper to iterate over scatterlist
scsi: pmcraid: use sg helper to iterate over scatterlist
scsi: ipr: use sg helper to iterate over scatterlist
scsi: mvumi: use sg helper to iterate over scatterlist
scsi: lpfc: use sg helper to iterate over scatterlist
scsi: advansys: use sg helper to iterate over scatterlist
...
Current code allows the module to be unloaded even if there are
pending data structures, such as localports and controllers on
the localports, that have yet to hit their reference counting
to remove them.
Fix by having exit entrypoint explicitly delete every controller,
which in turn will remove references on the remoteports and localports
causing them to be deleted as well. The exit entrypoint, after
initiating the deletes, will wait for the last localport to be deleted
before continuing.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When looking at console messages to troubleshoot, there are one
maybe two messages before creation of the controller is complete.
However, a lot of io takes place to reach that point. It's unclear
when things have started.
Add a message when the controller is attempting to create a new
association. Thus we know what controller, between what host and
remote port, and what NQN is being put into place for any
subsequent success or failure messages.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Giridhar Malavali <gmalavali@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
sg_alloc_table_chained() currently allows the caller to provide one
preallocated SGL and returns if the requested number isn't bigger than
size of that SGL. This is used to inline an SGL for an IO request.
However, scattergather code only allows that size of the 1st preallocated
SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory
(4KB) is claimed for the SGL for each IO request. If the I/O is small, it
would be prudent to allocate a smaller SGL.
Introduce an extra parameter to sg_alloc_table_chained() and
sg_free_table_chained() for specifying size of the preallocated SGL.
Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the
same size except for the last one. Change the code to allow both functions
to accept a variable size for the 1st preallocated SGL.
[mkp: attempted to clarify commit desc]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzd7PYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpggWD/46Hmn6FuiXQ30HTJd9WKtJzenAAIdUpjq8
+U985q7vvcqIUotMcG9VUOlCaxk79D5XbptInzLo5CRSn9vMv0sXmAHIFkoj201K
gW3sHqajnWFFj60Eq5IVdHBZekvD8+bBZMvnX+S53QHOfwY+D1Nx/CtjkxNeq+48
98kMA/Q1d87Ied6oMW6Nyc7UEN3SanTnntYRIeSrXOJPiwxVWT6SsPUC01VZcwrt
NSt6IVoW2vFgU0sg8VetzCSfJyTzI0YytjTj/WKGQzuBiKFAvChWrrYZiZ/Z4587
6W4SFR94nYkW5U1BKgrMp64KUEn20m+jk0IHRYApsFwutSBHJCeB9m2sddxur/GQ
G/IyXZxv5jKFNBhUEiSedfml9OF+nBbwJGJCKF64Wnybk/gqFgxM1gzyw4fMAXr+
qYQdETv02W0rDqUG9i3/CaXlN4Lf1IvLR8al4ao0LfDJ0TSXw+UviNsuHEHAv8ey
sioREF8JacSj1q42TsRGckn3k4HVmaGyFwI3ceLT5bRq8VAhJ+cp7WqML1lUEmY0
2iIz+PKPDSyigqrh1wvo8ZqhqHifo+0TbRkCOCi5j+PRX6GiYlrvShGevZXEZPqC
lOFNDgCH3VBTvrcx3j05jJK1qvL4QWAwb/rDUsHZVbsnSVTEHxs/3BsIFQNZpE9/
AoXCH/ye0Q==
=ZKv1
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"This is mainly some late lightnvm changes that came in just before the
merge window, as well as fixes that have been queued up since the
initial pull request was frozen.
This contains:
- lightnvm changes, fixing race conditions, improving memory
utilization, and improving pblk compatability (Chansol, Igor,
Marcin)
- NVMe pull request with minor fixes all over the map (via Christoph)
- remove redundant error print in sata_rcar (Geert)
- struct_size() cleanup (Jackie)
- dasd CONFIG_LBADF warning fix (Ming)
- brd cond_resched() improvement (Mikulas)"
* tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
block/bio-integrity: use struct_size() in kmalloc()
nvme: validate cntlid during controller initialisation
nvme: change locking for the per-subsystem controller list
nvme: trace all async notice events
nvme: fix typos in nvme status code values
nvme-fabrics: remove unused argument
nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
nvme-fc: use separate work queue to avoid warning
nvme-rdma: remove redundant reference between ib_device and tagset
nvme-pci: mark expected switch fall-through
nvme-pci: add known admin effects to augument admin effects log page
nvme-pci: init shadow doorbell after each reset
brd: add cond_resched to brd_free_pages
sata_rcar: Remove ata_host_alloc() error printing
s390/dasd: fix build warning in dasd_eckd_build_cp_raw
lightnvm: pblk: use nvm_rq_to_ppa_list()
lightnvm: pblk: simplify partial read path
lightnvm: do not remove instance under global lock
lightnvm: track inflight target creations
lightnvm: pblk: recover only written metadata
...
When tearing down a controller the following warning is issued:
WARNING: CPU: 0 PID: 30681 at ../kernel/workqueue.c:2418 check_flush_dependency
This happens as the err_work workqueue item is scheduled on the
system workqueue (which has WQ_MEM_RECLAIM not set), but is flushed
from a workqueue which has WQ_MEM_RECLAIM set.
Fix this by providing an FC-NVMe specific workqueue.
Fixes: 4cff280a5f ("nvme-fc: resolve io failures during connect")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>