Without this we can't cleanly shut down.
Based on analysis an an earlier patch from Hannes Reinecke.
Fixes: bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
Reported-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJbDwViAAoJEPfTWPspceCmv+gQANwV5cPSvXEsGdsI4FX+ThUD
li2vscItK2firIAPnDMN9K3pKCWN46Rm8oLvzo2vois/vAtIvbg+eJqxJmNFPNJP
X8wcFb108DF/4I8K1Vklq1ILl5+CNmb1KjdiJjrUbDhDlQ4lsHo0eUAabbZ4LTZE
mT5BZjHiYCLW/om/AXZWV5TWVopMefp/hBX8ICSts8vCMRnxA8xoFkC1QB3WgG4g
2P49RsLehfqIJTNOmwcw0TShNWu0gFjdhrTv1NpzsHcJuGvCUJ+OhhUyZAmGNSCb
VyGp7xb1T9NqyfVLnBeMHM1V2PudT3lX4VZkDFG5ZKkFdam9J2SvaFaNwl2pKear
znQAnpGH1fXlVSx4FCPTPESj3buEpJ8UY99+gpHQ5+cN1jVJEG+UprQnuXFpNhFM
sOizGiqOU5edwkxcMl6M/kJwmbNPnxFs/fDgUgWacorTBZEG7bxcrI10rKnuPs3A
Zl3OCYXq7Ls0lU9QBRqSPNVd87i/BTuOWUqkLjjVH8xsOiw1MFVqLKfpZVe9AWZy
4Y8u9ShHo/tKO219Sq2onUUGUmdCtMQJwx0vnuD3r0B7VMIUBNc6D8xwre2xoNrr
1BgnNooyXrC+7ESrEdaO2ORn0W3VqZkTMfc0k3Cms/ZbEc7/1+8bWTfmHDYh+VbV
gJ9jl85Qd2aGjK+f0s8+
=Dk2i
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180530' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Just a single fix that should make it into this release, fixing a
regression with T10-DIF on NVMe"
* tag 'for-linus-20180530' of git://git.kernel.dk/linux-block:
nvme: fix extended data LBA supported setting
We already check for started commands in all callbacks, but we should
also protect against already completed commands. Do this by taking
the checks to common code.
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With recent CQ handling improvements we can now move the locking into
__nvme_submit_cmd. Also remove the local tail variable to make the code
more obvious, remove the __ prefix in the name, and fix the comments
describing the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
The block layer's timeout handling currently prevents drivers from
completing commands outside the timeout callback once blk-mq decides
they've expired. If a device breaks, this could potentially create many
thousands of timed out commands. There's nothing of value to be gleaned
from observing each of those messages, so this patch adds a rate limit
on them.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The current checks for whether a new controller request "matches" an
existing controller ignores controller state and checks identity strings.
There are cases where an existing controller may be in its last steps of
deletion when they are "matched" by a new connection.
Change the behavior so that the new connection ignores controllers that
are deleted.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pull NVMe changes from Christoph:
"Here is the current batch of nvme updates for 4.18, we have a few more
patches in the queue, but I'd like to get this pile into your tree
and linux-next ASAP.
The biggest item is support for file-backed namespaces in the NVMe
target from Chaitanya, in addition to that we mostly small fixes from
all the usual suspects."
* 'nvme-4.18-2' of git://git.infradead.org/nvme:
nvme: fixup memory leak in nvme_init_identify()
nvme: fix KASAN warning when parsing host nqn
nvmet-loop: use nr_phys_segments when map rq to sgl
nvmet-fc: increase LS buffer count per fc port
nvmet: add simple file backed ns support
nvmet: remove duplicate NULL initialization for req->ns
nvmet: make a few error messages more generic
nvme-fabrics: allow duplicate connections to the discovery controller
nvme-fabrics: centralize discovery controller defaults
nvme-fabrics: remove unnecessary controller subnqn validation
nvme-fc: remove setting DNR on exception conditions
nvme-rdma: stop admin queue before freeing it
nvme-pci: Fix AER reset handling
nvme-pci: set nvmeq->cq_vector after alloc cq/sq
nvme: host: core: fix precedence of ternary operator
nvme: fix lockdep warning in nvme_mpath_clear_current_path
This value depands on the metadata support value, so reorder the
initialization to fit.
Fixes: b5be3b392 ("nvme: always unregister the integrity profile in __nvme_revalidate_disk")
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
NVMe always completes the request before returning from ->timeout, either
by polling for it, or by disabling the controller. Return BLK_EH_DONE so
that the block layer doesn't even try to complete it again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
link error when another driver using it is built-in. The
INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
a 'bool' symbol that does not force anything to be a module in turn.
fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_request':
trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_destroy_trans':
trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'
Fixes: 9533b292a7 ("IB: remove redundant INFINIBAND kconfig dependencies")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If nvme_get_effects_log() failed the 'id' buffer from the previous
nvme_identify_ctrl() call will never be freed.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The host nqn actually is smaller than the space reserved for it,
so we should be using strlcpy to keep KASAN happy.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
if a command contains data to me mapped. This fixes the case where
a struct requests contains LBAs, but no data will actually be send,
e.g. the pending Write Zeroes support.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Todays limit on concurrent LS's is very small - 4 buffers. With large
subsystem counts or large numbers of initiators connecting, the limit
may be exceeded.
Raise the LS buffer count to 256.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch adds simple file backed namespace support for NVMeOF target.
The new file io-cmd-file.c is responsible for handling the code for I/O
commands when ns is file backed. Also, we introduce mempools based slow
path using sync I/Os for file backed ns to ensure forward progress under
reclaim.
The old block device based implementation is moved to io-cmd-bdev.c and
use a "nvmet_bdev_" symbol prefix. The enable/disable calls are also
move into the respective files.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: updated changelog, fixed double req->ns lookup in bdev case]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove the duplicate NULL initialization for req->ns. req->ns is always
initialized to NULL in nvmet_req_init(), so there is no need to reset
it later on failures unless we have previously assigned a value to it.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
"nvmet_check_ctrl_status()" is called from admin-cmd.c along
with io-cmd.c, make the error message more generic.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The whole point of the discovery controller is that it can accept
multiple connections. Additionally the cmic field is not even defined for
the discovery controller identify page.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When connecting to the discovery controller we have certain defaults
to observe, so centralize them to avoid inconsistencies due to argument
ordering.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
After creating the nvme controller, nvmf_create_ctrl() validates
the newly created subsysnqn vs the one specified by the options.
In general, this is an unnecessary check as the Connect message
should implicitly ensure this value matches.
With the change to the FC transport to do an asynchronous connect
for the first association create, the transport will return to
nvmf_create_ctrl() before that first association has been established,
thus the subnqn will not yet be set.
Remove the unnecessary validation.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Current code will set DNR if the controller is deleting or there is
an error during controller init. None of this is necessary.
Remove the code that sets DNR
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
For any failure after nvme_rdma_start_queue in
nvme_rdma_configure_admin_queue, the admin queue will be freed with the
NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked,
that will cause a use-after-free.
BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm]
To fix it, call nvme_rdma_stop_queue for all the failed cases after
nvme_rdma_start_queue.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The nvme timeout handling doesn't do anything if the pci channel is
offline, which is the case when recovering from PCI error event, so it
was a bad idea to sync the controller reset in this state. This patch
flushes the reset work in the error_resume callback instead when the
channel is back to online. This keeps AER handling serialized and
can recover from timeouts.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757
Fixes: cc1d5e749a ("nvme/pci: Sync controller reset for AER slot_reset")
Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Set cq_vector after alloc cq/sq, otherwise nvme_suspend_queue will invoke
free_irq for it and cause a 'Trying to free already-free IRQ xxx'
warning if the create CQ/SQ command times out.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: fixed to pass a s16 and clean up the comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Remove bouncing addresses from the MAINTAINERS file
- Kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers
- Various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers
- Two fixes for patches already sent during the merge window
- A long standing bug related to not decreasing the pinned pages count in the right
MM was found and fixed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJbByPQAAoJEDht9xV+IJsa164P/AihB/vbn9MBdK3pe1OSUGTm
tKZJ/Y6nY/Q/XTJSeM2wNECk8fOrZbKuLBz2XlPRsB2djp4ugC5WWfK9YbwWMGXG
I5B/lB8VTorQr8E5i9lqqMDQc8aF8VcGJtdqVE3nD4JsVTrQSGiSnw45/BARDUm3
OycJJMDOWhDj2wnNSa+JfjPemIMDM1jse7DnsJfDsGfTMS/G+6nyzjKIlEnnFZ8/
PBxhq0q7C5viNDwwn2GsAVUrATTlW48SY0WYhkgMdSl20d2th9wMZqNMqtniz8NP
lg87SrhzsAPOTlbSWlYYkAnzE7nEhfJyIfYUp2piNJeYuOohYPtO6w99Tqjl/GmU
uLIYIXtZCxAK1Zb/znc49HkRVL5YFDsQGXdtYy7tvRZPwwR32kowUtpKIWaZFz8O
BA/x+Zgqu9AlwqSWwQwxmMbUX42RRwhNJDVyTYlXQSSzhfgFaLIZARqb4K6HxeNN
vZN0BK+x6pX6FI7hpdsqNRtH1oo4SNUBxiuUsrZ7cy7GqYNdUJ6piygDgmERaJxU
svIUJof/+OoU1QyErQ0JgUEK/3jOHbjxSPb/rjQeqxAnCqhaGOuNGMtdfsGqgvBU
x/u3eDcbfi/LBErXR46gYtxnOQ8I2BB+m8erUc/GVvCzWrX+R7ELZYpBrP5Pcu/6
mr2D7hDqgZHbeU8aB8+D
=uFZh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is pretty much just the usual array of smallish driver bugs.
- remove bouncing addresses from the MAINTAINERS file
- kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and
hns drivers
- various small LOC behavioral/operational bugs in mlx5, hns, qedr
and i40iw drivers
- two fixes for patches already sent during the merge window
- a long-standing bug related to not decreasing the pinned pages
count in the right MM was found and fixed"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
RDMA/hns: Move the location for initializing tmp_len
RDMA/hns: Bugfix for cq record db for kernel
IB/uverbs: Fix uverbs_attr_get_obj
RDMA/qedr: Fix doorbell bar mapping for dpi > 1
IB/umem: Use the correct mm during ib_umem_release
iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()'
RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint
RDMA/i40iw: Avoid reference leaks when processing the AEQ
RDMA/i40iw: Avoid panic when objects are being created and destroyed
RDMA/hns: Fix the bug with NULL pointer
RDMA/hns: Set NULL for __internal_mr
RDMA/hns: Enable inner_pa_vld filed of mpt
RDMA/hns: Set desc_dma_addr for zero when free cmq desc
RDMA/hns: Fix the bug with rq sge
RDMA/hns: Not support qp transition from reset to reset for hip06
RDMA/hns: Add return operation when configured global param fail
RDMA/hns: Update convert function of endian format
RDMA/hns: Load the RoCE dirver automatically
RDMA/hns: Bugfix for rq record db for kernel
RDMA/hns: Add rq inline flags judgement
...
Ternary operator have lower precedence then bitwise or, so 'cdw10' was
calculated wrong.
Signed-off-by: Ivan Bornyakov <brnkv.i1@gmail.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
When running blktest's nvme/005 with a lockdep enabled kernel the test
case fails due to the following lockdep splat in dmesg:
=============================
WARNING: suspicious RCU usage
4.17.0-rc5 #881 Not tainted
-----------------------------
drivers/nvme/host/nvme.h:457 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/u32:5/1102:
#0: (ptrval) ((wq_completion)"nvme-wq"){+.+.}, at: process_one_work+0x152/0x5c0
#1: (ptrval) ((work_completion)(&ctrl->scan_work)){+.+.}, at: process_one_work+0x152/0x5c0
#2: (ptrval) (&subsys->lock#2){+.+.}, at: nvme_ns_remove+0x43/0x1c0 [nvme_core]
The only caller of nvme_mpath_clear_current_path() is nvme_ns_remove()
which holds the subsys lock so it's likely a false positive, but when
using rcu_access_pointer(), we're telling rcu and lockdep that we're
only after the pointer falue.
Fixes: 32acab3181 ("nvme: implement multipath access to nvme subsystems")
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Pull NVMe changes from Keith:
"This is just the first nvme pull request for 4.18. There are several
fabrics and target patches that I missed, so there will be more to
come."
* 'nvme-4.18' of git://git.infradead.org/nvme:
nvme-pci: drop IRQ disabling on submission queue lock
nvme-pci: split the nvme queue lock into submission and completion locks
nvme-pci: handle completions outside of the queue lock
nvme-pci: move ->cq_vector == -1 check outside of ->q_lock
nvme-pci: remove cq check after submission
nvme-pci: simplify nvme_cqe_valid
nvme: mark the result argument to nvme_complete_async_event volatile
nvme/pci: Sync controller reset for AER slot_reset
nvme/pci: Hold controller reference during async probe
nvme: only reconfigure discard if necessary
nvme/pci: Use async_schedule for initial reset work
nvme: lightnvm: add granby support
NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage
nvme: change order of qid and cmdid in completion trace
nvme: fc: provide a descriptive error
Since we aren't sharing the lock for completions now, we don't
have to make it IRQ safe.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This is now feasible. We protect the submission queue ring with
->sq_lock, and the completion side with ->cq_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Split the completion of events into a two part process:
1) Reap the events inside the queue lock
2) Complete the events outside the queue lock
Since we never wrap the queue, we can access it locklessly after we've
updated the completion queue head. This patch started off with batching
events on the stack, but with this trick we don't have to. Keith Busch
<keith.busch@intel.com> came up with that idea.
Note that this kills the ->cqe_seen as well. I haven't been able to
trigger any ill effects of this. If we do race with polling every so
often, it should be rare enough NOT to trigger any issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: refactored, restored poll early exit optimization]
Signed-off-by: Christoph Hellwig <hch@lst.de>
We only clear it dynamically in nvme_suspend_queue(). When we do, ensure
to do a full flush so that any nvme_queue_rq() invocation will see it.
Ideally we'd kill this check completely, but we're using it to flush
requests on a dying queue.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We always check the completion queue after submitting, but in my testing
this isn't a win even on DRAM/xpoint devices. In some cases it's
actually worse. Kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We always look at the current CQ head and phase, so don't pass these
as separate arguments, and rename the function to nvme_cqe_pending.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Updates to the handling of expedited grace periods, perhaps most
notably parallelizing their initialization. Other changes
include fixes from Boqun Feng.
- Miscellaneous fixes. These include an nvme fix from Nitzan Carmi
that I am carrying because it depends on a new SRCU function
cleanup_srcu_struct_quiesced(). This branch also includes fixes
from Byungchul Park and Yury Norov.
- Updates to reduce lock contention in the rcu_node combining tree.
These are in preparation for the consolidation of RCU-bh,
RCU-preempt, and RCU-sched into a single flavor, which was
requested by Linus Torvalds in response to a security flaw
whose root cause included confusion between the multiple flavors
of RCU.
- Torture-test updates that save their users some time and effort.
Conflicts:
drivers/nvme/host/core.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The nvme_delete_ctrl() function queues a work item on a MEM_RECLAIM
queue (nvme_delete_wq), which eventually calls cleanup_srcu_struct(),
which in turn flushes a delayed work from an !MEM_RECLAIM queue. This
is unsafe as we might trigger deadlocks under severe memory pressure.
Since we don't ever invoke call_srcu(), it is safe to use the shiny new
_quiesced() version of srcu cleanup, thus avoiding that flush dependency.
This commit makes that change.
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
AER handling expects a successful return from slot_reset means the
driver made the device functional again. The nvme driver had been using
an asynchronous reset to recover the device, so the device
may still be initializing after control is returned to the
AER handler. This creates problems for subsequent event handling,
causing the initializion to fail.
This patch fixes that by syncing the controller reset before returning
to the AER driver, and reporting the true state of the reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199657
Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Cc: Sinan Kaya <okaya@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Some P3100 drives have a bug where they think WRRU (weighted round robin)
is always enabled, even though the host doesn't set it. Since they think
it's enabled, they also look at the submission queue creation priority. We
used to set that to MEDIUM by default, but that was removed in commit
81c1cd9835. This causes various issues on that drive. Add a quirk to
still set MEDIUM priority for that controller.
Fixes: 81c1cd9835 ("nvme/pci: Don't set reserved SQ create flags")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
If a controller reset is requested while the device has no namespaces,
we were incorrectly returning ENETRESET. This patch adds the check for
ADMIN_ONLY controller state to indicate a successful reset.
Fixes: 8000d1fdb0 ("nvme-rdma: fix sysfs invoked reset_ctrl error flow ")
Cc: <stable@vger.kernel.org>
Signed-off-by: Charles Machalow <charles.machalow@intel.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for
options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It is possible the driver's remove may have freed the controller if
the remove callback is invoked prior to the async_schedule starting
the reset_work. This patch fixes that by holding a reference on the
controller.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Currently only nvme_ctrl will take a reference counter of
nvme_subsystem, nvme_ns_head also needs it. Otherwise
nvme_free_ns_head will access the nvme_subsystem.ns_ida
which has been freed by __nvme_release_subsystem after all the
reference of nvme_subsystem have been released by nvme_free_ctrl.
This could cause memory corruption.
BUG: KASAN: use-after-free in radix_tree_next_chunk+0x9f/0x4b0
Read of size 8 at addr ffff88036494d2e8 by task fio/1815
CPU: 1 PID: 1815 Comm: fio Kdump: loaded Tainted: G W 4.17.0-rc1+ #18
Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
Call Trace:
dump_stack+0x91/0xeb
print_address_description+0x6b/0x290
kasan_report+0x261/0x360
radix_tree_next_chunk+0x9f/0x4b0
ida_remove+0x8b/0x180
ida_simple_remove+0x26/0x40
nvme_free_ns_head+0x58/0xc0
__blkdev_put+0x30a/0x3a0
blkdev_close+0x44/0x50
__fput+0x184/0x380
task_work_run+0xaf/0xe0
do_exit+0x501/0x1440
do_group_exit+0x89/0x140
__x64_sys_exit_group+0x28/0x30
do_syscall_64+0x72/0x230
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
- SPDX license tag cleanups (new tag Linux-OpenIB)
- RoCE GID fixes related to default GIDs
- Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
mlx4, mlx5, and hfi1 (medium batch)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJa7JXPAAoJELgmozMOVy/dc0AP/0i7EajAmgl1ihka6BYVj2pa
DV8iSrVMDPulh9AVnAtwLJSbdwmgN/HeVzLzcutHyMYk6tAf8RCs6TsyoB36XiOL
oUh5+V2GyNnyh9veWPwyGTgZKCpPJc3uQFV6502lZVDYwArMfGatumApBgQVKiJ+
YdPEXEQZPNIs6YZB1WXkNYV/ra9u0aBByQvUrxwVZ2AND+srJYO82tqZit2wBtjK
UXrhmZbWXGWMFg8K3/lpfUkQhkG3Arj+tMKkCfqsVzC7wUPhlTKBHR9NmvdLIiy9
5Vhv7Xp78udcxZKtUeTFsbhaMqqK7x7sKHnpKAs7hOZNZ/Eg47BrMwMrZVLOFuDF
nBLUL1H+nJ1mASZoMWH5xzOpVew+e9X0cot09pVDBIvsOIh97wCG7hgptQ2Z5xig
fcDiMmg6tuakMsaiD0dzC9JI5HR6Z7+6oR1tBkQFDxQ+XkkcoFabdmkJaIRRwOj7
CUhXRgcm0UgVd03Jdta6CtYXsjSODirWg4AvSSMt9lUFpjYf9WZP00/YojcBbBEH
UlVrPbsKGyncgrm3FUP6kXmScESfdTljTPDLiY9cO9+bhhPGo1OHf005EfAp178B
jGp6hbKlt+rNs9cdXrPSPhjds+QF8HyfSlwyYVWKw8VWlh/5DG8uyGYjF05hYO0q
xhjIS6/EZjcTAh5e4LzR
=PI8v
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
"This is our first pull request of the rc cycle. It's not that it's
been overly quiet, we were just waiting on a few things before sending
this off.
For instance, the 6 patch series from Intel for the hfi1 driver had
actually been pulled in on Tuesday for a Wednesday pull request, only
to have Jason notice something I missed, so we held off for some
testing, and then on Thursday had to respin the series because the
very first patch needed a minor fix (unnecessary cast is all).
There is a sizable hns patch series in here, as well as a reasonably
largish hfi1 patch series, then all of the lines of uapi updates are
just the change to the new official Linux-OpenIB SPDX tag (a bunch of
our files had what amounts to a BSD-2-Clause + MIT Warranty statement
as their license as a result of the initial code submission years ago,
and the SPDX folks decided it was unique enough to warrant a unique
tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4
and core/cache/cma updates to round out the bunch.
None of it was overly large by itself, but in the 2 1/2 weeks we've
been collecting patches, it has added up :-/.
As best I can tell, it's been through 0day (I got a notice about my
last for-next push, but not for my for-rc push, but Jason seems to
think that failure messages are prioritized and success messages not
so much). It's also been through linux-next. And yes, we did notice in
the context portion of the CMA query gid fix patch that there is a
dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage
and remove it anywhere we can.
Summary:
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
- SPDX license tag cleanups (new tag Linux-OpenIB)
- RoCE GID fixes related to default GIDs
- Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
mlx4, mlx5, and hfi1 (medium batch)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits)
RDMA/cma: Do not query GID during QP state transition to RTR
IB/mlx4: Fix integer overflow when calculating optimal MTT size
IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
IB/hfi1: Fix loss of BECN with AHG
IB/hfi1 Use correct type for num_user_context
IB/hfi1: Fix handling of FECN marked multicast packet
IB/core: Make ib_mad_client_id atomic
iw_cxgb4: Atomically flush per QP HW CQEs
IB/uverbs: Fix kernel crash during MR deregistration flow
IB/uverbs: Prevent reregistration of DM_MR to regular MR
RDMA/mlx4: Add missed RSS hash inner header flag
RDMA/hns: Fix a couple misspellings
RDMA/hns: Submit bad wr
RDMA/hns: Update assignment method for owner field of send wqe
RDMA/hns: Adjust the order of cleanup hem table
RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set
RDMA/hns: Remove some unnecessary attr_mask judgement
RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set
...
After commit bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
resetting of the loopback nvme target failed as we forgot to switch
it's state to NVME_CTRL_CONNECTING before we reconnect the admin
queues. Therefore the checks in nvmf_check_if_ready() choose to go to
the reject_io case and thus we couldn't sent out an identify
controller command to reconnect.
Change the controller state to NVME_CTRL_CONNECTING after tearing down
the old connection and before re-establishing the connection.
Fixes: bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_NVME_MULTIPATH is set, but we're not using nvme to multipath,
namespaces with multiple paths were not creating unique names due to
reusing the same instance number from the namespace's head.
This patch fixes this by falling back to the non-multipath naming method
when the parameter disabled using multipath.
Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can't allow the user to change multipath settings at runtime, as this
will create naming conflicts due to the different naming schemes used
for each mode.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the command a separate metadata buffer attached, the request needs
to have the integrity flag set so the driver knows to map it.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When specifying same string type option several times,
current option parsing may cause memory leak. Hence,
call kfree for previous one in this case.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently nvme reconfigures discard for every disk revalidation. This
is problematic because any O_WRONLY or O_RDWR open will trigger a
partition scan through udev/systemd, and we will reconfigure discard.
This blows away any user settings, like discard_max_bytes.
Only re-configure the user settable settings if we need to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[removed redundant queue flag setting]
Signed-off-by: Keith Busch <keith.busch@intel.com>
This patch schedules the initial controller reset in an async_domain
so that it can be synchronized from wait_for_device_probe(). This way
the kernel waits for the initial nvme controller scan to complete for
all devices before proceeding with the boot sequence, which may have
nvme dependencies.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So
declare the kconfig dependency. This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add Seagate Nytro Flash Storage nvme drive to quirk list for
NVME_QUIRK_DELAY_BEFORE_CHK_RDY, which solves a bug where the drive is
probed on hot-add before the firmare is ready, I/O errors are generated
while reading sector 0, and linux is "unable to read partition table".
Signed-off-by: Micah Parrish <micah.parrish@hpe.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Keith reported that command submission and command completion
tracepoints have the order of the cmdid and qid fields swapped.
While it isn't easily possible to change the command submission
tracepoint, as there is a regression test parsing it in blktests we
can swap the command completion tracepoint to have the fields aligned.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Provide a descriptive error in case an lport to rport association
isn't found when creating the FC-NVME controller.
Currently it's very hard to debug the reason for a failed connect
attempt without a look at the source.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Instead of implementing our own version of a SR-IOV configuration stub in
the nvme driver, use the existing pci_sriov_configure_simple() function.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The nvmf_check_if_ready() checks that were added are very simplistic.
As such, the routine allows a lot of cases to fail ios during windows
of reset or re-connection. In cases where there are not multi-path
options present, the error goes back to the callee - the filesystem
or application. Not good.
The common routine was rewritten and calling syntax slightly expanded
so that per-transport is_ready routines don't need to be present.
The transports now call the routine directly. The routine is now a
fabrics routine rather than an inline function.
The routine now looks at controller state to decide the action to
take. Some states mandate io failure. Others define the condition where
a command can be accepted. When the decision is unclear, a generic
queue-or-reject check is made to look for failfast or multipath ios and
only fails the io if it is so marked. Otherwise, the io will be queued
and wait for the controller state to resolve.
Admin commands issued via ioctl share a live admin queue with commands
from the transport for controller init. The ioctls could be intermixed
with the initialization commands. It's possible for the ioctl cmd to
be issued prior to the controller being enabled. To block this, the
ioctl admin commands need to be distinguished from admin commands used
for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to
reflect this division and set it on ioctls requests. As the
nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(),
ensure that commands allocated by the ioctl path (actually anything
in core.c) preps the nvme_req(req) before starting the io. This will
preserve the USERCMD flag during execution and/or retry.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.e>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 42de82a8b5 previously attempted to fix this, and it did
correctly pad the MN and FR fields with spaces, but the SN field still
contains 0 bytes. The current code fills out the first 16 bytes with
hex2bin, leaving the last 4 bytes zeroed. Rather than adding a lot of
error-prone math to avoid overwriting SN twice, just set the whole thing
to spaces up front (it's only 20 bytes).
Fixes: 42de82a8b5 ("nvmet: don't report 0-bytes in serial number")
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Also add error flow in case srcu initialization function fails.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have to increment the number of logical blocks to a 1's based value
in the native format prior to converting to 512b units.
Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The admin and first IO queues shared the first irq vector, which has an
affinity mask including cpu0. If a system allows cpu0 to be offlined,
the admin queue may not be usable if no other CPUs in the affinity mask
are online. This is a problem since unlike IO queues, there is only
one admin queue that always needs to be usable.
To fix, this patch allocates one pre_vector for the admin queue that
is assigned all CPUs, so will always be accessible. The IO queues are
assigned the remaining managed vectors.
In case a controller has only one interrupt vector available, the admin
and IO queues will share the pre_vector with all CPUs assigned.
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All the queue memory is allocated up front. We don't take the node
into consideration when creating queues anymore, so removing the unused
parameter.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
User reported controller always retains CSTS.RDY to 1, which fails
controller disabling when resetting the controller. This is also before
the admin queue is allocated, and trying to disable an unallocated queue
results in a NULL dereference.
Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvmet_execute_get_disc_log_page() passes a fixed-length string into
nvmet_format_discovery_entry(), which then does a longer memcpy() on
it, as pointed out by gcc-8:
In function 'nvmet_format_discovery_entry',
inlined from 'nvmet_execute_get_disc_log_page' at drivers/nvme/target/discovery.c:126:4:
drivers/nvme/target/discovery.c:62:2: error: 'memcpy' forming offset [38, 223] is out of the bounds [0, 37] [-Werror=array-bounds]
memcpy(e->subnqn, subsys_nqn, NVMF_NQN_SIZE);
Using strncpy() will make this well-defined, filling the rest of the
buffer with zeroes, under the assumption that the input is either
a NUL-terminated string, or a byte sequence containing no zeroes.
If the input is a string that is longer than NVMF_NQN_SIZE, we
continue to have no NUL-termination in the output.
Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NVMe over Fabrics 1.0 Section 5.2 "Discovery Controller Properties and
Command Support" Figure 31 "Discovery Controller – Admin Commands"
explicitly listst all commands but "Get Log Page" and "Identify" as
reserved, but NetApp report the Linux host is sending Keep Alive
commands to the discovery controller, which is a violation of the
Spec.
We're already checking for discovery controllers when configuring the
keep alive timeout but when creating a discovery controller we're not
hard wiring the keep alive timeout to 0 and thus remain on
NVME_DEFAULT_KATO for the discovery controller.
This can be easily remproduced when issuing a direct connect to the
discovery susbsystem using:
'nvme connect [...] --nqn=nqn.2014-08.org.nvmexpress.discovery'
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 07bfcd09a2 ("nvme-fabrics: add a generic NVMe over Fabrics library")
Reported-by: Martin George <marting@netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_start_keep_alive() isn't used outside core.c so unexport it and
make it static.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When nvmet_req_init() fails, __nvmet_req_complete() is called
to handle the target request via .queue_response(), so
nvme_loop_queue_response() shouldn't be called again for
handling the failure.
This patch fixes this case by the following way:
- move blk_mq_start_request() before nvmet_req_init(), so
nvme_loop_queue_response() may work well to complete this
host request
- don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq()
- don't call nvme_loop_queue_response() which is done via
.queue_response()
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[trimmed changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Compiling on 32 bits system produces a warning for the shift width
when shifting 32 bit integer with 64bit integer.
Make sure that offset always is 64bit, and use macros for retrieving
lower and upper bits of the offset.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
+Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
5J1axMgVH5LAsIEcEQVq
=RVhK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
For the sysfs functions, the function names are embedded into their
error strings. If the function name later changes, the string may
not be updated accordingly. Update the strings to use __func__
to avoid this.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 2.0 spec provides a report chunk log page that can be retrieved
using the stangard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.
This patch implements the helper functions to allow targets retrieve the
chunk metadata using get log page. It makes nvme_get_log_ext available
outside of nvme core so that we can use it form lightnvm.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normalize nomenclature for naming channels, luns, chunks, planes and
sectors as well as derivations in order to improve readability.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Complete the generic geometry structure with the maxoc and maxocpu
felds, present in the 2.0 spec. Also, expose them through sysfs.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Create a shorten version to use in the generic geometry.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Separate the version between major and minor on the generic geometry and
represent it through sysfs in the 2.0 path. The 1.2 path only shows the
major version to preserve the existing user space interface.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the device geometry is stored redundantly in the nvm_id and
nvm_geo structures at a device level. Moreover, when instantiating
targets on a specific number of LUNs, these structures are replicated
and manually modified to fit the instance channel and LUN partitioning.
Instead, create a generic geometry around nvm_geo, which can be used by
(i) the underlying device to describe the geometry of the whole device,
and (ii) instances to describe their geometry independently.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Due to the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let it be updated
appropriately on the LightNVM side as well.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The value of max_phys_sect is always static. Instead of
defining it in the nvm_dev_ops structure, declare it as a global
value.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no groups in the 2.0 specification, make sure that the
nvm_id structure is flattened before 2.0 data structures are added.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make the 1.2 data structures explicit, so it will be easy to identify
the 2.0 data structures. Also fix the order of which the nvme_nvm_*
are declared, such that they follow the nvme_nvm_command order.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only one id group from the 1.2 specification is supported. Make
sure that only the first group is accessible.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The known implementations of the 1.2 specification, and upcoming 2.0
implementation all expose a sequential list of pages to write.
Remove the data structure, as it is no longer needed.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The identity structure is initialized to zero in the beginning of
the nvme_nvm_identity function. The chnl_offset is separately set to
zero. Since both the variable and assignment is never changed, remove
them.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The PCI interrupt vectors intended to be associated with a queue may
not start at 0; a driver may allocate pre_vectors for special use. This
patch adds an offset parameter so blk-mq may find the intended affinity
mask and updates all drivers using this API accordingly.
Cc: Don Brace <don.brace@microsemi.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Enable the lightnvm integration to use the nvme_get_log_ext()
function.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Have a common table of mappings from numerical transport ids to names, and
zero the transport specific area in common code in nvmet_addr_trtype_store.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For consistancy reasons, any fabric-specific works
(e.g error recovery/reconnect) should be canceled in
nvme_stop_ctrl, as for all other NVMe pending works
(e.g. scan, keep alive).
The patch aims to simplify the logic of the code, as
we now only rely on a vague demand from any fabric
to flush its private workqueues at the beginning of
.delete_ctrl op.
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While error recovery is ongoing, it is OK to move
ctrl to DELETING state (from concurrent delete_work).
Thus we don't need a warning for that case.
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a task is holding a reference to a namespace on a removed controller,
the head will not be released. If the same controller is added again
later, its namespaces may not be successfully added. Instead, the user
will see kernel message "Duplicate IDs for nsid <X>".
This patch fixes that by skipping heads that don't have namespaces when
considering if a new namespace is safe to add.
Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the work queue.
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the system work queue.
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We free nvmet rdma queues while handling rdma_cm events.
In order to avoid this we destroy the qp and the queue after destroying
the cm_id which guarantees that all rdma_cm events are done.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>