We always need to do non-equal comparisms on the native endian versions
to get the correct result.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more block updates from Jens Axboe:
"This is a followup for block changes, that didn't make the initial
pull request. It's a bit of a mixed bag, this contains:
- A followup pull request from Sagi for NVMe. Outside of fixups for
NVMe, it also includes a series for ensuring that we properly
quiesce hardware queues when browsing live tags.
- Set of integrity fixes from Dmitry (mostly), fixing various issues
for folks using DIF/DIX.
- Fix for a bug introduced in cciss, with the req init changes. From
Christoph.
- Fix for a bug in BFQ, from Paolo.
- Two followup fixes for lightnvm/pblk from Javier.
- Depth fix from Ming for blk-mq-sched.
- Also from Ming, performance fix for mtip32xx that was introduced
with the dynamic initialization of commands"
* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
block: call bio_uninit in bio_endio
nvmet: avoid unneeded assignment of submit_bio return value
nvme-pci: add module parameter for io queue depth
nvme-pci: compile warnings in nvme_alloc_host_mem()
nvmet_fc: Accept variable pad lengths on Create Association LS
nvme_fc/nvmet_fc: revise Create Association descriptor length
lightnvm: pblk: remove unnecessary checks
lightnvm: pblk: control I/O flow also on tear down
cciss: initialize struct scsi_req
null_blk: fix error flow for shared tags during module_init
block: Fix __blkdev_issue_zeroout loop
nvme-rdma: unconditionally recycle the request mr
nvme: split nvme_uninit_ctrl into stop and uninit
virtio_blk: quiesce/unquiesce live IO when entering PM states
mtip32xx: quiesce request queues to make sure no submissions are inflight
nbd: quiesce request queues to make sure no submissions are inflight
nvme: kick requeue list when requeueing a request instead of when starting the queues
nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
...
Pull followup NVMe (mostly) changes from Sagi:
I added the quiesce/unquiesce patches in here as it's
easy for me easily apply changes on top. It has accumulated
reviews and includes mostly nvme anyway, please tell me if
you don't want to take them with this.
This includes:
- quiesce/unquiesce fixes in nvme and others from me
- nvme-fc add create association padding spec updates from James
- some more quirking from MKP
- nvmet nit cleanup from Max
- Fix nvme-rdma racy RDMA completion signalling from Marta
- some centralization patches from me
- add tagset nr_hw_queues updates on controller resets in
nvme drivers from me
- nvme-rdma fix resources recycling when doing error recovery from me
- minor cleanups in nvme-fc from me
We actually using the cookie returned from the last submit_bio
call.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
"i" should be signed or it could cause a forever loop on the cleanup
path. "size" can be used uninitialized.
Fixes: 87ad72a59a ("nvme-pci: implement host memory buffer support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Target validation of the Create Association LS revised to accept any
LS as long as all non-pad data has been received. This allows a (newer)
target to accept the LS from older initiators with varying pad lengths.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When our RDMA queue-pair is torn down with high load
of I/O traffic, we have no way of knowing if the
memory region was actually registered by the reg_mr
work request as it completion flushes with error (hw
might have done it or not).
So in order to not deal with all this uncertanty, we
simply recycle the MR in reinit_request.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Usually before we teardown the controller we want to:
1. complete/cancel any ctrl inflight works
2. remove ctrl namespaces (only for removal though, resets
shouldn't remove any namespaces).
but we do not want to destroy the controller device as
we might use it for logging during the teardown stage.
This patch adds nvme_start_ctrl() which queues inflight
controller works (aen, ns scan, queue start and keep-alive
if kato is set) and nvme_stop_ctrl() which cancels the works
namespace removal is left to the callers to handle.
Move nvme_uninit_ctrl after we are done with the
controller device.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When we requeue a request, we can always insert the request
back to the scheduler instead of doing it when restarting
the queues and kicking the requeue work, so get rid of
the requeue kick in nvme (core and drivers).
Also, now there is no need start hw queues in nvme_kill_queues
We don't stop the hw queues anymore, so no need to
start them.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues
quiescing/unquiescing respects the submission path rcu grace.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues
quiescing/unquiescing respects the submission path rcu grace.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues
quiescing/unquiescing respects the submission path rcu grace.
Also, make sure to unquiesce before cleanup the admin queue.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-By: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues
quiescing/unquiescing respects the submission path rcu grace.
Also make sure to kick the requeue list when appropriate.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
This patch improves the way the RDMA IB signalling is done by using atomic
operations for the signalling variable. This avoids race conditions on
sig_count.
The signalling interval changes slightly and is now the largest power of
two not larger than queue depth / 2.
ilog() usage idea by Bart Van Assche.
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
We might have more/less queues once we reconnect/reset. For
example due to cpu going online/offline or controller constraints.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
We might have more/less queues once we reconnect/reset. For
example due to cpu going online/offline
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
We might have more/less queues once we reconnect/reset. For
example due to cpu going online/offline or controller constraints.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Its what the user passed, so its probably a better
idea to keep it intact. Also, limit the number of
I/O queues to max online cpus and the lport maximum
hw queues.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Pull irq updates from Thomas Gleixner:
"The irq department delivers:
- Expand the generic infrastructure handling the irq migration on CPU
hotplug and convert X86 over to it. (Thomas Gleixner)
Aside of consolidating code this is a preparatory change for:
- Finalizing the affinity management for multi-queue devices. The
main change here is to shut down interrupts which are affine to a
outgoing CPU and reenabling them when the CPU comes online again.
That avoids moving interrupts pointlessly around and breaking and
reestablishing affinities for no value. (Christoph Hellwig)
Note: This contains also the BLOCK-MQ and NVME changes which depend
on the rework of the irq core infrastructure. Jens acked them and
agreed that they should go with the irq changes.
- Consolidation of irq domain code (Marc Zyngier)
- State tracking consolidation in the core code (Jeffy Chen)
- Add debug infrastructure for hierarchical irq domains (Thomas
Gleixner)
- Infrastructure enhancement for managing generic interrupt chips via
devmem (Bartosz Golaszewski)
- Constification work all over the place (Tobias Klauser)
- Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)
- The usual set of fixes, updates and enhancements all over the
place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
irqchip/or1k-pic: Fix interrupt acknowledgement
irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
nvme: Allocate queues for all possible CPUs
blk-mq: Create hctx for each present CPU
blk-mq: Include all present CPUs in the default queue mapping
genirq: Avoid unnecessary low level irq function calls
genirq: Set irq masked state when initializing irq_desc
genirq/timings: Add infrastructure for estimating the next interrupt arrival time
genirq/timings: Add infrastructure to track the interrupt timings
genirq/debugfs: Remove pointless NULL pointer check
irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
irqchip/gic-v3-its: Add ACPI NUMA node mapping
irqchip/gic-v3-its-platform-msi: Make of_device_ids const
irqchip/gic-v3-its: Make of_device_ids const
irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
genirq/irqdomain: Remove auto-recursive hierarchy support
irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
...
Pull core block/IO updates from Jens Axboe:
"This is the main pull request for the block layer for 4.13. Not a huge
round in terms of features, but there's a lot of churn related to some
core cleanups.
Note this depends on the UUID tree pull request, that Christoph
already sent out.
This pull request contains:
- A series from Christoph, unifying the error/stats codes in the
block layer. We now use blk_status_t everywhere, instead of using
different schemes for different places.
- Also from Christoph, some cleanups around request allocation and IO
scheduler interactions in blk-mq.
- And yet another series from Christoph, cleaning up how we handle
and do bounce buffering in the block layer.
- A blk-mq debugfs series from Bart, further improving on the support
we have for exporting internal information to aid debugging IO
hangs or stalls.
- Also from Bart, a series that cleans up the request initialization
differences across types of devices.
- A series from Goldwyn Rodrigues, allowing the block layer to return
failure if we will block and the user asked for non-blocking.
- Patch from Hannes for supporting setting loop devices block size to
that of the underlying device.
- Two series of patches from Javier, fixing various issues with
lightnvm, particular around pblk.
- A series from me, adding support for write hints. This comes with
NVMe support as well, so applications can help guide data placement
on flash to improve performance, latencies, and write
amplification.
- A series from Ming, improving and hardening blk-mq support for
stopping/starting and quiescing hardware queues.
- Two pull requests for NVMe updates. Nothing major on the feature
side, but lots of cleanups and bug fixes. From the usual crew.
- A series from Neil Brown, greatly improving the bio rescue set
support. Most notably, this kills the bio rescue work queues, if we
don't really need them.
- Lots of other little bug fixes that are all over the place"
* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
lightnvm: pblk: set line bitmap check under debug
lightnvm: pblk: verify that cache read is still valid
lightnvm: pblk: add initialization check
lightnvm: pblk: remove target using async. I/Os
lightnvm: pblk: use vmalloc for GC data buffer
lightnvm: pblk: use right metadata buffer for recovery
lightnvm: pblk: schedule if data is not ready
lightnvm: pblk: remove unused return variable
lightnvm: pblk: fix double-free on pblk init
lightnvm: pblk: fix bad le64 assignations
nvme: Makefile: remove dead build rule
blk-mq: map all HWQ also in hyperthreaded system
nvmet-rdma: register ib_client to not deadlock in device removal
nvme_fc: fix error recovery on link down.
nvmet_fc: fix crashes on bad opcodes
nvme_fc: Fix crash when nvme controller connection fails.
nvme_fc: replace ioabort msleep loop with completion
nvme_fc: fix double calls to nvme_cleanup_cmd()
nvme-fabrics: verify that a controller returns the correct NQN
nvme: simplify nvme_dev_attrs_are_visible
...
- introduce the new uuid_t/guid_t types that are going to replace
the somewhat confusing uuid_be/uuid_le types and make the terminology
fit the various specs, as well as the userspace libuuid library.
(me, based on a previous version from Amir)
- consolidated generic uuid/guid helper functions lifted from XFS
and libnvdimm (Amir and me)
- conversions to the new types and helpers (Amir, Andy and me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc
MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM
2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe
R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08
uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa
MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA
ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw
QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA
xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt
cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX
OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ
Sxg=
=J/4P
-----END PGP SIGNATURE-----
Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid
Pull uuid subsystem from Christoph Hellwig:
"This is the new uuid subsystem, in which Amir, Andy and I have started
consolidating our uuid/guid helpers and improving the types used for
them. Note that various other subsystems have pulled in this tree, so
I'd like it to go in early.
UUID/GUID summary:
- introduce the new uuid_t/guid_t types that are going to replace the
somewhat confusing uuid_be/uuid_le types and make the terminology
fit the various specs, as well as the userspace libuuid library.
(me, based on a previous version from Amir)
- consolidated generic uuid/guid helper functions lifted from XFS and
libnvdimm (Amir and me)
- conversions to the new types and helpers (Amir, Andy and me)"
* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
ACPI: hns_dsaf_acpi_dsm_guid can be static
mmc: sdhci-pci: make guid intel_dsm_guid static
uuid: Take const on input of uuid_is_null() and guid_is_null()
thermal: int340x_thermal: fix compile after the UUID API switch
thermal: int340x_thermal: Switch to use new generic UUID API
acpi: always include uuid.h
ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
ACPI / extlog: Switch to use new generic UUID API
ACPI / bus: Switch to use new generic UUID API
ACPI / APEI: Switch to use new generic UUID API
acpi, nfit: Switch to use new generic UUID API
MAINTAINERS: add uuid entry
tmpfs: generate random sb->s_uuid
scsi_debug: switch to uuid_t
nvme: switch to uuid_t
sysctl: switch to use uuid_t
partitions/ldm: switch to use uuid_t
overlayfs: use uuid_t instead of uuid_be
fs: switch ->s_uuid to uuid_t
ima/policy: switch to use uuid_t
...
The pci_error_handlers->reset_notify() method had a flag to indicate
whether to prepare for or clean up after a reset. The prepare and done
cases have no shared functionality whatsoever, so split them into separate
methods.
[bhelgaas: changelog, update locking comments]
Link: http://lkml.kernel.org/r/20170601111039.8913-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
we are going to need the name for the core routine...
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
All transports use either a private cache of controller cap or an on-stack
copy, move it to the generic struct nvme_ctrl. In the future it will also
be maintained by the core.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
All all transports use the queue_count in exactly the same, so move it to
the generic struct nvme_ctrl. In the future it will also be maintained by
the core.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-By: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
PM1725 controllers have a couple of quirks that need to be handled in
the driver:
- I/O queue depth must be limited to 64 entries on controllers that do
not report MQES.
- The host interface registers go offline briefly while resetting the
chip. Thus a delay is needed before checking whether the controller
is ready.
Note that the admin queue depth is also limited to 64 on older versions
of this board. Since our NVME_AQ_DEPTH is now 32 that is no longer an
issue.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Remove dead build rule for drivers/nvme/host/scsi.c which has been
removed by commit ("nvme: Remove SCSI translations").
Signed-off-by: Valentin Rothberg <vrothberg@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unlike most drіvers that simply pass the maximum possible vectors to
pci_alloc_irq_vectors NVMe needs to configure the device before allocting
the vectors, so it needs a manual update for the new scheme of using
all present CPUs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-4-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We can deadlock in case we got to a device removal
event on a queue which is already in the process of
destroying the cm_id is this is blocking until all
events on this cm_id will drain. On the other hand
we cannot guarantee that rdma_destroy_id was invoked
as we only have indication that the queue disconnect
flow has been queued (the queue state is updated before
the realease work has been queued).
So, we leave all the queue removal to a separate ib_client
to avoid this deadlock as ib_client device removal is in
a different context than the cm_id itself.
Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the fc transport invokes nvme_fc_error_recovery() on every
io in which the transport detects an error. Which means:
a) it's really noisy on large io loads that all get hit by a link down.
b) we repeatively call nvme_stop_queues() even though queues are
stopped upon the first error or as first steps of reset_work.
Correct by:
Errors are only meaningful if the controller is in the LIVE state.
Thus, enact the reset_work only if LIVE. If called repeatively, state
will have already transitioned.
There's no need to stop the queues here. Let the first steps of
reset_work do the queue stopping.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
if a nvme command is issued with an opcode that is not supported by
the target (example: opcode 21 - detach namespace), the target
crashes due to a null pointer.
nvmet_req_init() detects the bad opcode and immediately calls the nvme
command done routine with an error status, allowing the transport to
send the response. However, the FC transport was aborting the command
on error, so the abort freed the lldd point, but the rsp transmit path
referenced it psot the free.
Fix by removing the abort call on nvmet_req_init() failure.
The completion response will be sent with an error status code.
As the completion path will terminate the io, ensure the data_sg
lists show an unused state so that teardown paths are successful.
Signed-off-by: Paul Ely <Paul.Ely@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a controller connection is attempted (say to a subsystem that
does not exist), the first attempt errors out. If another connect
is attempted, it crashes.
Issue is the prior controller has yet execute it's final put, thus
its still on lists. However, opts points on it have been cleared, thus
causing the crash if they are referenced.
Fix is to add the missing put after the nvme_uninit_ctrl() call on
the attachment failure.
Signed-off-by: Paul Ely <Paul.Ely@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Per the recommendation by Sagi on:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html
Wait for io aborts to complete wait converted from msleep look to
using a struct completion.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current fc transport code, on io termination, is calling
nvme_cleanup_cmd() followed by the transport dma unmap routine
which also calls nvme_cleanup_cmd(). Which means two kfrees occur
on the same address, raising havoc. This resulted in odd data errors,
effectively corruption..
Fix by removing the extraneous double calls. Call now occurs only in
teardown paths and as part of dma unmap routine.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the
Identify controller data structures. Use this NQN for the subsysnqn
sysfs attribute by storing it in the nvme_ctrl structure after verifying
it. For older controllers we generate a "fake" NQN per non-normative
text in the NVMe 1.3 spec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While a NVMe Namespace is somewhat similar to a SCSI Logical Unit (and not
a Logical Unit Number anyway) there are subtile differences. Remove the
misleading comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grmberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A user reports APST is enabled, even when the NVMe is quirked or with
option "default_ps_max_latency_us=0".
The current logic will not set APST if the device is quirked. But the
NVMe in question will enable APST automatically.
Separate the logic "apst is supported" and "to enable apst", so we can
use the latter one to explicitly disable APST at initialiaztion.
BugLink: https://bugs.launchpad.net/bugs/1699004
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to differentiate fabrics from pci/loop, also lower
it to 32 as we don't really need 256 inflight admin commands.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we have no way to define a stable host-id but always use the one
which is randomly generated when we add the host or use the default host.
Provide a "hostid=%s" for user-space to pass in a persistent host-id which
overrides the randomly generated one.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The SCSI-to-NVMe translations were added to assist storage applications
utilizing SG_IO transitioning to NVMe. It was always recommended,
however, to use native NVMe for device management as too much is lost
in translation and the maintenance burden in keeping this kludgey
layer around has been neglected such that much of the translations are
completely broken.
This patch removes SG_IO handling from NVMe to avoid any confusion
regarding maintenance support for this interface. The config option for
NVMe SCSI emulation has been disabled by default since 4.5. The driver
has supported native nvme user commands since the beginning, and native
tooling is publicly available for use or as reference for anyone writing
their own tools, so there's no excuse for hanging onto a broken crutch.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Given that the code is simple enough it seems better
then passing a tag by reference for each call site, also
we can now get rid of __nvme_process_cq.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Also, maintain a consumed counter to rely on for doorbell and
cqe_seen update instead of directly relying on the cq head and phase.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Makes the code slightly more readable.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nice abstraction of the actual mechanics of how to do it.
Note the change that we call it after we assign nvmeq->cq_head
to avoid passing it.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The controller state is set to resetting prior to disabling the
controller, so this patch accounts for that state when deciding if it
needs to freeze the queues. Without this, an 'nvme reset /dev/nvme0'
blocks forever because the queues were never frozen.
Fixes: 82b057caef ("nvme-pci: fix multiple ctrl removal scheduling")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for Directives in NVMe, particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that it the device can do so most effiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and life
time of the device.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If nvme_alloc_request fails, propagate the right error, instead of
assuming ENOMEM.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When nvme_kill_queues() is run, queues may be in
quiesced state, so we forcibly unquiesce queues to avoid
blocking dispatch, and I/O hang can be avoided in
remove path.
Peviously we use blk_mq_start_stopped_hw_queues() as
counterpart of blk_mq_quiesce_queue(), now we have
introduced blk_mq_unquiesce_queue(), so use it explicitly.
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_unquiesce_queue() is used for unquiescing the
queue explicitly, so replace blk_mq_start_stopped_hw_queues()
with it.
For the scsi part, this patch takes Bart's suggestion to
switch to block quiesce/unquiesce API completely.
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: dm-devel@redhat.com
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The NVMe 1.3 spec introduces Namespace Optimal IO Boundaries (NOIOB),
which standardizes the stripe mechanism we currently have quirks for.
This patch implements the necessary logic to handle this new feature.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We don't need to wait for the reset from the delayed work item that
is kicked off when we don't get a keepalive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
This moves the nvme_reset function from the PCIe driver to common code,
renaming it to nvme_reset_ctrl in the process. Additionally a new
helper nvme_reset_ctrl_sync is added for the case where we want to
wait for the reset. To facilitate that the reset_work work structure is
move to the common nvme_ctrl structure and the ->reset_ctrl method is
removed. For now the drivers initialize the reset_work with their own
callback, but longer term we should move to callouts for specific
parts of the reset process and move even more code to the core.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Now that we get the tagset passed we can have a single implementation for
the I/O and admin queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that we get the tagset passed we can have a single implementation for
the I/O and admin queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that we get the tagset passed we can have a single implementation for
the I/O and admin queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that we get the tagset passed we can have a single implementation for
the I/O and admin queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
It only applies to read/write commands, and this way non-PCIe drivers
get the check as well instead of having to duplicate it when adding
metadata support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
And open code the SHUTDOWN_TIMEOUT macro.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We accidentally return ERR_PTR(0) which is NULL. The caller isn't
explicitly checking for that but I couldn't immediately spot whether
this would lead to a NULL dereference. Anyway, we can fix add an
error code easily enough.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
To let the host know what happends to the connection establishment,
adjust the behavior of nvmf_log_connect_error to make more connect
specifig error codes human-readble.
Signed-off-by: Guan Junxiong <guanjunxiong@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This was detected by building the nvmet-fc driver with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Change the few left over users of ctrl->dev over to using ctrl->device
for logging purposes, so we consistently use the same device.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow overriding the announced NVMe Version of a via configfs.
This is particularly helpful when debugging new features for the host
or target side without bumping the hard coded version (as the target
might not be fully compliant to the announced version yet).
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add the UUID field from the NVMe Namespace Identification Descriptor
to the nvmet_ns structure and allow it's population via configfs.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A NVMe Identify NS command with a CNS value of '3' is expecting a list
of Namespace Identification Descriptor structures to be returned to
the host for the namespace requested in the namespace identify
command.
This Namespace Identification Descriptor structure consists of the
type of the namespace identifier, the length of the identifier and the
actual identifier.
Valid types are NGUID and UUID which we have saved in our nvme_ns
structure if they have been configured via configfs. If no value has
been assigened to one of these we return an "invalid opcode" back to
the host to maintain backward compatibiliy with older implementations
without Namespace Identify Descriptor list support.
Also as the Namespace Identify Descriptor list is the only mandatory
feature change between 1.2.1 and 1.3 we can bump the advertised
version as well.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that we have a way for getting the UUID from a target, provide it
to userspace as well.
Unfortunately there is already a sysfs attribute called UUID which is
a misnomer as it holds the NGUID value. So instead of creating yet
another wrong name, create a new 'nguid' sysfs attribute for the
NGUID. For the UUID attribute add a check wheter the namespace has a
UUID assigned to it and return this or return the NGUID to maintain
backwards compatibility. This should give userspace a chance to catch
up.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@rimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If a target identifies itself as NVMe 1.3 compliant, try to get the
list of Namespace Identification Descriptors and populate the UUID,
NGUID and EUI64 fileds in the NVMe namespace structure with these
values.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The uuid field in the nvme_ns structure represents the nguid field
from the identify namespace command. And as NVMe 1.3 introduced an
UUID in the NVMe Namespace Identification Descriptor this will
collide.
So rename the uuid to nguid to prevent any further
confusion. Unfortunately we export the nguid to sysfs in the uuid
sysfs attribute, but this can't be changed anymore without possibly
breaking existing userspace.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Use NVME_IDENTIFY_DATA_SIZE define instead of hard coding the magic
4096 value.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[hch: converted three more users]
Signed-off-by: Christoph Hellwig <hch@lst.de>
The controller status polling was added to preemptively reset a failed
controller. This early detection would allow commands that would normally
timeout a chance for a retry, or find broken links when the platform
didn't support hotplug.
This once-per-second MMIO read, however, created more problems than
it solves. This often races with PCIe Hotplug events that required
complicated syncing between work queues, frequently triggered PCIe
Completion Timeout errors that also lead to fatal machine checks, and
unnecessarily disrupts low power modes by running on idle controllers.
This patch removes the watchdog timer, and instead checks controller
health only on an IO timeout when we have a reason to believe something
is wrong. If the controller is failed, the driver will disable immediately
and request scheduling a reset.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The existing driver initially maps 8192 bytes of BAR0 which is
intended to cover doorbells of admin SQ and CQ. However, if a
large stride, e.g. 10, is used, the doorbell of admin CQ will
be out of 8192 bytes. Consequently, a page fault will be raised
when the admin CQ doorbell is accessed in nvme_configure_admin_queue().
This patch fixes this issue by remapping BAR0 before accessing
admin CQ doorbell if the initial mapping is not enough.
Signed-off-by: Xu Yu <yu.a.xu@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
It is not a user option but rather a variable controller
attribute.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
To suppress the warning triggered by nvme_uninit_ctrl:
kernel: [ 50.350439] nvme nvme0: rescanning
kernel: [ 50.363351] ------------[ cut here]------------
kernel: [ 50.363396] WARNING: CPU: 1 PID: 37 at kernel/workqueue.c:2423 check_flush_dependency+0x11f/0x130
kernel: [ 50.363409] workqueue: WQ_MEM_RECLAIM
nvme-wq:nvme_del_ctrl_work [nvme_core] is flushing !WQ_MEM_RECLAIM events:nvme_scan_work [nvme_core]
This was triggered with nvme-loop, but can happen with rdma/pci as well afaict.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of each transport using it's own workqueue, export
a single nvme-core workqueue and use that instead.
In the future, this will help us moving towards some unification
if controller setup/teardown flows.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The reset operation is guaranteed to fail for all scenarios
but the esoteric case where in the last reconnect attempt
concurrent with the reset we happen to successfully reconnect.
We just deny initiating a reset if we are reconnecting.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We only care about if the queue is LIVE for request submission,
so no need for CONNECTED.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of introducing a flag for if the queue is allocated,
simply free the rdma resources when we get the error.
We allocate the queue rdma resources when we have an address
resolution, their we allocate (or take a reference on) our device
so we should free it when we have error after the address resolution
namely:
1. route resolution error
2. connect reject
3. connect error
4. peer unreachable error
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We put the reference on the device in the destroy routine
so we should lookup and take the reference in the create
routine.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We don't need it as the core polling context will take
are of rearming the completion queue.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
If a controller supports the host memory buffer we try to provide
it with the requested size up to an upper cap set as a module
parameter. We try to give as few as possible descriptors, eventually
working our way down.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
We'll need the later for the HMB support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
The merge of 4.12-rc5 into the for-4.13/block tree didn't handle the queue
ready case correctly. Fix this by propagating blk_status_t into
nvme_rdma_queue_is_ready.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of reinventing it poorly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig suggests we should to make APST work out of the box.
Hence relax the the default max latency to make them able to enter
deepest power state on default.
Here are id-ctrl excerpts from two high latency NVMes:
vid : 0x14a4
ssvid : 0x1b4b
mn : CX2-GB1024-Q11 NVMe LITEON 1024GB
ps 3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
vid : 0x15b7
ssvid : 0x1b4b
mn : A400 NVMe SanDisk 512GB
ps 3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When a NVMe is in non-op states, the latency is exlat.
The latency will be enlat + exlat only when the NVMe tries to transit
from operational state right atfer it begins to transit to
non-operational state, which should be a rare case.
Therefore, as Andy Lutomirski suggests, use exlat only when deciding power
states to trainsit to.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The failure case, of a create controller request, called
nvme_uninit_ctrl() but didn't do a put to allow the nvme
controller to be deleted.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Per FC-NVME, when lldd or transport detects an i/o error, the
connection must be terminated, which in turn requires the association
to be termianted. Currently the transport simply creates a nvme
completion status of transport error and returns the io. The FC-NVME
spec makes the mandate as initiator and host, depending on the error,
can get out of sync on outstanding io counts (sqhd/sqtail).
Implement the association teardown on lldd or transport detected
errors.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
When we encounter an transport/controller errors, error recovery
kicks in which performs:
1. stops io/admin queues
2. moves transport queues out of LIVE state
3. fast fail pending io
4. schedule periodic reconnects.
But we also need to fast fail incoming IO taht enters after we
already scheduled. Given that our queue is not LIVE anymore, simply
restart the request queues to fail in .queue_rq
Reported-by: Alex Turin <alex@vastdata.com>
Reported-by: shahar.salzman <shahar.salzman@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
We need to start admin queues too in nvme_kill_queues()
for avoiding hang in remove path[1].
This patch is very similar with 806f026f9b901eaf(nvme: use
blk_mq_start_hw_queues() in nvme_kill_queues()).
[1] hang stack trace
[<ffffffff813c9716>] blk_execute_rq+0x56/0x80
[<ffffffff815cb6e9>] __nvme_submit_sync_cmd+0x89/0xf0
[<ffffffff815ce7be>] nvme_set_features+0x5e/0x90
[<ffffffff815ce9f6>] nvme_configure_apst+0x166/0x200
[<ffffffff815cef45>] nvme_set_latency_tolerance+0x35/0x50
[<ffffffff8157bd11>] apply_constraint+0xb1/0xc0
[<ffffffff8157cbb4>] dev_pm_qos_constraints_destroy+0xf4/0x1f0
[<ffffffff8157b44a>] dpm_sysfs_remove+0x2a/0x60
[<ffffffff8156d951>] device_del+0x101/0x320
[<ffffffff8156db8a>] device_unregister+0x1a/0x60
[<ffffffff8156dc4c>] device_destroy+0x3c/0x50
[<ffffffff815cd295>] nvme_uninit_ctrl+0x45/0xa0
[<ffffffff815d4858>] nvme_remove+0x78/0x110
[<ffffffff81452b69>] pci_device_remove+0x39/0xb0
[<ffffffff81572935>] device_release_driver_internal+0x155/0x210
[<ffffffff81572a02>] device_release_driver+0x12/0x20
[<ffffffff815d36fb>] nvme_remove_dead_ctrl_work+0x6b/0x70
[<ffffffff810bf3bc>] process_one_work+0x18c/0x3a0
[<ffffffff810bf61e>] worker_thread+0x4e/0x3b0
[<ffffffff810c5ac9>] kthread+0x109/0x140
[<ffffffff8185800c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Fixes: c5552fde102fc("nvme: Enable autonomous power state transitions")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
They have known firmware bugs. A fix is apparently in the works --
once fixed firmware is available, someone from Intel (Hi, Keith!)
can adjust the quirk accordingly.
Cc: stable@vger.kernel.org # v4.11
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Mario Limonciello <mario_limonciello@dell.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently only the PCIe driver supports metadata, so we should not claim
integrity support for the other drivers. This prevents nasty crashes
with targets that advertise metadata support on fabrics.
Also use the opportunity to factor out some code into a separate helper
that isn't even compiled if CONFIG_BLK_DEV_INTEGRITY is disabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
So that we can have more flags for transport-specific behavior.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
This is what most of the code already does and gives much more useful
prefixes than the device embedded in the pci_dev.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
fix extra controller reference taken on reconnect by moving
reference to initial controller create
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
correct nvme status set on abort. Patch that changed status to being actual
nvme status crossed in the night with the patch that added abort values.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Per the review by Sagi on:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html
Looked at existing warn vs info vs err dev_xxx levels for the messages
printed on reconnects and deletes:
- Resets due to error and resets transitioned to deletes are dev_warn
- Other reset/disconnect messages are dev_info
- Removed chatty io queue related messages
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Per the recommendation by Sagi on:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html
An extra reference was pointed out. There's no issue with the
references, but rather a literal interpretation of what the comment
is saying.
Reword the comment to avoid confusion.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sync with Sagi's recent addition of ctrl_loss_tmo in the core fabrics
layer.
Remove local connect limits and connect_attempts variable.
Use fabrics new nr_connects variable and use of nvmf_should_reconnect()
Refactor duplicate reconnect failure code.
Addresses review comment by Sagi on controller reset support:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove the local copy of reconnect_delay.
Use the value in the controller options directly.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
NVMe may add request into requeue list simply and not kick off the
requeue if hw queues are stopped. Then blk_mq_abort_requeue_list()
is called in both nvme_kill_queues() and nvme_ns_remove() for
dealing with this issue.
Unfortunately blk_mq_abort_requeue_list() is absolutely a
race maker, for example, one request may be requeued during
the aborting. So this patch just calls blk_mq_kick_requeue_list() in
nvme_kill_queues() to handle this issue like what nvme_start_queues()
does. Now all requests in requeue list when queues are stopped will be
handled by blk_mq_kick_requeue_list() when queues are restarted, either
in nvme_start_queues() or in nvme_kill_queues().
Cc: stable@vger.kernel.org
Reported-by: Zhang Yi <yizhan@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Inside nvme_kill_queues(), we have to start hw queues for
draining requests in sw queues, .dispatch list and requeue list,
so use blk_mq_start_hw_queues() instead of blk_mq_start_stopped_hw_queues()
which only run queues if queues are stopped, but the queues may have
been started already, for example nvme_start_queues() is called in reset work
function.
blk_mq_start_hw_queues() run hw queues in current context, instead
of running asynchronously like before. Given nvme_kill_queues() is
run from either remove context or reset worker context, both are fine
to run hw queue directly. And the mutex of namespaces_mutex isn't a
problem too becasue nvme_start_freeze() runs hw queue in this way
already.
Cc: stable@vger.kernel.org
Reported-by: Zhang Yi <yizhan@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In the case of small NVMe-oF queue size (<32) we may enter a deadlock
caused by the fact that the IB completions aren't sent waiting for 32
and the send queue will fill up.
The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
This patch changes the way the signaling is done so that it depends on
the queue depth now. The magic define has been removed completely.
Cc: stable@vger.kernel.org
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Signed-off-by: Samuel Jones <sjones@kalray.eu>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Pull block fixes from Jens Axboe:
"A small collection of fixes that should go into this cycle.
- a pull request from Christoph for NVMe, which ended up being
manually applied to avoid pulling in newer bits in master. Mostly
fibre channel fixes from James, but also a few fixes from Jon and
Vijay
- a pull request from Konrad, with just a single fix for xen-blkback
from Gustavo.
- a fuseblk bdi fix from Jan, fixing a regression in this series with
the dynamic backing devices.
- a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().
- a request leak fix for drbd from Lars, fixing a regression in the
last series with the kref changes. This will go to stable as well"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: release the sq ref on rdma read errors
nvmet-fc: remove target cpu scheduling flag
nvme-fc: stop queues on error detection
nvme-fc: require target or discovery role for fc-nvme targets
nvme-fc: correct port role bits
nvme: unmap CMB and remove sysfs file in reset path
blktrace: fix integer parse
fuseblk: Fix warning in super_setup_bdi_name()
block: xen-blkback: add null check to avoid null pointer dereference
drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
On rdma read errors, release the sq ref that was taken
when the req was initialized. This avoids a hang in
nvmet_sq_destroy() when the queue is being freed.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Per the recommendation by Sagi on:
http://lists.infradead.org/pipermail/linux-nvme/2017-April/009261.html
Rather than waiting for reset work thread to stop queues and abort the ios,
immediately stop the queues on error detection. Reset thread will restop
the queues (as it's called on other paths), but it does not appear to have
a side effect.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to create an association, the remoteport must be
serving either a target role or a discovery role.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
CMB doesn't get unmapped until removal while getting remapped on every
reset. Add the unmapping and sysfs file removal to the reset path in
nvme_pci_disable to match the mapping path in nvme_pci_enable.
Fixes: 202021c1a ("nvme : Add sysfs entry for NVMe CMBs when appropriate")
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
Cc: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A smaller collection of fixes that should go into -rc1. This contains:
- A fix from Christoph, fixing a regression with the WRITE_SAME and
partial completions. Caused a BUG() on ppc.
- Fixup for __blk_mq_stop_hw_queues(), it should be static. From
Colin.
- Removal of dmesg error messages on elevator switching, when invoked
from sysfs. From me.
- Fix for blk-stat, using this_cpu_ptr() in a section only protected
by rcu_read_lock(). This breaks when PREEMPT_RCU is enabled. From
me.
- Two fixes for BFQ from Paolo, one fixing a crash and one updating
the documentation.
- An error handling lightnvm memory leak, from Rakesh.
- The previous blk-mq hot unplug lock reversal depends on the CPU
hotplug rework that isn't in mainline yet. This caused a lockdep
splat when people unplugged CPUs with blk-mq devices. From Wanpeng.
- A regression fix for DIF/DIX on blk-mq. From Wen"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: handle partial completions for special payload requests
blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
blk-stat: don't use this_cpu_ptr() in a preemptable section
elevator: remove redundant warnings on IO scheduler switch
block, bfq: stress that low_latency must be off to get max throughput
block, bfq: use pointer entity->sched_data only if set
nvme: lightnvm: fix memory leak
blk-mq: make __blk_mq_stop_hw_queues static
lightnvm: remove unused rq parameter of nvme_nvm_rqtocmd() to kill warning
block/mq: fix potential deadlock during cpu hotplug
Free up kmalloc allocated memory if failure happens while handling L2P
table transfer in nvme_nvm_get_l2p_tbl.
Fixes: 8e79b5cb ("lightnvm: move block provisioning to targets")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZEHmsAAoJEFmIoMA60/r88SgQAJbFddueb0+DfJ+USDud4b/Z
akfS+G1UAm+TgtMyh1wM49dHzFssp36uWJxtWI+bPqBzuy94PMCbz7JVUV28gX9G
tFhFuc5YH94I/3y85rbZnolb6uZN9MhLjzTFqDC9ilW6HFqmwK4t4wlHSCjQN1St
svLYvs2G6n6/VK3Fre7/wOvdZ1erG4Qod+kn5Tx3K5TQydmRlaSBfK+DRANuDBkM
KzGO7Bkc/Cx8hb9pHmaey/wxmNrrgmVjTtWrEnb2tEq833zP4h6GhUIJEKodMSi5
gXPNZgKlu3n5L592M0UCh4EoHejzkv9wrcsoDm+djmsc5Zg2Howq4kAdHP8k4hUG
0gt8n0ni9vhJN56jikrGi7cAdHCKSNnx2Ue/qTCbX0ncB3XUMuJxJwCsgW/6wa9f
oU7tRtTS03UltnKoFAcyYclS4TaSY4SA4ySaK6Hi+cRkdVFDdyHQYbHHNSU7MsA+
IS2tXvGoIdSYyrZMHSRcl2rRTfYQUkmPEvBF3LvqZr32M4mJMmUNAPLZaly373ZE
iwq0ZJlrLeM0cqdFIG3S60RtJyQk/HBN1NMqrYHArWOxvWIgNd5F8NCsTTxY3wU3
IxgBIuUFcbVwVkqEHGs8K5AvB3oghqdnA3eGOV79799eMtLn3LOvyIlpHMSw9WUq
ags00JtMLitfNPBH3eSl
=eE4D
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- add framework for supporting PCIe devices in Endpoint mode (Kishon
Vijay Abraham I)
- use non-postable PCI config space mappings when possible (Lorenzo
Pieralisi)
- clean up and unify mmap of PCI BARs (David Woodhouse)
- export and unify Function Level Reset support (Christoph Hellwig)
- avoid FLR for Intel 82579 NICs (Sasha Neftin)
- add pci_request_irq() and pci_free_irq() helpers (Christoph Hellwig)
- short-circuit config access failures for disconnected devices (Keith
Busch)
- remove D3 sleep delay when possible (Adrian Hunter)
- freeze PME scan before suspending devices (Lukas Wunner)
- stop disabling MSI/MSI-X in pci_device_shutdown() (Prarit Bhargava)
- disable boot interrupt quirk for ASUS M2N-LR (Stefan Assmann)
- add arch-specific alignment control to improve device passthrough by
avoiding multiple BARs in a page (Yongji Xie)
- add sysfs sriov_drivers_autoprobe to control VF driver binding
(Bodong Wang)
- allow slots below PCI-to-PCIe "reverse bridges" (Bjorn Helgaas)
- fix crashes when unbinding host controllers that don't support
removal (Brian Norris)
- add driver for MicroSemi Switchtec management interface (Logan
Gunthorpe)
- add driver for Faraday Technology FTPCI100 host bridge (Linus
Walleij)
- add i.MX7D support (Andrey Smirnov)
- use generic MSI support for Aardvark (Thomas Petazzoni)
- make Rockchip driver modular (Brian Norris)
- advertise 128-byte Read Completion Boundary support for Rockchip
(Shawn Lin)
- advertise PCI_EXP_LNKSTA_SLC for Rockchip root port (Shawn Lin)
- convert atomic_t to refcount_t in HV driver (Elena Reshetova)
- add CPU IRQ affinity in HV driver (K. Y. Srinivasan)
- fix PCI bus removal in HV driver (Long Li)
- add support for ThunderX2 DMA alias topology (Jayachandran C)
- add ThunderX pass2.x 2nd node MCFG quirk (Tomasz Nowicki)
- add ITE 8893 bridge DMA alias quirk (Jarod Wilson)
- restrict Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices
(Manish Jaggi)
* tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (146 commits)
PCI: Don't allow unbinding host controllers that aren't prepared
ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP
MAINTAINERS: Add PCI Endpoint maintainer
Documentation: PCI: Add userguide for PCI endpoint test function
tools: PCI: Add sample test script to invoke pcitest
tools: PCI: Add a userspace tool to test PCI endpoint
Documentation: misc-devices: Add Documentation for pci-endpoint-test driver
misc: Add host side PCI driver for PCI test function device
PCI: Add device IDs for DRA74x and DRA72x
dt-bindings: PCI: dra7xx: Add DT bindings to enable unaligned access
PCI: dwc: dra7xx: Workaround for errata id i870
dt-bindings: PCI: dra7xx: Add DT bindings for PCI dra7xx EP mode
PCI: dwc: dra7xx: Add EP mode support
PCI: dwc: dra7xx: Facilitate wrapper and MSI interrupts to be enabled independently
dt-bindings: PCI: Add DT bindings for PCI designware EP mode
PCI: dwc: designware: Add EP mode support
Documentation: PCI: Add binding documentation for pci-test endpoint function
ixgbe: Use pcie_flr() instead of duplicating it
IB/hfi1: Use pcie_flr() instead of duplicating it
PCI: imx6: Fix spelling mistake: "contol" -> "control"
...
With gcc 4.1.2:
drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_submit_io’:
drivers/nvme/host/lightnvm.c:498: warning: ‘rq’ is used uninitialized in this function
Indeed, since commit 2e13f33a24 ("lightnvm: create cmd before
allocating request"), the request is passed to nvme_nvm_rqtocmd() before
it is allocated.
Fortunately, as of commit 91276162de ("lightnvm: refactor end_io
functions for sync"), nvme_nvm_rqtocmd () no longer uses the passed
request, so this warning is a false positive.
Drop the unused parameter to clean up the code and kill the warning.
Fixes: 2e13f33a24 ("lightnvm: create cmd before allocating request")
Fixes: 91276162de ("lightnvm: refactor end_io functions for sync")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Create nvme command before allocating a request using
nvme_alloc_request, which uses the command direction. Up until now, the
command has been zeroized, so all commands have been allocated as a
read operation.
Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superflous check in
mtip32xx it was unused anyway.
Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph writes:
"A couple more updates for 4.12. The biggest pile is fc and lpfc
updates from James, but there are various small fixes and cleanups as
well."
Fixes up a few merge issues, and also a warning in
lpfc_nvmet_rcv_unsol_abort() if CONFIG_NVME_TARGET_FC isn't enabled.
Signed-off-by: Jens Axboe <axboe@fb.com>
This function just returns the same error code and sense data as
the default statement in the switch in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
The current command submission code uses a sector-based value when
considering the maximum number of blocks per command. With a
4k-formatted namespace and a command exceeding max hardware limits, this
calculation doesn't split IOs which should be split and fails in the
nvme layer. This patch fixes that calculation and enables IO splitting
in these circumstances.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Do not call nvmf_free_options() from the nvme_fc_ctlr destructor if
nvme_fc_create_ctrl() returns an error, because nvmf_create_ctrl()
frees the options when an error is returned.
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We're probably going to be stuck quirking APST off on an over-broad
range of devices for 4.11. Let's make it easy to override the quirk
for testing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Debugging APST is currently a bit of a pain. This gives optional
simple log messages that describe the APST state.
The easiest way to use this is probably with the nvme_core.dyndbg=+p
module parameter.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
There was a typo in the description of the timeout heuristic.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
That's what it's used as.
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
The passed in desc_len is a big endian value, so mark it as such.
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
This patch actually does quite a few things. When looking to add
controller reset support, the organization modeled after rdma was
very fragmented. rdma duplicates the reset and teardown paths and does
different things to the block layer on the two paths. The code to build
up the controller is also duplicated between the initial creation and
the reset/error recovery paths. So I decided to make this sane.
I reorganized the controller creation and teardown so that there is a
connect path and a disconnect path. Initial creation obviously uses
the connect path. Controller teardown will use the disconnect path,
followed last access code. Controller reset will use the disconnect
path to stop operation, and then the connect path to re-establish
the controller.
Along the way, several things were fixed
- aens were not properly set up. They are allocated differently from
the per-request structure on the blk queues.
- aens were oddly torn down. the prior patch corrected to abort, but
we still need to dma unmap and free relative elements.
- missed a few ref counting points: in aen completion and on i/o's
that fail
- controller initial create failure paths were still confused vs teardown
before converting to ref counting vs after we convert to refcounting.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add abort support for aens. Commonized the op abort to apply to aen or
real ios (caused some reorg/routine movement). Abort path sets termination
flag in prep for next patch that will be watching i/o abort completion
before proceeding with controller teardown.
Now that we're aborting aens, the "exit" code that simply cleared out
their context no longer applies.
Also clarified how we detect an AEN vs a normal io - by a flag, not
by whether a rq exists or the a rqno is out of range.
Note: saw some interesting cases where if the queues are stopped and
we're waiting for the aborts, the core layer can call the complete_rq
callback for the io. So the io completion synchronizes link side completion
with possible blk layer completion under error.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The code validates the command_id in the response to the original
sqe command. But prior code was using the rq->rqno as the sqe command
id. The core layer overwrites what the transport set there originally.
Use the actual sqe content.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>