Commit Graph

196 Commits

Author SHA1 Message Date
Helen Koike f9f38e3338 nvme: improve performance for virtual NVMe devices
This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefy here. The main idea for
the patch is provide the device two memory location locations:
 1) to store the doorbell values so they can be lookup without the doorbell
    MMIO write
 2) to store an event index.
I believe the doorbell value is obvious, the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless you are writing past this
value.

FYI: doorbell values are written by the nvme driver (guest OS) and the
event index is written by the virtual device (host OS).

The patch implements a new admin command that will communicate where
these two memory locations reside. If the command fails, the nvme
driver will work as before without any optimizations.

Contributions:
  Eric Northup <digitaleric@google.com>
  Frank Swiderski <fes@google.com>
  Ted Tso <tytso@mit.edu>
  Keith Busch <keith.busch@intel.com>

Just to give an idea on the performance boost with the vendor
extension: Running fio [1], a stock NVMe driver I get about 200K read
IOPs with my vendor patch I get about 1000K read IOPs. This was
running with a null device i.e. the backing device simply returned
success on every read IO request.

[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30
  --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
  --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
  --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson <rlnelson@google.com>
[mlin: port for upstream]
Signed-off-by: Ming Lin <mlin@kernel.org>
[koike: updated for upstream]
Signed-off-by: Helen Koike <helen.koike@collabora.co.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2017-04-21 16:41:47 +02:00
Keith Busch 81c1cd9835 nvme/pci: Don't set reserved SQ create flags
The QPRIO field is only valid if weighted round robin arbitration is used,
and this driver doesn't enable that controller configuration option.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-21 16:41:46 +02:00
Christoph Hellwig 27fa9bc545 nvme: split nvme status from block req->errors
We want our own clearly defined error field for NVMe passthrough commands,
and the request errors field is going away in its current form.

Just store the status and result field in the nvme_request field from
hardirq completion context (using a new helper) and then generate a
Linux errno for the block layer only when we actually need it.

Because we can't overload the status value with a negative error code
for cancelled command we now have a flags filed in struct nvme_request
that contains a bit for this condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig e850fd16f7 nvme: implement REQ_OP_WRITE_ZEROES
But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig 987f699a8f nvme: move ->retries setup to nvme_setup_cmd
->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command.  We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig 77f02a7acd nvme: factor request completion code into a common helper
This avoids duplicating the logic four times, and it also allows to keep
some helpers static in core.c or just opencode them.

Note that this loses printing the aborted status on completions in the
PCI driver as that uses a data structure not available any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Eric Biggers f363b089be blk-mq: constify struct blk_mq_ops
Constify all instances of blk_mq_ops, as they are never modified.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31 08:28:58 -06:00
Keith Busch 302ad8cc09 nvme: Complete all stuck requests
If the nvme driver is shutting down its controller, the drievr will not
start the queues up again, preventing blk-mq's hot CPU notifier from
making forward progress.

To fix that, this patch starts a request_queue freeze when the driver
resets a controller so no new requests may enter. The driver will wait
for frozen after IO queues are restarted to ensure the queue reference
can be reinitialized when nvme requests to unfreeze the queues.

If the driver is doing a safe shutdown, the driver will wait for the
controller to successfully complete all inflight requests so that we
don't unnecessarily fail them. Once the controller has been disabled,
the queues will be restarted to force remaining entered requests to end
in failure so that blk-mq's hot cpu notifier may progress.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:59 -07:00
Shaohua Li d3af3ecdc6 nvme: allocate nvme_queue in correct node
nvme_queue is per-cpu queue (mostly). Allocating it in node where blk-mq
will use it.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Scott Bauer e286bcfc59 nvme/pci: re-check security protocol support after reset
A device may change capabilities after each reset, e.g. due to a firmware
upgrade.  We should thus check for Security Send/Receive and OPAL support
after each reset.

Based on patches from Christoph and Keith.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-23 11:55:43 -07:00
Daniel Roschka 124298bd03 nvme: detect NVMe controller in recent MacBooks
Adds support for detection of the NVMe controller found in the
following recent MacBooks:
- Retina MacBook 2016 (MacBook9,1)
- 13" MacBook Pro 2016 without Touch Bar (MacBook13,1)
- 13" MacBook Pro 2016 with Touch Bar (MacBook13,2)

Signed-off-by: Daniel Roschka <danielroschka@phoenitydawn.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 15:17:29 -07:00
Keith Busch 9ef3932e25 nvme/pci: No special case for queue busy on IO
This driver previously required we have a special check for IO submitted
to nvme IO queues that are temporarily suspended. That is no longer
necessary since blk-mq provides a quiesce, so any IO that actually gets
submitted to such a queue must be ended since the queue isn't going to
start back up.

This is fixing a condition where we have fewer IO queues after a
controller reset. This may happen if the number of CPU's has changed,
or controller firmware update changed the queue count, for example.

While it may be possible to complete the IO on a different queue, the
block layer does not provide a way to resubmit a request on a different
hardware context once the request has entered the queue. We don't want
these requests to be stuck indefinitely either, so ending them in error
is our only option at the moment.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Keith Busch 6db28eda26 nvme/pci: Disable on removal when disconnected
If the device is not present, the driver should disable the queues
immediately. Prior to this, the driver was relying on the watchdog timer
to kill the queues if requests were outstanding to the device, and that
just delays removal up to one second.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Jens Axboe 818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Jens Axboe 6010720da8 Merge branch 'for-4.11/block' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:06:45 -07:00
Scott Bauer 8a9ae52328 nvme: Check for Security send/recv support before issuing commands.
We need to verify that the controller supports the security
commands before actually trying to issue them.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
[hch: moved the check so that we don't call into the OPAL code if not
      supported]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:49 -07:00
Christoph Hellwig 4f1244c829 block/sed-opal: allocate struct opal_dev dynamically
Insted of bloating the containing structure with it all the time this
allocates struct opal_dev dynamically.  Additionally this allows moving
the definition of struct opal_dev into sed-opal.c.  For this a new
private data field is added to it that is passed to the send/receive
callback.  After that a lot of internals can be made private as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:47 -07:00
Scott Bauer a98e58e54f nvme: Add Support for Opal: Unlock from S3 & Opal Allocation/Ioctls
This patch implements the necessary logic to unlock an Opal
enabled device coming back from an S3.

The patch also implements the SED/Opal allocation necessary to support
the opal ioctls.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-06 09:44:21 -07:00
Christoph Hellwig 57292b58dd block: introduce blk_rq_is_passthrough
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:34 -07:00
Keith Busch 7bf7d77862 nvme/pci: Don't mark IOD as aborted if abort wasn't sent
This patch sets the aborted flag only if an abort was sent, reducing
excessive kernel message spamming for completed IO that wasn't actually
aborted.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:34:47 -07:00
Jens Axboe f924ba70c1 Merge branch 'for-4.11/block' into for-4.11/rq-refactor
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 15:08:31 -07:00
Jens Axboe d348499138 blk-mq-sched: allow setting of default IO scheduler
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:31 -07:00
Christoph Hellwig b131c61d62 nvme: use blk_rq_payload_bytes
The new blk_rq_payload_bytes generalizes the payload length hacks
that nvme_map_len did before.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Jens Axboe 8e5d31eb02 Merge branch 'nvme-4.10' of git://git.infradead.org/nvme into for-linus
Christoph writes:

The most significant one is that we've agreed on shared maintaince and
a common repository for the PCIe NVMe driver and NVMe over Fabrics.  The
target code still only has a subset of the maintainers but goes through
the same tree as well.  Keith, Sagi and me will take turns at collecting
patches and sending you pull requests.
2016-12-22 11:54:46 -07:00
Keith Busch ff13b39ecf nvme/pci: Delete misleading queue-wrap comment
It is not theoretically possible for this driver to wrap twice while
processing completions. The driver allocates only 'queue_depth - 1'
tags, so there can never be more than that to reap when processing a
completion queue. Removing this misleading comment makes it a little
less likely people with broken controllers will blame the driver for
their spurious interrupts.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:23 +01:00
Max Gurtovoy 9fa196e7fc nvme/pci: Fix whitespace problem
Convert to tabs and remove unneeded whitespaces.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:18 +01:00
Stephen Bates c965809c66 nvme : Use correct scnprintf in cmb show
Make sure we are using the correct scnprintf in the sysfs show
function for the CMB.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-19 08:35:24 -07:00
Andy Lutomirski d2a6191840 nvme/pci: Log PCI_STATUS when the controller dies
When debugging nvme controller crashes, it's nice to know whether
the controller died cleanly so that the failure is just reflected in
CSTS, it died and put an error in PCI_STATUS, or whether it died so
badly that it stopped responding to PCI configuration space reads.

I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
"SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>

Fixed up white space and hunk reject.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-13 21:07:08 -07:00
Linus Torvalds b92e09bb5b Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:

 - Adam added opt-in ATA command priority support.

 - There are machines which hide multiple nvme devices behind an ahci
   BAR. Dan Williams proposed a solution to force-switch the mode but
   deemed too hackishd. People are gonna discuss the proper way to
   handle the situation in nvme standard meetings. For now, detect and
   warn about the situation.

 - Low level driver specific changes.

Christoph Hellwig pipes in about the hidden nvme warning:
 "I wish that was the case. We've pretty much agreed that we'll want to
  implement it as a virtual PCIe root bridge, similar to Intels other
  'innovation' VMD that we work around that way.

  But Intel management has apparently decided that they don't want to
  spend more cycles on this now that Lenovo has an optional BIOS that
  doesn't force this broken mode anymore, and no one outside of Intel
  has enough information to implement something like this.

  So for now I guess this warning is it, until Intel reconsideres and
  spends resources on fixing up the damage their Chipset people caused"

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ahci: warn about remapped NVMe devices
  ahci-remap.h: add ahci remapping definitions
  nvme: move NVMe class code to pci_ids.h
  pata: imx: support controller modes up to PIO4
  pata: imx: add support of setting timings for PIO modes
  pata: imx: set controller PIO mode with .set_piomode callback
  pata: imx: sort headers out
  ata: set ncq_prio_enabled iff device has support
  ata: ATA Command Priority Disabled By Default
  ata: Enabling ATA Command Priorities
  block: Add iocontext priority to request
  ahci: qoriq: added ls1046a platform support
2016-12-13 13:26:24 -08:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Christoph Hellwig f9d03f96b9 block: improve handling of the magic discard payload
Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 08:30:51 -07:00
James Smart 721b3917c4 nvme-fabrics: set sqe.command_id in core not transports
Currently, core.c sets command_id only on rd/wr commands, leaving it to
the transport to set it again to ensure the request had a command id.

Move location of set in core so applies to all commands.
Remove transport sets.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Christoph Hellwig a2e7eefd56 nvme: move NVMe class code to pci_ids.h
We'll need to check for it in the AHCI drivers (yes, really) soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-12-05 14:31:23 -05:00
Keith Busch d48756228e nvme/pci: Don't free queues on error
The nvme_remove function tears down all allocated resources in the correct
order, so no need to free queues on error during initialization. This
fixes possible use-after-free errors when queues are still associated
with a blk-mq hctx.

Reported-by: Scott Bauer <scott.bauer@intel.com>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimbeg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 12:39:57 -07:00
Omar Sandoval bac0000af5 nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK
Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-15 12:50:11 -07:00
Christoph Hellwig 7bf58533a0 nvme: don't pass the full CQE to nvme_complete_async_event
We only need the status and result fields, and passing them explicitly
makes life a lot easier for the Fibre Channel transport which doesn't
have a full CQE for the fast path case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:26 -07:00
Christoph Hellwig d49187e97e nvme: introduce struct nvme_request
This adds a shared per-request structure for all NVMe I/O.  This structure
is embedded as the first member in all NVMe transport drivers request
private data and allows to implement common functionality between the
drivers.

The first use is to replace the current abuse of the SCSI command
passthrough fields in struct request for the NVMe command passthrough,
but it will grow a field more fields to allow implementing things
like common abort handlers in the future.

The passthrough commands are handled by having a pointer to the SQE
(struct nvme_command) in struct nvme_request, and the union of the
possible result fields, which had to be turned from an anonymous
into a named union for that purpose.  This avoids having to pass
a reference to a full CQE around and thus makes checking the result
a lot more lightweight.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:24 -07:00
Christoph Hellwig e806402130 block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Linus Torvalds ecd06f2883 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes that missed the merge window, mostly due to me being
  away around that time.

  Nothing major here, a mix of nvme cleanups and fixes, and one fix for
  the badblocks handling"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: use symbolic constants for CNS values
  nvme: use symbolic constants for CNS values
  nvme.h: add an enum for cns values
  nvme.h: don't use uuid_be
  nvme.h: resync with nvme-cli
  nvme: Add tertiary number to NVME_VS
  nvme : Add sysfs entry for NVMe CMBs when appropriate
  nvme: don't schedule multiple resets
  nvme: Delete created IO queues on reset
  nvme: Stop probing a removed device
  badblocks: fix overlapping check for clearing
2016-10-21 10:54:01 -07:00
Gabriel Krisman Bertazi 8ef2074d28 nvme: Add tertiary number to NVME_VS
NVMe 1.2.1 specification adds a tertiary element to the version number.
This updates the macro and its callers to include the final number and
fixup a single place in nvmet where the version was generated manually.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Stephen Bates 202021c1a6 nvme : Add sysfs entry for NVMe CMBs when appropriate
Add a sysfs attribute that contains salient information about the NVMe
Controller Memory Buffer when one is present. For now, just display the
information about the CMB available from the control registers. We attach
the CMB attribute file to the existing nvme_ctrl sysfs group so it can
handle the sysfs teardown.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by Jon Derrick: <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 11:10:28 -06:00
Keith Busch c5f6ce97c1 nvme: don't schedule multiple resets
The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:

	WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.

This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg<sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:24:39 -06:00
Keith Busch 7065906096 nvme: Delete created IO queues on reset
The driver was decrementing the online_queues prior to attempting to
delete those IO queues, so the driver ended up not requesting the
controller delete any. This patch saves the online_queues prior to
suspending them, and adds that parameter for deleting io queues.

Fixes: c21377f8 ("nvme: Suspend all queues before deletion")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 09:22:16 -06:00
Mauricio Faria de Oliveira 2b6b535d91 nvme: use the DMA_ATTR_NO_WARN attribute
Use the DMA_ATTR_NO_WARN attribute for the dma_map_sg() call of the nvme
driver that returns BLK_MQ_RQ_QUEUE_BUSY (not for BLK_MQ_RQ_QUEUE_ERROR).

Link: http://lkml.kernel.org/r/1470092390-25451-4-git-send-email-mauricfo@linux.vnet.ibm.com
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Linus Torvalds 12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Christoph Hellwig dca51e7892 nvme: switch to use pci_alloc_irq_vectors
Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interupt affinity assignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig 7d7e0f90b7 blk-mq: remove ->map_queue
All drivers use the default, so provide an inline version of it.  If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Wenbo Wang 015282c9eb nvme/quirk: Add a delay before checking device ready for memblaze device
Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-08 10:17:45 -06:00
Gabriel Krisman Bertazi 82469c59d2 nvme: Don't suspend admin queue that wasn't created
This fixes a regression in my previous commit c21377f836 ("nvme:
Suspend all queues before deletion"), which provoked an Oops in the
removal path when removing a device that became IO incapable very early
at probe (i.e. after a failed EEH recovery).

Turns out, if the error occurred very early at the probe path, before
even configuring the admin queue, we might try to suspend the
uninitialized admin queue, accessing bad memory.

Fixes: c21377f836 ("nvme: Suspend all queues before deletion")
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-07 08:53:21 -06:00
Gabriel Krisman Bertazi c21377f836 nvme: Suspend all queues before deletion
When nvme_delete_queue fails in the first pass of the
nvme_disable_io_queues() loop, we return early, failing to suspend all
of the IO queues.  Later, on the nvme_pci_disable path, this causes us
to disable MSI without actually having freed all the IRQs, which
triggers the BUG_ON in free_msi_irqs(), as show below.

This patch refactors nvme_disable_io_queues to suspend all queues before
start submitting delete queue commands.  This way, we ensure that we
have at least returned every IRQ before continuing with the removal
path.

[  487.529200] kernel BUG at ../drivers/pci/msi.c:368!
cpu 0x46: Vector: 700 (Program Check) at [c0000078c5b83650]
    pc: c000000000627a50: free_msi_irqs+0x90/0x200
    lr: c000000000627a40: free_msi_irqs+0x80/0x200
    sp: c0000078c5b838d0
   msr: 9000000100029033
  current = 0xc0000078c5b40000
  paca    = 0xc000000002bd7600   softe: 0        irq_happened: 0x01
    pid   = 1376, comm = kworker/70:1H
kernel BUG at ../drivers/pci/msi.c:368!
Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413
(Ubuntu/IBM 5.3.1-14ubuntu2.1) ) #104 SMP Fri Jul 29 09:20:17 CDT 2016
enter ? for help
[c0000078c5b83920] d0000000363b0cd8 nvme_dev_disable+0x208/0x4f0 [nvme]
[c0000078c5b83a10] d0000000363b12a4 nvme_timeout+0xe4/0x250 [nvme]
[c0000078c5b83ad0] c0000000005690e4 blk_mq_rq_timed_out+0x64/0x110
[c0000078c5b83b40] c00000000056c930 bt_for_each+0x160/0x170
[c0000078c5b83bb0] c00000000056d928 blk_mq_queue_tag_busy_iter+0x78/0x110
[c0000078c5b83c00] c0000000005675d8 blk_mq_timeout_work+0xd8/0x1b0
[c0000078c5b83c50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c0000078c5b83ce0] c0000000000e9148 worker_thread+0xa8/0x660
[c0000078c5b83d80] c0000000000f2090 kthread+0x110/0x130
[c0000078c5b83e30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-11 09:35:57 -06:00