Commit Graph

165 Commits

Author SHA1 Message Date
Keith Busch 1b9dbf7fe0 NVMe: Add revalidate_disk callback
This adds a callback to revalidate the disk and change its block size
and capacity if needed. Before, a user would have to remove + rescan
an entire device if they changed the logical block size using an NVMe
Format or other vendor specific command; now they can just run something
that issues the BLKRRPART IOCTL, like

 # hdparm -z /dev/nvmeXnY

This can also be used in response to the 1.2 Spec's Namespace Attribute
Change asynchronous event.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch 7be50e93fb NVMe: Fix nvmeq waitqueue entry initialization
We need to update the nvme queue's wait_queue_t entry during each
initialization since the nvme_thread may be ended and restarted when
the device is reset. If a device reset occurs during a large amount
of buffered IO, it would take a lot longer to complete the outstanding
requests due to the 1 second polling instead of waking up as completions
occur.

Fixes: b9afca3efb

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch b4ff9c8ddb NVMe: Translate NVMe status to errno
This returns a more appropriate error for the "capacity exceeded"
status. In case other NVMe statuses have a better errno, this patch adds
a convience function to translate an NVMe status code to an errno for
IO commands, defaulting to the current -EIO.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch e179729a82 NVMe: Remove duplicate compat SG_IO code
We can return -ENOIOCTLCMD and the ioctl will be handled by
fs/compat_ioctl.c instead. This removes a lot of duplicate code in the
nvme driver.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch a96d4f5c2d NVMe: Reference count pci device
If an nvme device is removed but user space has an open reference,
the nvme driver would have been holding an invalid reference to its pci
device. You may get a general protection fault on x86 h/w when the driver
uses that reference in dma_map_sg(), as is done in nvme_map_user_pages()
from the IOCTL interface.

This patch fixes the fault by taking a reference on the pci device and
holding it even after device removal until all opens on the nvme device
are closed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Andreea-Cristina Bernat 062261be4e nvme: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Sam Bradshaw 5905535610 NVMe: Correctly handle IOCTL_SUBMIT_IO when cpus > online queues
nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index.
This patch fixes the case where there are more cpus from which the ioctl
call can originate than online queues, which can happen when a device
supports or was allocated fewer interrupt vectors than exist cpu cores.

Thanks to Keith Busch for the implementation suggestion.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch 302c6727e5 NVMe: Fix filesystem sync deadlock on removal
This changes the order of deleting the gendisks so it happens after the
nvme IO queues are freed. If a device is removed while a filesystem has
associated dirty data, the removal will wait on these to complete before
proceeding from del_gendisk, which could have caused deadlock before.

The implication of this is that an orderly removal of a responsive
device won't necessarily wait for dirty data to be written, but we are
not guaranteed the device is even going to respond at this point either.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch f435c2825b NVMe: Call nvme_free_queue directly
Rather than relying on call_rcu, this patch directly frees the
nvme_queue's memory after ensuring no readers exist. Some arch specific
dma_free_coherent implementations may not be called from a call_rcu's
soft interrupt context, hence the change.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Matthew Minter <matthew_minter@xyratex.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Dan McLeran 2484f40780 NVMe: Add shutdown timeout as module parameter.
The current implementation hard-codes the shutdown timeout to 2 seconds.
Some devices take longer than this to complete a normal shutdown.
Changing the shutdown timeout to a module parameter with a default
timeout of 5 seconds.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch 7c1b245038 NVMe: Skip orderly shutdown on failed devices
Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch a67394790a NVMe: Whitespace fixes
Fixing tabs inadvertently converted to spaces.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch c81f49758a NVMe: Use pci_stop_and_remove_bus_device_locked()
Race conditions are theoretically possible between the NVMe PCI device
removal and the generic PCI bus rescan and device removal that can be
triggered via sysfs.

To avoid those race conditions make the NVMe code use
pci_stop_and_remove_bus_device_locked().

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch badc34d415 NVMe: Handling devices incapable of I/O
This is a minor refactor for handling devices that are incapable of IO.
The driver previously used special error codes to know that IO queues
are unavailable, but we have an online queue count now.

This also fixes an issue where the driver successfully sets the queue
count, but either is unable to allocate an IO queue or the device can't
create one for some reason.

If the driver can successfully enable the device and get responses to
admin commands, the driver will bring up a character device for managment
but not create block devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Dan McLeran 01079522f9 NVMe: Change nvme_enable_ctrl to set EN and manage CC thru ctrl_config.
Change the behavior of nvme_enable_ctrl to set EN.
Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl.
Remove reading of the CC register and manage the state in
dev->ctrl_config.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
[removed an unwanted write to CC]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch 1d09062460 NVMe: Mismatched host/device page size support
Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch 6fccf9383b NVMe: Async event request
Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Joe Perches 4d51abf9bc block: Use dma_zalloc_coherent
Use the zeroing function instead of dma_alloc_coherent & memset(,0,)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Mike Snitzer b277da0a8a block: disable entropy contributions for nonrot devices
Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.

Historically, all block devices have automatically made entropy
contributions.  But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
    - On SSD disks, the completion times aren't as random as they
      are for rotational drives. So it's questionable whether they
      should contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

There are more reliable sources for randomness than non-rotational block
devices.  From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-04 10:55:32 -06:00
Linus Torvalds b55b390202 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe update from Matthew Wilcox:
 "Mostly bugfixes again for the NVMe driver.  I'd like to call out the
  exported tracepoint in the block layer; I believe Keith has cleared
  this with Jens.

  We've had a few reports from people who're really pounding on NVMe
  devices at scale, hence the timeout changes (and new module
  parameters), hotplug cpu deadlock, tracepoints, and minor performance
  tweaks"

[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
  end up going away when mq conversion happens ]

* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
  NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
  NVMe: Use Log Page constants in SCSI emulation
  NVMe: Define Log Page constants
  NVMe: Fix hot cpu notification dead lock
  NVMe: Rename io_timeout to nvme_io_timeout
  NVMe: Use last bytes of f/w rev SCSI Inquiry
  NVMe: Adhere to request queue block accounting enable/disable
  NVMe: Fix nvme get/put queue semantics
  NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
  NVMe: Make admin timeout a module parameter
  NVMe: Make iod bio timeout a parameter
  NVMe: Prevent possible NULL pointer dereference
  NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
  NVMe: Update data structures for NVMe 1.2
  NVMe: Enable BUILD_BUG_ON checks
  NVMe: Update namespace and controller identify structures to the 1.1a spec
  NVMe: Flush with data support
  NVMe: Configure support for block flush
  NVMe: Add tracepoints
  NVMe: Protect against badly formatted CQEs
  ...
2014-06-15 15:58:03 -10:00
Keith Busch f3db22feb5 NVMe: Fix hot cpu notification dead lock
There is a potential dead lock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.

The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13 10:43:34 -04:00
Matthew Wilcox bd67608a61 NVMe: Rename io_timeout to nvme_io_timeout
It's positively immoral to have a global variable called 'io_timeout'.
Keep the module parameter called io_timeout, though.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 23:04:30 -04:00
Sam Bradshaw b4e75cbf13 NVMe: Adhere to request queue block accounting enable/disable
Recently, a new sysfs control "iostats" was added to selectively
enable or disable io statistics collection for request queues.  This
patch hooks that control.

IO statistics collection is rather expensive on large, multi-node
machines with drives pushing millions of iops.  Having the ability to
disable collection if not needed can improve throughput significantly.

As a data point, on a quad E5-4640, I see more than 50% throughput
improvement when io statistics accounting is disabled during heavily
multi-threaded small block random read benchmarks where device
performance is in the million iops+ range.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 22:58:54 -04:00
Keith Busch a51afb5433 NVMe: Fix nvme get/put queue semantics
The routines to get and lock nvme queues required the caller to "put"
or "unlock" them even if getting one returned NULL. This patch fixes that.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 22:58:34 -04:00
Keith Busch 9d43cf646e NVMe: Make admin timeout a module parameter
Signed-off-by: Keith Busch <keith.busch@intel.com>
[made admin_timeout static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 22:57:56 -04:00
Keith Busch 61e4ce086d NVMe: Make iod bio timeout a parameter
This was originally set to 4 times the IO timeout, but that was when
the IO timeout was 5 seconds instead of 30. 20 seconds for total time
to failure seemed more reasonable than 2 minutes for most, but other
users have requested to make this a module parameter instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 22:56:43 -04:00
Santosh Y 6808c5fb7f NVMe: Prevent possible NULL pointer dereference
kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod'
can fail. So check the return value.

Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03 16:43:30 -04:00
Keith Busch f0d54a5417 NVMe: Implement PCIe reset notification callback
Quiesce and shutdown the device prior to reset, then restart the device and
resume IO after.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-05-27 11:12:29 -06:00
Matthew Wilcox 21bd78bcf4 NVMe: Enable BUILD_BUG_ON checks
Since _nvme_check_size() wasn't being called from anywhere, the compiler
was optimising it away ... along with all the link-time build failures
that would result if any of the structures were the wrong size.  Call it
from nvme_exit() for no particular reason.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-09 22:42:39 -04:00
Keith Busch 53562be74b NVMe: Flush with data support
It is possible a filesystem may send a flush flagged bio with write
data. There is no such composite NVMe command, so the driver sends flush
and write separately.

The device is allowed to execute these commands in any order, so it was
possible the driver ends the bio after the write completes, but while the
flush is still active. We don't want to let a filesystem believe flush
succeeded before it really has; this could cause data corruption on a
power loss between these events. To fix, this patch splits the flush
and write into chained bios.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-05 10:54:02 -04:00
Keith Busch a7d2ce2832 NVMe: Configure support for block flush
This configures an nvme request_queue as flush capable if the device
has a volatile write cache present.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-05 10:53:53 -04:00
Keith Busch 3291fa57cb NVMe: Add tracepoints
Adding tracepoints for bio_complete and block_split into nvme to help
with gathering IO info using blktrace and blkparse.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-05 10:41:26 -04:00
Keith Busch 94bbac4052 NVMe: Protect against badly formatted CQEs
If a misbehaving device posts a CQE with a command id < depth but for
one that was never allocated, the command info will have a callback
function set to NULL and we don't want to try invoking that.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-05 10:41:25 -04:00
Matthew Wilcox 27e8166c31 NVMe: Improve error messages
Help people diagnose what is going wrong at initialisation time by
printing out which command has gone wrong and what the device returned.
Also fix the error message printed while waiting for reset.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2014-05-05 10:41:25 -04:00
Matthew Wilcox 8757ad65d3 NVMe: Update copyright headers
Make the copyright dates accurate and remove the final paragraph that
includes the address of the FSF.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-05-05 10:41:25 -04:00
Linus Torvalds 3e8072d48b Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver updates from Matthew Wilcox:
 "Various updates to the NVMe driver.  The most user-visible change is
  that drive hotplugging now works and CPU hotplug while an NVMe drive
  is installed should also work better"

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Retry failed commands with non-fatal errors
  NVMe: Add getgeo to block ops
  NVMe: Start-stop nvme_thread during device add-remove.
  NVMe: Make I/O timeout a module parameter
  NVMe: CPU hot plug notification
  NVMe: per-cpu io queues
  NVMe: Replace DEFINE_PCI_DEVICE_TABLE
  NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds
  NVMe: IOCTL path RCU protect queue access
  NVMe: RCU protected access to io queues
  NVMe: Initialize device reference count earlier
  NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions
2014-04-11 16:45:59 -07:00
Keith Busch edd10d3328 NVMe: Retry failed commands with non-fatal errors
For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.

Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.

The nvme_iod conviently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:11:59 -04:00
Keith Busch 4cc09e2dc4 NVMe: Add getgeo to block ops
Some programs require HDIO_GETGEO work, which requires we implement
getgeo.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:06:11 -04:00
Dan McLeran b9afca3efb NVMe: Start-stop nvme_thread during device add-remove.
Done to ensure nvme_thread is not running when there
are no devices to poll.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:04:46 -04:00
Keith Busch b355084a89 NVMe: Make I/O timeout a module parameter
Increase the default timeout to 30 seconds to match SCSI.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[use byte instead of ushort]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:04:38 -04:00
Keith Busch 33b1e95c90 NVMe: CPU hot plug notification
Registers with hot cpu notification to rebalance, and potentially allocate
additional, io queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:03:42 -04:00
Keith Busch 42f614201e NVMe: per-cpu io queues
The device's IO queues are associated with CPUs, so we can use a per-cpu
variable to map the a qid to a cpu. This provides a convienient way
to optimally assign queues to multiple cpus when the device supports
fewer queues than the host has cpus. The previous implementation may
have assigned these poorly in these situations. This patch addresses
this by sharing queues among cpus that are "close" together and should
have a lower lock contention penalty.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-10 17:03:15 -04:00
Linus Torvalds b33ce44299 Merge branch 'for-3.15/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "On top of the core pull request, here's the pull request for the
  driver related changes for 3.15.  It contains:

   - Improvements for msi-x registration for block drivers (mtip32xx,
     skd, cciss, nvme) from Alexander Gordeev.

   - A round of cleanups and improvements for drbd from Andreas
     Gruenbacher and Rashika Kheria.

   - A round of clanups and improvements for bcache from Kent.

   - Removal of sleep_on() and friends in DAC960, ataflop, swim3 from
     Arnd Bergmann.

   - Bug fix for a bug in the mtip32xx async completion code from Sam
     Bradshaw.

   - Bug fix for accidentally bouncing IO on 32-bit platforms with
     mtip32xx from Felipe Franciosi"

* 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits)
  bcache: remove nested function usage
  bcache: Kill bucket->gc_gen
  bcache: Kill unused freelist
  bcache: Rework btree cache reserve handling
  bcache: Kill btree_io_wq
  bcache: btree locking rework
  bcache: Fix a race when freeing btree nodes
  bcache: Add a real GC_MARK_RECLAIMABLE
  bcache: Add bch_keylist_init_single()
  bcache: Improve priority_stats
  bcache: Better alloc tracepoints
  bcache: Kill dead cgroup code
  bcache: stop moving_gc marking buckets that can't be moved.
  bcache: Fix moving_pred()
  bcache: Fix moving_gc deadlocking with a foreground write
  bcache: Fix discard granularity
  bcache: Fix another bug recovering from unclean shutdown
  bcache: Fix a bug recovering from unclean shutdown
  bcache: Fix a journalling reclaim after recovery bug
  bcache: Fix a null ptr deref in journal replay
  ...
2014-04-01 19:43:53 -07:00
Matthew Wilcox 6eb0d698ef NVMe: Replace DEFINE_PCI_DEVICE_TABLE
Checkpatch has started warning against using DEFINE_PCI_DEVICE_TABLE,
so replace it.  Also update the copyright date and bump the module
version number to 0.9.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-03-24 10:11:22 -04:00
Keith Busch 4f5099af4f NVMe: IOCTL path RCU protect queue access
This adds rcu protected access to a queue in the nvme IOCTL path
to fix potential races between a surprise removal and queue usage in
nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
the nvme_queue from freeing while this path is executing so it can't
sleep, and so this path will no longer wait for a available command
id should they all be in use at the time a passthrough IOCTL request
is received.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-03-24 08:54:40 -04:00
Keith Busch 5a92e700af NVMe: RCU protected access to io queues
This adds rcu protected access to nvme_queue to fix a race between a
surprise removal freeing the queue and a thread with open reference on
a NVMe block device using that queue.

The queues do not need to be rcu protected during the initialization or
shutdown parts, so I've added a helper function for raw deferencing
to get around the sparse errors.

There is still a hole in the IOCTL path for the same problem, which is
fixed in a subsequent patch.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-03-24 08:45:57 -04:00
Keith Busch fb35e914b3 NVMe: Initialize device reference count earlier
If an NVMe device becomes ready but fails to create IO queues, the driver
creates a character device handle so the device can be managed. The
device reference count needs to be initialized before creating the
character device.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-03-24 08:15:14 -04:00
Jingoo Han 671a6018db NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.

drivers/block/nvme-core.c:2541:12: warning: 'nvme_suspend' defined but not used [-Wunused-function]
drivers/block/nvme-core.c:2550:12: warning: 'nvme_resume' defined but not used [-Wunused-function]

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-03-24 08:05:28 -04:00
Alexander Gordeev be577fabf3 nvme: Use pci_enable_msi_range() and pci_enable_msix_range()
As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers
using these two interfaces need to be updated to use the
new pci_enable_msi_range()  or pci_enable_msi_exact()
and pci_enable_msix_range() or pci_enable_msix_exact()
interfaces.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-03-13 14:56:39 -06:00
Tejun Heo 9ca9737444 nvme: don't use PREPARE_WORK
PREPARE_[DELAYED_]WORK() are being phased out.  They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.

nvme_dev->reset_work is multiplexed with multiple work functions.
Introduce nvme_reset_workfn() which invokes nvme_dev->reset_workfn and
always use it as the work function and update the users to set the
->reset_workfn field instead of overriding the work function using
PREPARE_WORK().

It would probably be best to route this with other related updates
through the workqueue tree.

Compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-nvme@lists.infradead.org
2014-03-07 10:24:49 -05:00
Linus Torvalds 8352650a5c Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "Looks like I missed the merge window ...  but these are almost all
  bugfixes anyway (the ones that aren't have been baking for months)"

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Namespace use after free on surprise removal
  NVMe: Correct uses of INIT_WORK
  NVMe: Include device and queue numbers in interrupt name
  NVMe: Add a pci_driver shutdown method
  NVMe: Disable admin queue on init failure
  NVMe: Dynamically allocate partition numbers
  NVMe: Async IO queue deletion
  NVMe: Surprise removal handling
  NVMe: Abort timed out commands
  NVMe: Schedule reset for failed controllers
  NVMe: Device resume error handling
  NVMe: Cache dev->pci_dev in a local pointer
  NVMe: Fix lockdep warnings
  NVMe: compat SG_IO ioctl
  NVMe: remove deprecated IRQF_DISABLED
  NVMe: Avoid shift operation when writing cq head doorbell
2014-02-05 15:53:26 -08:00
Keith Busch 9ac27090f6 NVMe: Namespace use after free on surprise removal
An nvme block device may have open references when the device is
removed. New commands may still be sent on the removed device, so we
need to ref count the opens, return errors for new commands, and not
free the namespace and nvme_dev until all references are closed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-02-02 13:31:15 -05:00
Matthew Wilcox bdfd70fde3 NVMe: Correct uses of INIT_WORK
We need to initialise the work_struct when we initialise the rest of the
struct nvme_dev, otherwise we'll hit a lockdep warning when we remove
the device.  Use PREPARE_WORK to change the function pointer instead
of INIT_WORK.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-29 11:42:32 -05:00
Matthew Wilcox 3193f07bb7 NVMe: Include device and queue numbers in interrupt name
On larger systems with many drives, it may help debugging to know which
queue is tied to which interrupt, just by looking at /proc/interrupts.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:14:08 -05:00
Keith Busch 09ece1424f NVMe: Add a pci_driver shutdown method
We need to shut down the device cleanly when the system is being shut down.
This was in an earlier patch but was inadvertently lost during a rewrite.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:12:15 -05:00
Keith Busch a1a5ef9998 NVMe: Disable admin queue on init failure
Disable the admin queue if device fails during initialization so the
queue's irq is freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[rewritten to use nvme_free_queues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:11:38 -05:00
Matthew Wilcox 469071a37a NVMe: Dynamically allocate partition numbers
Some users need more than 64 partitions per device.  Rather than simply
increasing the number of partitions, switch to the dynamic partition
allocation scheme.

This means that minor numbers are not stable across boots, but since major
numbers aren't either, I cannot see this being a significant problem.

Tested-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:11:29 -05:00
Keith Busch 4d11542070 NVMe: Async IO queue deletion
This attempts to delete all IO queues at the same time asynchronously on
shutdown. This is necessary for a present device that is not responding;
a shutdown operation previously would take 2 minutes per queue-pair
to timeout before moving on to the next queue, making a device removal
appear to take a very long time or "hung" as reported by users.

In the previous worst case, a removal may be stuck forever until a kill
signal is given if there are more than 32 queue pairs since it would run
out of admin command IDs after over an hour of timed out sync commands
(admin queue depth is 64).

This patch will wait for the admin command timeout for all commands to
complete, so the worst case now for an unresponsive controller is 60
seconds, though that still seems like a long time.

Since this adds another way to take queues offline, some duplicate code
resulted so I moved these into more convienient functions.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[make functions static, correct line length and whitespace issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 20:07:35 -05:00
Keith Busch 0e53d18051 NVMe: Surprise removal handling
This adds checks to see if the nvme pci device was removed. The check
reads the status register for the value of -1, which it should never be
unless the device is no longer present.

If a user performs a surprise removal on an nvme device, the driver will
be notified either by the pci driver remove callback if the platform's
slot is capable of this event, or via reading the device BAR status
register, which will indicate controller failure and trigger a reset.

Either way, the device is not present so all outstanding commands would
timeout. This will not send queue deletion commands to a drive that
isn't present and fail after ioremap, significantly speeding up surprise
removal; previously this took over 2 minutes per IO queue pair created,
but this will complete removing the device within a few seconds.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:28:43 -05:00
Keith Busch c30341dc3c NVMe: Abort timed out commands
Send nvme abort command to io requests that have timed out on an
initialized device. If the command is not returned after another timeout,
schedule the controller for reset.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fix endianness issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:27:53 -05:00
Keith Busch d4b4ff8e28 NVMe: Schedule reset for failed controllers
Schedules a controller reset when it indicates it has a failed status. If
the device does not become ready after a reset, the pci device will be
scheduled for removal.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-01-27 19:20:02 -05:00
Keith Busch 9a6b94584d NVMe: Device resume error handling
Adds controller error handling on resume power management. If the device
fails to initialize, the device is queued for a reset. If the reset fails,
a thread is spawned to remove the pci device.

If the device resumes as "busy", the device is responding to admin
commands but will not create IO queues. In this case, we need to remove
the gendisks and free the IO queues since they can't be used and may be
holding bios in their lists.

From testing, the dma pools require a pci device so this had to change
the pci driver 'remove' to release the dma resources in line with that
call instead of after all references to the device are released.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:39 -05:00
Matthew Wilcox 68608c268b NVMe: Cache dev->pci_dev in a local pointer
Helps with line-length issues

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:35 -05:00
Matthew Wilcox 0a8d44cb33 NVMe: Fix lockdep warnings
During the initialisation path, the queue lock is taken without interrupt
protection.  It's perfectly safe to do so, because the interrupt handler
can't run at this point, but it confuses lockdep.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:54:34 -05:00
Keith Busch 320a382746 NVMe: compat SG_IO ioctl
For 32-bit versions of sg3-utils running on a 64-bit system. This is
mostly a copy from the relevent portions of fs/compat_ioctl.c, with
slight modifications for going through block_device_operations.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
[fixed up CONFIG_COMPAT=n build problems]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-12-16 15:49:40 -05:00
Kent Overstreet 20d0189b10 block: Introduce new bio_split()
The new bio_split() can split arbitrary bios - it's not restricted to
single page bios, like the old bio_split() (previously renamed to
bio_pair_split()). It also has different semantics - it doesn't allocate
a struct bio_pair, leaving it up to the caller to handle completions.

Then convert the existing bio_pair_split() users to the new bio_split()
- and also nvme, which was open coding bio splitting.

(We have to take that BUG_ON() out of bio_integrity_trim() because this
bio_split() needs to use it, and there's no reason it has to be used on
bios marked as cloned; BIO_CLONED doesn't seem to have clearly
documented semantics anyways.)

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Neil Brown <neilb@suse.de>
2013-11-23 22:33:57 -08:00
Kent Overstreet 7988613b0e block: Convert bio_for_each_segment() to bvec_iter
More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
2013-11-23 22:33:49 -08:00
Kent Overstreet 4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Michael Opdenacker 481e5bad84 NVMe: remove deprecated IRQF_DISABLED
This patch proposes to remove the use of the IRQF_DISABLED flag

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:13:54 -05:00
Haiyan Hu b80d5ccca3 NVMe: Avoid shift operation when writing cq head doorbell
Changes the type of dev->db_stride to unsigned and changes the value
stored there to be 1 << the current value. Then there is less
calculation to be done at completion time.

Signed-off-by: Haiyan Hu <huhaiyan@huawei.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-11-18 17:10:51 -05:00
Russell King 052d0efada DMA-API: block: nvme-core: replace dma_set_mask()+dma_set_coherent_mask() with new helper
Replace the following sequence:

	dma_set_mask(dev, mask);
	dma_set_coherent_mask(dev, mask);

with a call to the new helper dma_set_mask_and_coherent().

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-09-21 21:02:24 +01:00
Keith Busch d82e8bfdef NVMe: Merge issue on character device bring-up
A recent patch made it possible to bring up the character handle when the
device is responsive but not accepting a set-features command. Another
recent patch moved the initialization that requires we move where the
checks for this condition occur. This patch merges these two ideas so
it works much as before.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-06 16:26:58 -04:00
Keith Busch 9d713c2bfb NVMe: Handle ioremap failure
Decrement the number of queues required for doorbell remapping until
the memory is successfully mapped for that size.

Additional checks are done so that we don't call free_irq if it has
already been freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:25 -04:00
Keith Busch cd63894630 NVMe: Add pci suspend/resume driver callbacks
Used for going in and out of low power states. Resuming reuses the IO
queues from the previous initialization, freeing any allocated queues
that are no longer usable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:44:16 -04:00
Keith Busch 1894d8f16a NVMe: Use normal shutdown
The NVMe spec recommends using the shutdown normal sequence when safely
taking the controller offline instead of hitting CC.EN on the next
start-up to reset the controller. The spec recommends a minimum of 1
second for the shutdown complete. This patch waits 2 seconds to be on
the safe side.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:32 -04:00
Keith Busch f0b50732a9 NVMe: Separate controller init from disk discovery
This combines the controller initialization into one function, removing
IO queue setup from namespace discovery, and creates symetric functions
for device removal. The controller start and shutdown functions can now
be called from resume/suspend context as well as probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:40:28 -04:00
Keith Busch 2240427425 NVMe: Separate queue alloc/free from create/delete
This separates nvme queue allocation from creation, and queue deletion
from freeing. This is so that we may in the future temporarily disable
queues and reuse the same memory when bringing them back online, like
coming back from suspend state.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:28 -04:00
Keith Busch 0877cb0d28 NVMe: Group pci related actions in functions
This will make it easier to reuse these outside probe/remove.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:39:25 -04:00
Keith Busch 9e59d091b0 NVMe: Disk stats for read/write commands only
Flush and discard requests would previously mess up the accounting.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:33:35 -04:00
Keith Busch 7e03b12406 NVMe: Bring up cdev on set feature failure
This patch creates the character device as long as a device's admin queues
are usable so a user has an opprotunity to perform administration tasks.
A device may be in a state that does not allow IO and setting the queue
count feature in such a state returns an error. Previously the driver
would bail and the controller would be unusable.

Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Keith Busch 1b56749e54 NVMe: Fix checkpatch issues
Signed-off-by: Keith Busch <keith.busch@intel.com>
2013-09-03 16:32:26 -04:00
Matthew Wilcox c3bfe7176c NVMe: Namespace IDs are unsigned
The 'Number of Namespaces' read from the device was being treated as
signed, which would cause us to not scan any namespaces for a device
with more than 2 billion namespaces.  That led to noticing that the
namespace ID was also being treated as signed, which could lead to the
result from NVME_IOCTL_ID being treated as an error code.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-09-03 16:32:26 -04:00
Matthew Wilcox 7d8224574c NVMe: Call nvme_process_cq from submission path
Since we have the queue locked, it makes sense to check if there are
any completion queue entries on the queue before we release the lock.
If there are, it may save an interrupt and reduce latency for the I/Os
that happened to complete.  This happens fairly often for some workloads.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:27 -04:00
Matthew Wilcox bc57a0f7a4 NVMe: Remove "process_cq did something" message
I was originally intending to log the fact that the kthread had done
some work since it might help us find interrupt handling problems, but
that hasn't been done yet, and spamming the logs with this message is
just rude.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 13:57:08 -04:00
Matthew Wilcox e9539f4752 NVMe: Return correct value from interrupt handler
The interrupt handler currently reports whether it found any new
completion queue entries.  If the completion queue is primarily being
processed by a method other than the interrupt handler, it may return
IRQ_NONE so often that Linux thinks that the interrupt is being falsely
triggered.

To solve this problem, report whether any completion queue entries have
been seen since the last interrupt was received for this queue.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-24 11:54:20 -04:00
Keith Busch 6198221fa0 NVMe: Disk IO statistics
Add io stats accounting for bio requests so nvme block devices show
useful disk stats.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 12:06:35 -04:00
Matthew Wilcox 063a8096f3 NVMe: Restructure MSI / MSI-X setup
The current code copies 'nr_io_queues' into 'q_count', modifies
'nr_io_queues' during MSI-X setup, then resets 'nr_io_queues' for
MSI setup.  Instead, copy 'nr_io_queues' into 'vecs' and modify 'vecs'
during both MSI-X and MSI setup.

This lets us simplify the for-loops that set up MSI-X and MSI, and opens
the possibility of using more I/O queues than we have interrupt vectors,
should future benchmarking prove that to be a useful feature.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-06-20 11:09:23 -04:00
Ramachandra Rao Gajula fa08a39664 NVMe: Add MSI support
Some devices only have support for MSI, not MSI-X.  While MSI is more
limited, it still provides better performance than line-based interrupts.

Signed-off-by: Ramachandra Gajula <rama@fastorsystems.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-31 11:45:52 -04:00
Matthew Wilcox cf9f123b38 NVMe: Use dma_set_mask() correctly
In some circumstances setting a 64-bit DMA mask can fail, as explained
in Documentation/DMA-API-HOWTO.txt.  Use the recommended code sequence
to set a 32-bit DMA mask if setting a 64-bit DMA mask fails.

Reported-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-28 16:46:46 -04:00
Chayan Biswas cf90bc4830 Return the result from user admin command IOCTL even in case of failure
We copy the result to user if the command is completed from the
controller even if it completes with failure (non-zero) status.
A return status of < 0 indicates the command was not completed
by the controller. The user application may expect the error code
in the result field in case of failure.

Signed-off-by: Chayan Biswas <Chayan.Biswas@sandisk.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-23 13:38:59 -04:00
Keith Busch 053ab702cc NVMe: Do not cancel command multiple times
Cancelling an already cancelled command does not do anything, so check
the command context before cancelling it, continuing if had already been
cancelled so we do not log the same problem every second if a device
stops responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:18:38 -04:00
Wei Yongjun 1287dabd34 NVMe: fix error return code in nvme_submit_bio_queue()
nvme_submit_flush_data() might overwrite the initialisation of the
return value with 0, so move the -ENOMEM setting close to the usage.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:13:18 -04:00
Dan Carpenter 5460fc0310 NVMe: check for integer overflow in nvme_map_user_pages()
You need to have CAP_SYS_ADMIN to trigger this overflow but it makes the
static checkers complain so we should fix it.  The worry is that
"length" comes from copy_from_user() so we need to check that "length +
offset" can't overflow.

I also changed the min_t() cast to be unsigned instead of signed.  Now
that we cap "length" to INT_MAX it doesn't make a difference, but it's a
little easier for reviewers to know that large values aren't cast to
negative.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-17 09:11:03 -04:00
Keith Busch 94f370cab6 NVMe: Use user defined admin ioctl timeout
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-09 16:03:50 -04:00
Matthew Wilcox 44af146a84 NVMe: Only clear the enable bit when disabling controller
Many of the bits in the Controller Configuration register may only be
modified when the Enable bit is clear.  Clearing them at the same time
as the Enable bit might be OK, but let's play it safe and only touch the
Enable bit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:54:31 -04:00
Matthew Wilcox ba47e3865e NVMe: Wait for device to acknowledge shutdown
A recent update to the specification makes it clear that the host
is expected to wait for the device to acknowledge the Enable bit
transitioning to 0 as well as waiting for the device to acknowledge a
transition to 1.

Reported-by: Khosrow Panah <Khosrow.Panah@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:53:49 -04:00
Keith Busch 78f8d2577b NVMe: Schedule timeout for sync commands
Schedule a timeout on sync commands in case the command times out and
the device is not being polled for timeouts. This prevents device removal
from hanging forever if the device has stopped responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:36:02 -04:00
Keith Busch f410c680b5 NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
This adds support for namespaces with separate meta-data formats in the
submit io ioctl. The meta-data buffer has to be a contiguous, so such
a buffer is allocated and the mapped user pages are copied to/from this
buffer for write/read commands.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:35:09 -04:00
Keith Busch 159b67d7ae NVMe: Device specific stripe size handling
We have an nvme device that has a concept of a stripe size. IO requests
that do not transfer data crossing a stripe boundary has greater
performance compared to IO that does cross it. This patch sets the
stripe size for the device if the device and vendor ids match one with
this feature and splits IO requests that cross the stripe boundary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:41:05 -04:00
Keith Busch 427e970801 NVMe: Split non-mergeable bio requests
It is possible a bio request can not be submitted as a single NVMe IO
command if the bio_vec is not mergeable with the NVMe PRP alignement
constraints. This condition was handled by submitting an IO for the
mergeable portion then submitting a follow on IO for the remaining data
after the previous IO completes. The remainder to be sent was tracked
by manipulating the bio->bi_idx and bio->bi_sector. This patch splits
the request as many times as necessary and submits the bios together.

Since submitting the bio may cause it to be requeued on split,
nvme_resubmit_bios had to be modified to remove the wait queue when
the bio list is empty prior to submitting the bio since a split would
have added the wait queue a second time, corrupting the wait queue head
task list.

There are a few other benefits from doing this: it fixes a potential
issue with the previous handling of a non-mergeable bio as the requeuing
method could would use an unlocked nvme_queue if the callback isn't
invoked on the queue's associated cpu; it will be possible to retry a
failed bio if desired at some later time since it does not manipulate
the original bio; the bio integrity extensions require the bio to be in
its original condition for the checks to work correctly if we implement
the end-to-end data protection in the future.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:38:59 -04:00