Commit Graph

737708 Commits

Author SHA1 Message Date
Javier González 43d4712721 lightnvm: pblk: refactor init/exit sequences
Refactor init and exit sequences to eliminate dependencies among init
modules and improve readability.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Heiner Litz 9d7aa4a484 lightnvm: Avoid validation of default op value
Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl")
Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Johannes Thumshirn 40f962d78a lightnvm: centralize permission check for lightnvm ioctl
Currently all functions for handling the lightnvm core ioctl commands
do a check for CAP_SYS_ADMIN.

Change this to fail early in nvm_ctl_ioctl(), so we don't have to
duplicate the permission checks all over.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Heiner Litz a38c78d82d lightnvm: fix bad block initialization
fix reading bad block device information to correctly setup the per line
blk_bitmap during lightnvm initialization

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling 96257a8a7f nvme: lightnvm: add late setup of block size and metadata
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Due to the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let it be updated
appropriately on the LightNVM side as well.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling 89a09c5643 lightnvm: remove nvm_dev_ops->max_phys_sect
The value of max_phys_sect is always static. Instead of
defining it in the nvm_dev_ops structure, declare it as a global
value.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling af569398c3 lightnvm: remove max_rq_size
The field is no longer used.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling 62771fe0aa lightnvm: add 2.0 geometry identification
Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling c6ac3f35d4 lightnvm: flatten nvm_id_group into nvm_id
There are no groups in the 2.0 specification, make sure that the
nvm_id structure is flattened before 2.0 data structures are added.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling a04e0cf93a lightnvm: make 1.2 data structures explicit
Make the 1.2 data structures explicit, so it will be easy to identify
the 2.0 data structures. Also fix the order of which the nvme_nvm_*
are declared, such that they follow the nvme_nvm_command order.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Javier González e411b33117 lightnvm: pblk: refactor bad block identification
In preparation for the OCSSD 2.0 spec. bad block identification,
refactor the current code to generalize bad block get/set functions and
structures.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg 3c05ef115c lightnvm: pblk: prevent race in pblk_rb_flush_point_set
Make sure that we are not advancing the sync pointer while
we're adding bios to the write buffer entry completion list.

This race condition results in bios not completing and was identified
by a hang when running xfstest generic/113.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg b966c50b14 lightnvm: pblk: allow allocation of new lines during shutdown
When shutting down pblk the write buffer is flushed and if the
current line can't fit the data in the write buffer we need
to allocate a new line, so remove the check that prevents this.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg 7be970b225 lightnvm: pblk: delete writer kick timer before stopping thread
Unless we delete the timer that wakes up the write thread
before we stop the thread we risk re-starting the thread, so
delete the timer first.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg 5d149bfabe lightnvm: pblk: add padding distribution sysfs attribute
When pblk receives a sync, all data up to that point in the write buffer
must be comitted to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification
padded sectors for filling up to the minimal write size.

In order to get a better understanding of the costs involved for syncs,
Add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of sectors padded. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.

Do this by introducing counters for each possible padding:
{0..(minimal write size - 1)} and calculate the normalized distribution
when showing the attribute.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Rearranged total_buckets statement in pblk_sysfs_get_padding_dist
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling ff12581ec7 lightnvm: remove multiple groups in 1.2 data structure
Only one id group from the 1.2 specification is supported. Make
sure that only the first group is accessible.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling d8a39caee0 lightnvm: remove mlc pairs structure
The known implementations of the 1.2 specification, and upcoming 2.0
implementation all expose a sequential list of pages to write.
Remove the data structure, as it is no longer needed.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg 76758390f8 lightnvm: pblk: export write amplification counters to sysfs
In a SSD, write amplification, WA, is defined as the average
number of page writes per user page write. Write amplification
negatively affects write performance and decreases the lifetime
of the disk, so it's a useful metric to add to sysfs.

In plkb's case, the number of writes per user sector is the sum of:

    (1) number of user writes
    (2) number of sectors written by the garbage collector
    (3) number of sectors padded (i.e. due to syncs)

This patch adds persistent counters for 1-3 and two sysfs attributes
to export these along with WA calculated with five decimals:

    write_amp_mileage: the accumulated write amplification stats
                      for the lifetime of the pblk instance

    write_amp_trip: resetable stats to facilitate delta measurements,
                    values reset at creation and if 0 is written
                    to the attribute.

64-bit counters are used as a 32 bit counter would wrap around
already after about 17 TB worth of user data. It will take a
long long time before the 64 bit sector counters wrap around.

The counters are stored after the bad block bitmap in the first
emeta sector of each written line. There is plenty of space in the
first emeta sector, so we don't need to bump the major version of
the line data format.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg d0ab0b1ab9 lightnvm: pblk: check data lines version on recovery
As a preparation for future bumps of data line persistent storage
versions, we need to start checking the emeta line version during
recovery. Also slit up the current emeta/smeta version into two
bytes (major,minor).

Recovering lines with the same major number as the current pblk data
line version must succeed. This means that any changes in the
persistent format must be:

 (1) Backward compatible: if we switch back to and older
     kernel, recovery of lines stored with major == current_major
     and minor > current_minor must succeed.

 (2) Forward compatible: switching to a newer kernel,
     recovery of lines stored with major=current_major and
     minor < minor must handle the data format differences
     gracefully(i.e. initialize new data structures to default values).

If we detect lines that have a different major number than
the current we must abort recovery. The user must manually
migrate the data in this case.

Previously the version stored in the emeta header was copied
from smeta, which has version 1, so we need to set the minor
version to 1.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Hans Holmberg cfe1c9e2e2 lightnvm: pblk: handle bad sectors in the emeta area correctly
Unless we check if there are bad sectors in the entire emeta-area
we risk ending up with valid bitmap / available sector count inconsistency.
This results in lines with a bad chunk at the last LUN marked as bad,
so go through the whole emeta area and mark up the invalid sectors.

Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling 8f37d1913f lightnvm: remove chnl_offset in nvme_nvm_identity
The identity structure is initialized to zero in the beginning of
the nvme_nvm_identity function. The chnl_offset is separately set to
zero. Since both the variable and assignment is never changed, remove
them.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Markus Elfring 5da84cf603 lightnvm/pblk-gc: Delete an error message for a failed memory allocation in pblk_gc_line_prepare_ws()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Keith Busch f23f5bece6 blk-mq: Allow PCI vector offset for mapping queues
The PCI interrupt vectors intended to be associated with a queue may
not start at 0; a driver may allocate pre_vectors for special use. This
patch adds an offset parameter so blk-mq may find the intended affinity
mask and updates all drivers using this API accordingly.

Cc: Don Brace <don.brace@microsemi.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27 21:25:36 -06:00
Omar Sandoval 3148ffbdb9 loop: use killable lock in ioctls
Even after the previous patch to drop lo_ctl_mutex while calling
vfs_getattr(), there are other cases where we can end up sleeping for a
long time while holding lo_ctl_mutex. Let's avoid the uninterruptible
sleep from the ioctls.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27 14:21:12 -06:00
Omar Sandoval 2d1d4c1e59 loop: don't call into filesystem while holding lo_ctl_mutex
We hit an issue where a loop device on NFS was stuck in
loop_get_status() doing vfs_getattr() after the NFS server died, which
caused a pile-up of uninterruptible processes waiting on lo_ctl_mutex.
There's no reason to hold this lock while we wait on the filesystem;
let's drop it so that other processes can do their thing. We need to
grab a reference on lo_backing_file while we use it, and we can get rid
of the check on lo_device, which has been unnecessary since commit
a34c0ae9ebd6 ("[PATCH] loop: remove the bio remapping capability") in
the linux-history tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27 14:21:11 -06:00
Paolo Valente bc56e2cafa block, bfq: lower-bound the estimated peak rate to 1
If a storage device handled by BFQ happens to be slower than 7.5 KB/s
for a certain amount of time (in the order of a second), then the
estimated peak rate of the device, maintained in BFQ, becomes equal to
0. The reason is the limited precision with which the rate is
represented (details on the range of representable values in the
comments introduced by this commit). This leads to a division-by-zero
error where the estimated peak rate is used as divisor. Such a type of
failure has been reported in [1].

This commit addresses this issue by:
1. Lower-bounding the estimated peak rate to 1
2. Adding and improving comments on the range of rates representable

[1] https://www.spinics.net/lists/kernel/msg2739205.html

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 10:18:27 -06:00
Matias Bjørling d558fb51ad nvme: make nvme_get_log_ext non-static
Enable the lightnvm integration to use the nvme_get_log_ext()
function.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Christoph Hellwig e929f06d9e nvmet: constify struct nvmet_fabrics_ops
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Christoph Hellwig a5d1861229 nvmet: refactor configfs transport type handling
Have a common table of mappings from numerical transport ids to names, and
zero the transport specific area in common code in nvmet_addr_trtype_store.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy f871749a9f nvmet: move device_uuid configfs attr definition to suitable place
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Nitzan Carmi b435ecea2a nvme: Add .stop_ctrl to nvme ctrl ops
For consistancy reasons, any fabric-specific works
(e.g error recovery/reconnect) should be canceled in
nvme_stop_ctrl, as for all other NVMe pending works
(e.g. scan, keep alive).

The patch aims to simplify the logic of the code, as
we now only rely on a vague demand from any fabric
to flush its private workqueues at the beginning of
.delete_ctrl op.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Nitzan Carmi 187c0832ee nvme-rdma: Allow DELETING state change failure in error_recovery
While error recovery is ongoing, it is OK to move
ctrl to DELETING state (from concurrent delete_work).
Thus we don't need a warning for that case.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Keith Busch 2079699c10 nvme: Skip checking heads without namespaces
If a task is holding a reference to a namespace on a removed controller,
the head will not be released. If the same controller is added again
later, its namespaces may not be successfully added. Instead, the user
will see kernel message "Duplicate IDs for nsid <X>".

This patch fixes that by skipping heads that don't have namespaces when
considering if a new namespace is safe to add.

Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy 9bad0404ec nvme-rdma: Don't flush delete_wq by default during remove_one
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the work queue.

Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy a3dd7d0022 nvmet-rdma: Don't flush system_wq by default during remove_one
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the system work queue.

Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Israel Rukshin e1a2ee249b nvmet-rdma: Fix use after free in nvmet_rdma_cm_handler()
We free nvmet rdma queues while handling rdma_cm events.
In order to avoid this we destroy the qp and the queue after destroying
the cm_id which guarantees that all rdma_cm events are done.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Israel Rukshin be9bddeb0a nvmet-rdma: Remove unused queue state
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart 9d625f7792 nvmet_fc: prevent new io rqsts in possible isr completions
When a bio completion calls back into the transport for a
back-end io device, the request completion path can free
the transport io job structure allowing it to be reused for
other operations. The transport has a defer_rcv queue which
holds temporary cmd rcv ops while waitng for io job structures.
when the job frees, if there's a cmd waiting, it is picked up
and submitted for processing, which can call back out to the
bio path if it's a read.  Unfortunately, what is unknown is the
context of the original bio done call, and it may be in a state
(softirq) that is not compatible with submitting the new bio in
the same calling sequence. This is especially true when using
scsi back-end devices as scsi is in softirq when it makes the
done call.

Correct by scheduling the io to be started via workq rather
than calling the start new io path inline to the original bio
done path.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart 0cdd5fca87 nvme_fc: on remoteport reuse, set new nport_id and role.
When reattaching to a removed remoteport that has not yet been
fully deleted as it's waiting for reconnect timeouts, be sure to
re-set the ports nport id and role.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart b12740d316 nvme_fc: fix abort race on teardown with lld reject
Another abort race: An io request is started, becomes active,
and is attempted to be started with the lldd. At the same time
the controller is stopped/torndown and an itterator is run to
abort the ios. As the io is active, it is added to the outstanding
aborted io count.  However on the original io request thread, the
driver ends up rejecting the io due to the condition that induced
the controller teardown. The driver reject path didn't check whether
it was in the outstanding io count. This left the count outstanding
stopping controller teardown.

Correct by, in the driver reject case, setting the state to
inactive and checking whether it was in the outstanding io count.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart 041018c634 nvme_fc: io timeout should defer abort to ctrl reset
The current nvme_fc code, when an io times out, will abort the io
on the fc link, then call the error recovery routine to reset the
controller. It is during the reset of the controller that the
transport will wait for all ios to be aborted before sending a
Disconnect LS to the target.

However, the reset routine only waits for the io which it generates
the abort for to complete. Any io that was aborted just prior to the
reset isn't in it's list to wait for. Thus the Disconnect is getting
sent before the aborts have completed.

Correct by removing the abort in the timeout handler. The reset will
generate the abort. At that point the timeout handler can be simplified
to request the reset (via the error handler) and restart the timeout
timer.

Also fixes a small typo in a comment in the reset handler.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart cf25809bec nvme_fc: fix ctrl create failures racing with workq items
If there are errors during initial controller create, the transport
will teardown the partially initialized controller struct and free
the ctlr memory.  Trouble is - most of those errors can occur due
to asynchronous events happening such io timeouts and subsystem
connectivity failures. Those failures invoke async workq items to
reset the controller and attempt reconnect.  Those may be in progress
as the main thread frees the ctrl memory, resulting in NULL ptr oops.

Prevent this from happening by having the main ctrl failure thread
changing state to DELETING followed by synchronously cancelling any
pending queued work item. The change of state will prevent the
scheduling of resets or reconnect events.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jarosław Janik 467c77d4cb nvme-pci: disable APST for Samsung NVMe SSD 960 EVO + ASUS PRIME Z370-A
Yet another "incompatible" Samsung NVMe SSD 960 EVO and Asus motherboard
combination. 960 EVO device disappears from PCIe bus within few minutes
after boot-up when APST is in use and never gets back. Forcing
NVME_QUIRK_NO_APST is the only way to make this drive work with this
particular motherboard. NVME_QUIRK_NO_DEEPEST_PS doesn't work, upgrading
motherboard's BIOS didn't help either.
Since this is a desktop motherboard, the only drawback of not using APST
is increased device temperature.

Signed-off-by: Jarosław Janik <jaroslaw.janik@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy 77d0612da0 nvme: centralize ctrl removal prints
nvme_delete_ctrl can be called from various contexts in parallel,
and cause duplicated information prints, even though the specific
context doesn't perform the actual removal. Instead, print the
information when the actual removal occurs.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Keith Busch 97c122233f nvme-pci: Add .get_address ctrl callback
The nvme-fabrics exports the controller address to sysfs, and we'd
like to have parity with this feature for PCIe. This patch provides
the appropiate callback and returns the controller address as the pci
domain🚌device.function.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Matias Bjørling 70da6094a6 nvme: implement log page low/high offset and dwords
NVMe 1.2.1 extends the get log page interface to include 64 bit
offset and increases the number of dwords to 32 bits. Implement
for future use.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jianchao Wang 765cc031cd nvme: change namespaces_mutext to namespaces_rwsem
namespaces_mutext is used to synchronize the operations on ctrl
namespaces list. Most of the time, it is a read operation.

On the other hand, there are many interfaces in nvme core that
need this lock, such as nvme_wait_freeze, and even more interfaces
will be added. If we use mutex here, circular dependency could be
introduced easily. For example:
context A                  context B
nvme_xxx                   nvme_xxx
hold namespaces_mutext     require namespaces_mutext
sync context B

So it is better to change it from mutex to rwsem.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jianchao Wang 6f8e0d787e nvme: fix the dangerous reference of namespaces list
nvme_remove_namespaces and nvme_remove_invalid_namespaces reference
the ctrl->namespaces list w/o holding namespaces_mutext. It is ok
to invoke nvme_ns_remove there, but what if there is others.

To be safer, reference the ctrl->namespaces list under
namespaces_mutext.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jianchao Wang 9a915a5be7 nvme-pci: quiesce IO queues prior to disabling device HMB accesses
Quiesce IO queues prior to disabling device HMB accesses. A controller
using HMB may relay on it to efficiently complete IO commands.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Thomas Tai cf4182f3d0 Documentation: nvme: Documentation for nvme fault injection
Add examples to show how to use nvme fault injection.

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: Karl Volz <karl.volz@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00