Commit Graph

246 Commits

Author SHA1 Message Date
Steve Wise f361e5a01e nvme-rdma: destroy nvme queue rdma resources on connect failure
After address resolution, the nvme_rdma_queue rdma resources are
allocated.  If rdma route resolution or the connect fails, or the
controller reconnect times out and gives up, then the rdma resources
need to be freed.  Otherwise, rdma resources are leaked.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-04 10:00:54 +03:00
Steve Wise cdbecc8d24 nvme_rdma: keep a ref on the ctrl during delete/flush
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-09-04 10:00:53 +03:00
Sagi Grimberg 4d8c6a7946 nvme-rdma: Get rid of redundant defines
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-28 15:41:08 +03:00
Sagi Grimberg f5b7b559e1 nvme-rdma: Get rid of duplicate variable
We already have need_inval in ib_mr, lets use
that instead.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-28 15:40:54 +03:00
Christoph Hellwig aa71987472 nvme: fabrics drivers don't need the nvme-pci driver
So select the NVME_CORE symbol instead of depending on BLK_DEV_NVME.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 14:22:28 +03:00
Christoph Hellwig 98096d8a78 nvme-fabrics: get a reference when reusing a nvme_host structure
Without this we'll get a use after free after connecting two controller
using the same hostnqn and then disconnecting one of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 14:22:12 +03:00
Daniel Verkamp 7a665d2f60 nvme-fabrics: change NQN UUID to big-endian format
NVM Express 1.2.1 section 7.9, NVMe Qualified Names, specifies that the
UUID format of NQN uses a UUID based on RFC 4122.

RFC 4122 specifies that the UUID is encoded in big-endian byte order.

Switch the NVMe over Fabrics host ID field from little-endian UUID to
big-endian UUID to match the specification.

Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-19 12:00:44 +03:00
Jay Freyensee c5af8654c4 nvme-rdma: fix sqsize/hsqsize per spec
Per NVMe-over-Fabrics 1.0 spec, sqsize is represented as
a 0-based value.

Also per spec, the RDMA binding values shall be set
to sqsize, which makes hsqsize 0-based values.

Thus, the sqsize during NVMf connect() is now:

[root@fedora23-fabrics-host1 for-48]# dmesg
[  318.720645] nvme_fabrics: nvmf_connect_admin_queue(): sqsize for
admin queue: 31
[  318.720884] nvme nvme0: creating 16 I/O queues.
[  318.810114] nvme_fabrics: nvmf_connect_io_queue(): sqsize for i/o
queue: 127

Finally, current interpretation implies hrqsize is 1's based
so set it appropriately.

Reported-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:58:06 +03:00
Jay Freyensee f994d9dc28 fabrics: define admin sqsize min default, per spec
Upon admin queue connect(), the rdma qp was being
set based on NVMF_AQ_DEPTH.  However, the fabrics layer was
using the sqsize field value set for I/O queues for the admin
queue, which threw the nvme layer and rdma layer off-whack:

root@fedora23-fabrics-host1 nvmf]# dmesg
[ 3507.798642] nvme_fabrics: nvmf_connect_admin_queue():admin sqsize
being sent is: 128
[ 3507.798858] nvme nvme0: creating 16 I/O queues.
[ 3507.896407] nvme nvme0: new ctrl: NQN "nullside-nqn", addr
192.168.1.3:4420

Thus, to have a different admin queue value, we use
NVMF_AQ_DEPTH for connect() and RDMA private data
as the minimum depth specified in the NVMe-over-Fabrics 1.0 spec
(and in that RDMA private data we treat hrqsize as 1's-based
value, per current understanding of the fabrics spec).

Reported-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-18 09:58:05 +03:00
Colin Ian King 39bbee4e54 nvme-rdma: initialize ret to zero to avoid returning garbage
ret is not initialized so it contains garbage.  Ensure garbage
is not returned by initializing rc to 0.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-16 11:42:29 +03:00
Sagi Grimberg e3266378bd nvme-rdma: Remove unused includes
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
2016-08-04 17:47:54 +03:00
Sagi Grimberg 3ef1b4b298 nvme-rdma: start async event handler after reconnecting to a controller
When we reset or reconnect to a controller, we are cancelling the
async event handler so we can safely re-establish resources, but we
need to remember to start it again when we successfully reconnect.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-04 17:45:35 +03:00
Sagi Grimberg 45862ebcc4 nvme-rdma: Make sure to shutdown the controller if we can
Relying on ctrl state in nvme_rdma_shutdown_ctrl is wrong because
it will never be NVME_CTRL_LIVE (delete_ctrl or reset_ctrl invoked it).

Instead, check that the admin queue is connected. Note that it is safe
because we can never see a copmeting thread trying to destroy the admin
queue (reset or delete controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:23 +03:00
Sagi Grimberg a34ca17a97 nvme-rdma: Free the I/O tags when we delete the controller
If we wait until we free the controller (free_ctrl) we might
lose our rdma device without any notification while we still
have open resources (tags mrs and dma mappings).

Instead, destroy the tags with their rdma resources once we
delete the device and not when freeing it.

Note that we don't do that in nvme_rdma_shutdown_ctrl because
controller reset uses it as well and we want to give active I/O
a chance to complete successfully.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:16 +03:00
Sagi Grimberg 2461a8dd38 nvme-rdma: Remove duplicate call to nvme_remove_namespaces
nvme_uninit_ctrl already does that for us. Note that we reordered
nvme_rdma_shutdown_ctrl and nvme_uninit_ctrl, this is perfectly
fine because we actually want ctrl uninit (aen, scan cancellation
and namespaces removal) to happen before we shutdown the rdma
resources.

Also, centralize the deletion work and the dead controller removal
work code duplication into __nvme_rdma_shutdown_ctrl that accepts
a shutdown boolean.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:11 +03:00
Sagi Grimberg 57de5a0a40 nvme-rdma: Fix device removal handling
Device removal sequence may have crashed because the
controller (and admin queue space) was freed before
we destroyed the admin queue resources. Thus we
want to destroy the admin queue and only then queue
controller deletion and wait for it to complete.

More specifically we:
1. own the controller deletion (make sure we are not
   competing with another deletion).
2. get rid of inflight reconnects if exists (which
   also destroy and create queues).
3. destroy the queue.
4. safely queue controller deletion (and wait for it
   to complete).

Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:07 +03:00
Sagi Grimberg 5f372eb3e7 nvme-rdma: Queue ns scanning after a sucessful reconnection
On an ordered target shutdown, the target can send a AEN on a namespace
removal, this will trigger the host to queue ns-list query. The shutdown
will trigger error recovery which will attepmt periodic reconnect.

We can hit a race where the ns rescanning fails (error recovery kicked
in and we're not connected) causing removing all the namespaces and when
we reconnect we won't see any namespaces for this controller.

So, queue a namespace rescan after we successfully reconnected to the target.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-08-03 16:25:03 +03:00
Roland Dreier 0b857b44b5 nvme-rdma: Don't leak uninitialized memory in connect request private data
Zero out the full nvme_rdma_cm_req structure before sending it.
Otherwise we end up leaking kernel memory in the reserved field, which
might break forward compatibility in the future.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-08-03 16:24:57 +03:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Keith Busch 13880f5b57 nvme/pci: Provide SR-IOV support
This registers an sr-iov callback for nvme.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:29:59 -06:00
Jay Freyensee fa9a89fc66 nvme: initialize variable before logical OR'ing it
It is typically not good coding or secure coding practice
to logical OR a variable without an initialization value first.
Here on this line:

integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;

BLK_INTEGRITY_DEVICE_CAPABLE is being OR'ed to a member variable
never set to an initial value. This patch fixes that.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:26:16 -06:00
Christoph Hellwig 0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
Christoph Hellwig 70246286e9 block: get rid of bio_rw and READA
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:01 -06:00
NeilBrown b09dcf585d NVMe: don't allocate unused nvme_major
When alloc_disk(0) is used, the ->major number is ignored.  All device
numbers are allocated with a major of BLOCK_EXT_MAJOR.

So remove all references to nvme_major.

[akpm@linux-foundation.org: one unregister_blkdev() was missed]
Link: http://lkml.kernel.org/r/20160602064318.4403.93301.stgit@noble
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:52:56 -07:00
Keith Busch 32f0c4afb4 nvme: Remove RCU namespace protection
We can't sleep with RCU read lock held, but we need to do potentially
blocking stuff to namespace queues when iterating the list. This patch
removes the RCU locking and holds a mutex instead.

To prevent deadlocks, this patch removes holding the mutex during
namespace scanning and removal. The unlocked namespace scanning is made
safe by holding a reference to the namespace being scanned.

List iteration that does IO has to be unlocked to allow error recovery.
The caller must ensure the list can not be manipulated during such an
event, so this patch adds a comment explaining this requirement to the
only function that iterates an unlocked list. All callers currently
meet this requirement, so no further changes required.

List iterations that do not do IO can safely use the lock since it couldn't
block recovery from missing forced IO completions.

Reported-by: Ming Lin <mlin at kernel.org>
[fixes 0bf77e9 nvme: switch to RCU freeing the namespace]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:48:08 -07:00
Masayoshi Mizuma 2fa843512b nvme: avoid crashes when node 0 is memoryless node.
When CONFIG_NUMA is enabled and node 0 is memoryless, the system
crashes because nvme_probe() sets the device->numa_node to 0 by
set_dev_node(&pdev->dev, 0), so it tries to allocate memory from node 0.
To avoid the crash, we should change the 0 to first_memory_node.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-13 09:17:55 -07:00
Keith Busch f80ec966c1 nvme: Limit command retries
Many controller implementations will return errors to commands that will
not succeed, but without the DNR bit set. The driver previously retried
these commands an unlimited number of times until the command timeout
has exceeded, which takes an unnecessarilly long period of time.

This patch limits the number of retries a command can have, defaulting
to 5, but is user tunable at load or runtime.

The struct request's 'retries' field is used to track the number of
retries attempted. This is in contrast with scsi's use of this field,
which indicates how many retries are allowed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 16:20:31 -07:00
Ming Lin e76debd996 nvme-fabrics: add-remove ctrl repeat fix
Repeatedly adding then removing the same NVMe-over-Fabrics controller
over and over again (shown below) can cause a kernel crash (also shown
below).  This patch fixes that.

[nvmf]# ./setup_nvme_connections.sh
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside
-nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=lightside
-nqn,hostnqn=good-wins-nqn > /dev/nvme-fabrics
[nvmf]# ./remove_nvme_connections.sh 2
echo 1 > /sys/class/nvme/nvme0/delete_controller
echo 1 > /sys/class/nvme/nvme1/delete_controller
[nvmf]# ./setup_nvme_connections.sh
traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside
-nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics
Killed

[nvmf]# dmesg
[  313.416908] nvme nvme0: creating 16 I/O queues.
[  313.523908] nvme nvme0: new ctrl: NQN "darkside-nqn", addr
192.168.1.100:4420
[  313.524857] BUG: unable to handle kernel NULL pointer dereference at
0000000000000010
[  313.525262] IP: [<ffffffff8136c60e>] strcmp+0xe/0x30
[  313.525490] PGD 0
[  313.525726] Oops: 0000 [#1] SMP
[  313.525900] Modules linked in: nvme_rdma nvme_fabrics nvme_core
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_en
mlx4_ib ib_core mlx4_core
[  313.527085] CPU: 15 PID: 5856 Comm: setup_nvme_conn Not tainted
4.7.0-rc2+ #2
[  313.527259] Hardware name: Supermicro X9DRT-F/IBQF/IBFF/X9DRT
-F/IBQF/IBFF, BIOS 1.0a 10/09/2012
[  313.527551] task: ffff88027646cd40 ti: ffff88025b980000 task.ti:
ffff88025b980000
[  313.527879] RIP: 0010:[<ffffffff8136c60e>]  [<ffffffff8136c60e>]
strcmp+0xe/0x30
[  313.528232] RSP: 0018:ffff88025b983db0  EFLAGS: 00010206
[  313.528403] RAX: 0000000000000000 RBX: ffff880471879880 RCX:
fffffffffffffff1
[  313.528594] RDX: 0000000000000000 RSI: ffff880474afa860 RDI:
0000000000000011
[  313.528778] RBP: ffff88025b983db0 R08: ffff880474afa860 R09:
ffff880471879058
[  313.528956] R10: 000000000000002c R11: ffff88047f415000 R12:
ffff880471879800
[  313.529129] R13: ffff880471879000 R14: ffff880474afa860 R15:
fffffffffffffff8
[  313.529303] FS:  00007f778f510700(0000) GS:ffff88047fbc0000(0000)
knlGS:0000000000000000
[  313.529629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  313.529817] CR2: 0000000000000010 CR3: 0000000274174000 CR4:
00000000000406e0
[  313.529989] Stack:
[  313.530154]  ffff88025b983e48 ffffffffa0171c74 0000000000000001
0000000000000059
[  313.530621]  ffff880476f32400 ffff88047e8add80 0000010074b33aa0
ffff880471879059
[  313.531162]  ffff88047187904b ffff880471879058 0000000000000000
ffff88047736e000
[  313.531629] Call Trace:
[  313.531797]  [<ffffffffa0171c74>] nvmf_dev_write+0x674/0x840
[nvme_fabrics]
[  313.531974]  [<ffffffff81180b53>] __vfs_write+0x23/0x120
[  313.532146]  [<ffffffff8119daff>] ? __fd_install+0x1f/0xc0
[  313.532316]  [<ffffffff8119d97a>] ? __alloc_fd+0x3a/0x170
[  313.532487]  [<ffffffff811811f3>] vfs_write+0xb3/0x1b0
[  313.532658]  [<ffffffff8117e321>] ? filp_close+0x51/0x70
[  313.532845]  [<ffffffff811824e1>] SyS_write+0x41/0xa0
[  313.533016]  [<ffffffff8183055b>]
entry_SYSCALL_64_fastpath+0x13/0x8f
[  313.533188] Code: 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 c2 01
84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 84 c0 74 18 48 83
c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 74 eb 19 c0 83 c8 01 5d c3 31
[  313.536563] RIP  [<ffffffff8136c60e>] strcmp+0xe/0x30
[  313.536815]  RSP <ffff88025b983db0>
[  313.536981] CR2: 0000000000000010
[  313.537151] ---[ end trace 3d952e590e7bc2d5 ]---

Reported-and-tested-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <mlin@kernel.org>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:32:19 -07:00
Sagi Grimberg 6a92967ccb nvme-fabrics: Remove tl_retry_count
The timeout before error recovery logic kicks in is
dictated by the nvme keep-alive, so we don't really need
a transport layer retry count. transports can retry for
as much as they like.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:31:11 -07:00
Sagi Grimberg 2ac17c283a nvme-rdma: Don't use tl_retry_count
Always use the maximum qp retry count as the
error recovery timeout is dictated from the nvme
keep-alive.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:31:10 -07:00
Wei Yongjun 458a9632ad nvme-rdma: fix the return value of nvme_rdma_reinit_request()
PTR_ERR should be applied before its argument is reassigned, otherwise the
return value will be set to 0, not error code.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:27:03 -07:00
Guilherme G. Piccoli 54adc01055 nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the register
NVME_REG_CC should be written and then driver needs to wait the
adapter to be ready, which is checked by reading another register
bit (NVME_CSTS_RDY). There's a timeout validation in this checking,
so in case this timeout is reached the driver gives up and removes
the adapter from the system.

After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) end up being removed if we issue a reset_controller,
because driver keeps verifying the NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter, by
introducing a delay before nvme_wait_ready(), so the reset procedure
is able to be completed. This quirk is needed because just increasing
the timeout is not enough in case of this adapter - the driver must
wait before start reading NVME_REG_CSTS register on this specific
device.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:23:00 -07:00
Jens Axboe 41d512e51b Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers
Dan writes:

"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk().  See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.

It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1].  We would extend device_add_disk()
to take an attribute_group list.

This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.

[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
2016-07-08 16:04:11 -06:00
Christoph Hellwig 7110230719 nvme-rdma: add a NVMe over Fabrics RDMA host driver
This patch implements the RDMA host (initiator in SCSI speak) driver.  It
can be used to connect to remote NVMe over Fabrics controllers over
Infiniband, RoCE or iWarp, and uses the existing NVMe core driver as well
a the new fabrics library.

To connect to all NVMe over Fabrics controller reachable on a given taget
port using RDMA/CM use the following command:

	nvme connect-all -t rdma -a $IPADDR

This requires the latest version of nvme-cli with Fabrics support.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Christoph Hellwig def61eca96 nvme: add new reconnecting controller state
The nvme fabric (RDMA, FC, etc...) can introduce port, link or node
failures that may require a reconnect to re-establish the connection.

Add a new reconnecting state that will initially be used by the RDMA
driver.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Johannes Thumshirn 6f92970219 nvme: lightnvm: make MLC num_pairs little endian
According to the OpenChannel SSD interface specification the NAND flash
MLC page pairing information's number of page page pairings field is the
first two bytes in the MLC Page Pairing data structure. The hardware's
data structure itself is little endian so annotate it as such, like the
rest of lighnvm's data structures.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Sagi Grimberg 038bd4cb67 nvme: add keep-alive support
Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and
optional in NVMe 1.2.1 for PCIe.  This patch adds periodic keep-alive
sent from the host to verify that the controller is still responsive
and vice-versa.  The keep-alive timeout is user-defined (with
keep_alive_tmo connection parameter) and defaults to 5 seconds.

In order to avoid a race condition where the host sends a keep-alive
competing with the target side keep-alive timeout expiration, the host
adds a grace period of 10 seconds when publishing the keep-alive timeout
to the target.

In case a keep-alive failed (or timed out), a transport specific error
recovery kicks in.

For now only NVMe over Fabrics is wired up to support keep alive, but
we can add PCIe support easily once controllers actually supporting it
become available.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:20 -06:00
Christoph Hellwig 07bfcd09a2 nvme-fabrics: add a generic NVMe over Fabrics library
The NVMe over Fabrics library provides an interface for both transports
and the nvme core to handle fabrics specific commands and attributes
independent of the underlying transport.

In addition, the fabrics library adds a misc device interface that allow
actually creating a fabrics controller, as we can't just autodiscover
it like in the PCI case.  The nvme-cli utility has been enhanced to use
this interface to support fabric connect and discovery.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>,
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>,
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:16 -06:00
Christoph Hellwig eb793e2c92 nvme.h: add NVMe over Fabrics definitions
The NVMe over Fabrics specification defines a protocol interface and
related extensions to NVMe that enable operation over network protocols.
The NVMe over Fabrics specification has an NVMe Transport binding for
each NVMe Transport.

This patch adds the fabrics related definitions:
- fabric specific command set and error codes
- transport addressing and binding definitions
- fabrics sgl extensions
- controller identification fabrics enhancements
- discovery log page definition

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:14 -06:00
Ming Lin 1a353d85b0 nvme: add fabrics sysfs attributes
- delete_controller: This attribute allows to delete a controller.
  A driver is not obligated to support it (pci doesn't) so it is
  created only if the driver supports it. The new fabrics drivers
  will support it (essentialy a disconnect operation).

  Usage:
  echo > /sys/class/nvme/nvme0/delete_controller

- subsysnqn: This attribute shows the subsystem nqn of the configured
  device. If a driver does not implement the get_subsysnqn method, the
  file will not appear in sysfs.

- transport: This attribute shows the transport name. Added a "name"
  field to struct nvme_ctrl_ops.

  For loop,
  cat /sys/class/nvme/nvme0/transport
  loop

  For RDMA,
  cat /sys/class/nvme/nvme0/transport
  rdma

  For PCIe,
  cat /sys/class/nvme/nvme0/transport
  pcie

- address: This attributes shows the controller address. The fabrics
  drivers that will implement get_address can show the address of the
  connected controller.

  example:
  cat /sys/class/nvme/nvme0/address
  traddr=192.168.2.2,trsvcid=1023

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:12 -06:00
Christoph Hellwig eb71f43557 nvme: Modify and export sync command submission for fabrics
NVMe over fabrics will use __nvme_submit_sync_cmd in the the
transport and require a few tweaks to it.  For that we export it
and add a few more paramters:

1. allow passing a queue ID to the block layer

   For the NVMe over Fabrics connect command we need to able to specify a
   queue ID that we want to send the command on.  Add a qid parameter to
   the relevant functions to enable this behavior.

2. allow submitting at_head commands

   In cases where we want to (re)connect to a controller
   where we have inflight queued commands we want to first
   connect and only then allow the other queued commands to
   be kicked. This will prevents failures in controller resets
   and reconnects.

3. allow passing flags to blk_mq_allocate_request

   Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we
   want to be able to use reserved requests.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:11 -06:00
Christoph Hellwig 7d2e80080d nvme: allow transitioning from NEW to LIVE state
For Fabrics we're not going through an intermediate reset state
(at least for now).

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:09 -06:00
Dan Williams 0d52c756a6 block: convert to device_add_disk()
For block drivers that specify a parent device, convert them to use
device_add_disk().

This conversion was done with the following semantic patch:

    @@
    struct gendisk *disk;
    expression E;
    @@

    - disk->driverfs_dev = E;
    ...
    - add_disk(disk);
    + device_add_disk(E, disk);

    @@
    struct gendisk *disk;
    expression E1, E2;
    @@

    - disk->driverfs_dev = E1;
    ...
    E2 = disk;
    ...
    - add_disk(E2);
    + device_add_disk(E1, E2);

...plus some manual fixups for a few missed conversions.

Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27 12:26:08 -07:00
Christoph Hellwig f5fa90dc0a nvme: move the workaround for I/O queue-less controllers from PCIe to core
We want to apply this to Fabrics drivers as well, so move it to common
code.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig 7a5abb4b48 nvme: factor out a add nvme_is_write helper
Centralize the check if a given NVMe command reads or writes data.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Christoph Hellwig a229dbf61e nvme: allow for size limitations from transport drivers
Some transport drivers may have a lower transfer size than
the controller. So allow the transport to set it in the
controller max_hw_sectors.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-12 07:29:43 -06:00
Johannes Thumshirn edb50a5403 NVMe: Only release requested regions
The NVMe driver only requests the PCIe device's memory regions but releases
all possible regions (including eventual I/O regions). This leads to a stale
warning entry in dmesg about freeing non existent resources.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 14:28:28 -06:00
Sunad Bhandary 47b0e50ac7 NVMe: Fix removal in case of active namespace list scanning method
In case of the active namespace list scanning method, a namespace that
is detached is not removed from the host if it was the last entry in
the list. Fix this by adding a scan to validate namespaces greater than
the value of prev.

This also handles the case of removing namespaces whose value exceed
the device's reported number of namespaces.

Signed-off-by: Sunad Bhandary S <sunad.s@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-08 08:48:47 -06:00
Minfei Huang bd0fc2884c nvme: use UINT_MAX for max discard sectors
It's more elegant to use UINT_MAX to represent the max value of
type unsigned int. So replace the actual value by using this define.

Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 22:23:12 -06:00