Commit Graph

342 Commits

Author SHA1 Message Date
Bart Van Assche 1c4665272c nvmet-fc: fix kernel-doc headers
This patch prevents the kernel-doc tool from complaining about two function
headers when building with W=1.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:30 +02:00
Bart Van Assche 8eacd1bd21 nvmet: avoid integer overflow in the discard code
Although I'm not sure whether it is a good idea to support large discard
commands, I think integer overflow for discard ranges larger than 4 GB
should be avoided. This patch avoids that smatch reports the following:

drivers/nvme/target/io-cmd-file.c:249:1 nvmet_file_execute_discard() warn: should '((range.nlb)) << req->ns->blksize_shift' be a 64 bit type?

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:27 +02:00
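The overflow described in 8eacd1bd21 above can be illustrated with a small user-space sketch. It is not the kernel code: the helper names `discard_len_32` and `discard_len_64` are hypothetical, and only the shape of the expression (`nlb << blksize_shift`) is taken from the smatch report.

```c
#include <assert.h>
#include <stdint.h>

/* Byte length computed in 32 bits: wraps once nlb << shift reaches 4 GiB. */
static uint32_t discard_len_32(uint32_t nlb, unsigned int blksize_shift)
{
	assert(blksize_shift < 32);
	return nlb << blksize_shift;
}

/* Widen to 64 bits before shifting so the byte length cannot wrap. */
static uint64_t discard_len_64(uint32_t nlb, unsigned int blksize_shift)
{
	assert(blksize_shift < 32);
	return (uint64_t)nlb << blksize_shift;
}
```

With 512-byte blocks (shift 9), a range of 0x800000 blocks is exactly 4 GiB: the 32-bit variant yields 0, while the widened variant yields 0x100000000.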
Bart Van Assche 0d3ebdec93 nvmet-rdma: declare local symbols static
This patch avoids that sparse complains about missing declarations.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:27 +02:00
Bart Van Assche 5eadc9cce1 nvmet: use strlcpy() instead of strcpy()
Although the code modified by this patch looks fine to me, this patch avoids
that Coverity reports the following complaint (ID 1364971 and ID 1364973):
"You might overrun the 256-character fixed-size string id->subnqn".

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:26 +02:00
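For readers unfamiliar with the semantics that 5eadc9cce1 above relies on, here is a minimal user-space sketch of a bounded copy that behaves like strlcpy(): it never writes more than `size` bytes, always NUL-terminates when `size` is nonzero, and returns the length of the source. The name `bounded_copy` is hypothetical; the kernel's own strlcpy() is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, truncating to fit; returns strlen(src). */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);
	if (size > 0) {
		/* Leave room for the terminating NUL. */
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}
```

A return value greater than or equal to `size` tells the caller that truncation happened, which strcpy() cannot report.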
Bart Van Assche 43a6f8fb61 nvmet: use strcmp() instead of strncmp() for subsystem lookup
strncmp() stops comparing when either the end of one of the first two arguments
is reached or when 'n' characters have been compared, whichever comes first.
That means that strncmp(s1, s2, n) is equivalent to strcmp(s1, s2) if n exceeds
the length of s1 or the length of s2. Since that is the case in
nvmet_find_get_subsys(), change strncmp() into strcmp(). This patch avoids that
the following warning is reported by smatch:

drivers/nvme/target/core.c:940:1 nvmet_find_get_subsys() error: strncmp() '"nqn.2014-08.org.nvmexpress.discovery"' too small (37 vs 223)

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:24 +02:00
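The equivalence argued in 43a6f8fb61 above can be checked with a short user-space sketch. The helper names `sign` and `lookups_agree` are hypothetical; the bound 223 is taken from the smatch report, and the precondition (the bound exceeds the length of at least one string) is the one stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int sign(int v)
{
	return (v > 0) - (v < 0);
}

/*
 * Returns 1 when strncmp(a, b, n) and strcmp(a, b) order the strings
 * identically. This must hold whenever n exceeds the length of a or b,
 * because the comparison then ends at a NUL before n is reached.
 */
static int lookups_agree(const char *a, const char *b, size_t n)
{
	assert(n > strlen(a) || n > strlen(b));
	return sign(strncmp(a, b, n)) == sign(strcmp(a, b));
}
```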
Chaitanya Kulkarni 48440ab6dc nvmet: remove unreachable code
Get rid of the unreachable code in the nvmet_parse_discovery_cmd().
Keep the error message identical to the admin-cmd.c and io-cmd*.c

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:23 +02:00
Sagi Grimberg 2acf70ade7 nvmet-rdma: use a private workqueue for delete
Queue deletion is done asynchronously when the last reference on the queue
is dropped.  Thus, in order to make sure we don't over allocate under a
connect/disconnect storm, we let queue deletion complete before making
forward progress.

However, given that we flush the system_wq from rdma_cm context which
runs from a workqueue context, we can have a circular locking complaint
[1]. Fix that by using a private workqueue for queue deletion.

[1]:
======================================================
WARNING: possible circular locking dependency detected
4.19.0-rc4-dbg+ #3 Not tainted
------------------------------------------------------
kworker/5:0/39 is trying to acquire lock:
00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm]

but task is already holding lock:
00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 ((work_completion)(&queue->release_work)){+.+.}:
       process_one_work+0x474/0xa20
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #2 ((wq_completion)"events"){+.+.}:
       flush_workqueue+0xf3/0x970
       nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma]
       cma_ib_req_handler+0x72f/0xf90 [rdma_cm]
       cm_process_work+0x2e/0x110 [ib_cm]
       cm_req_handler+0x135b/0x1c30 [ib_cm]
       cm_work_handler+0x2b7/0x38cd [ib_cm]
       process_one_work+0x4ae/0xa20
nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000040357082
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30
nvme nvme0: Reconnecting in 10 seconds...

-> #1 (&id_priv->handler_mutex/1){+.+.}:
       __mutex_lock+0xfe/0xbe0
       mutex_lock_nested+0x1b/0x20
       cma_ib_req_handler+0x6aa/0xf90 [rdma_cm]
       cm_process_work+0x2e/0x110 [ib_cm]
       cm_req_handler+0x135b/0x1c30 [ib_cm]
       cm_work_handler+0x2b7/0x38cd [ib_cm]
       process_one_work+0x4ae/0xa20
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #0 (&id_priv->handler_mutex){+.+.}:
       lock_acquire+0xc5/0x200
       __mutex_lock+0xfe/0xbe0
       mutex_lock_nested+0x1b/0x20
       rdma_destroy_id+0x6f/0x440 [rdma_cm]
       nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
       process_one_work+0x4ae/0xa20
       worker_thread+0x63/0x5a0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

Fixes: 777dc82395 ("nvmet-rdma: occasionally flush ongoing controller teardown")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Bart Van Assche <bvanassche@acm.org>

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-05 09:25:18 +02:00
Sagi Grimberg 73383adfad nvmet: don't split large I/Os unconditionally
If we know that the I/O size exceeds our inline bio vec, there is no
point in using it and splitting the rest to begin with. We could
in theory reuse the inline bio and only allocate the bio_vec,
but it's really not worth optimizing for.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:13 -07:00
James Smart ea96d6496f nvmet_fc: support target port removal with nvmet layer
Currently, if a targetport has been connected to via the nvmet config
(in other words, the add_port() transport routine called, and the nvmet
port pointer stored for using in upcalls on new io), and if the
targetport is then removed (say the lldd driver decides to unload or
fully reset its hardware) and then re-added (the lldd driver reloads or
reinits its hardware), the port pointer has been lost so there's no way
to continue to post commands up to nvmet via the transport port.

Correct by allocating a small "port context" structure that will be
linked to by the targetport. The context will save the targetport WWN's
and the nvmet port pointer to use for it.  Initial allocation will occur
when the targetport is bound to via add_port.  The context will be
deallocated when remove_port() is called.  If a targetport is removed
while nvmet has the active port context, the targetport will be unlinked
from the port context before removal.  If a new targetport is registered,
the port contexts without a binding are looked through and if the WWN's
match (so it's the same as nvmet's port context) the port context is
linked to the new target port.  Thus new io can be received on the new
targetport and operation resumes with nvmet.

Additionally, this also resolves nvmet configuration changing out from
underneath of the nvme-fc target port (for example: a nvmetcli clear).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:10 -07:00
Milan P. Gandhi d4e4230c8f nvme-fc: fix for a minor typos
Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:09 -07:00
Chaitanya Kulkarni d93cb3927c nvmet: remove redundant module prefix
This patch removes the redundant module prefix used in the pr_err() when
nvmet_get_smart_log_nsid() failed to find the namespace provided as a part
of smart-log command.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:09 -07:00
Hannes Reinecke be1277f5eb nvme: count all ANA groups for ANA Log page
When issuing a short read on the ANA log page the number of groups
should not change, even though the final returned data might contain
fewer groups than that number.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[switched to a for loop]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-17 15:49:40 +02:00
Sagi Grimberg 8407879c4e nvmet-rdma: fix possible bogus dereference under heavy load
Currently we always repost the recv buffer before we send a response
capsule back to the host. Since ordering is not guaranteed for send
and recv completions, it is possible that we will receive a new request
from the host before we got a send completion for the response capsule.

Today, we pre-allocate 2x rsps the length of the queue, but in reality,
under heavy load there is nothing that is really preventing the gap to
expand until we exhaust all our rsps.

To fix this, if we don't have any pre-allocated rsps left, we dynamically
allocate a rsp and make sure to free it when we are done. If under memory
pressure we fail to allocate a rsp, we silently drop the command and
wait for the host to retry.

Reported-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: dropped a superfluous assignment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-05 12:18:01 -07:00
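The fallback strategy described in 8407879c4e above (use a preallocated rsp when one is free, otherwise allocate one dynamically and free it on completion) can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct rsp`, `struct rsp_pool`, `rsp_get`, `rsp_put` and `POOL_SIZE` are hypothetical names, malloc() stands in for the kernel allocator, and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define POOL_SIZE 4 /* stand-in for the preallocated 2x queue-depth rsps */

struct rsp {
	bool allocated; /* true if it came from malloc(), not the pool */
	bool in_use;
};

struct rsp_pool {
	struct rsp slots[POOL_SIZE];
};

/* Prefer a preallocated rsp; fall back to a dynamic one when exhausted. */
static struct rsp *rsp_get(struct rsp_pool *pool)
{
	struct rsp *rsp;
	int i;

	for (i = 0; i < POOL_SIZE; i++) {
		if (!pool->slots[i].in_use) {
			pool->slots[i].in_use = true;
			pool->slots[i].allocated = false;
			return &pool->slots[i];
		}
	}

	rsp = malloc(sizeof(*rsp));
	if (!rsp)
		return NULL; /* caller drops the command; the host retries */
	rsp->allocated = true;
	rsp->in_use = true;
	return rsp;
}

/* Return a pool rsp to the pool, or free a dynamically allocated one. */
static void rsp_put(struct rsp *rsp)
{
	assert(rsp != NULL && rsp->in_use);
	if (rsp->allocated)
		free(rsp);
	else
		rsp->in_use = false;
}
```

Tagging each rsp with its origin keeps the release path simple: the completion handler does not need to know whether the pool was exhausted when the request arrived.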
Chaitanya Kulkarni 04db0e5ec5 nvmet: free workqueue object if module init fails
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28 08:40:44 +02:00
James Smart afd299ca99 nvme-fcloop: Fix dropped LS's to removed target port
When a targetport is removed from the config, fcloop will avoid calling
the LS done() routine thinking the targetport is gone. This leaves the
initiator reset/reconnect hanging as it waits for a status on the
Create_Association LS for the reconnect.

Change the filter in the LS callback path. If tport is null (set when
failed validation before "sending to remote port"), be sure to call
done. This was the main bug. But, continue the logic that only calls
done if tport was set but there is no remoteport (e.g. case where
remoteport has been removed, thus host doesn't expect a completion).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28 08:40:43 +02:00
Jason Gunthorpe 0a3173a5f0 Merge branch 'linus/master' into rdma.git for-next
rdma.git merge resolution for the 4.19 merge window

Conflicts:
 drivers/infiniband/core/rdma_core.c
   - Use the rdma code and revise with the new spelling for
     atomic_fetch_add_unless
 drivers/nvme/host/rdma.c
   - Replace max_sge with max_send_sge in new blk code
 drivers/nvme/target/rdma.c
   - Use the blk code and revise to use NULL for ib_post_recv when
     appropriate
   - Replace max_sge with max_recv_sge in new blk code
 net/rds/ib_send.c
   - Use the net code and revise to use NULL for ib_post_recv when
     appropriate

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 14:21:29 -06:00
Jason Gunthorpe 89982f7cce Linux 4.18
Merge tag 'v4.18' into rdma.git for-next

Resolve merge conflicts from the -rc cycle against the rdma.git tree:

Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified
    in for-rc

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 13:12:00 -06:00
Linus Torvalds 73ba2fb33c for-4.19/block-20180812
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering herd issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log support for NVMe target (Chaitanya Kulkarni)
      * Buffered IO support for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Matias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Chaitanya Kulkarni dedf0be544 nvmet: add ns write protect support
This patch implements the Namespace Write Protect feature described in
"NVMe TP 4005a Namespace Write Protect". In this version, we implement
No Write Protect and Write Protect states for target ns which can be
toggled by set-features commands from the host side.

For write-protect state transition, we need to flush the ns specified
as a part of command so we also add helpers for carrying out synchronous
flush operations.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: fixed an incorrect endianness conversion, minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-08 12:00:53 +02:00
Chaitanya Kulkarni b369b30cf5 nvmet: use Retain Async Event bit to clear AEN
In the current implementation, we clear the AEN bit when we get the
"get log page" command if given log page is associated with AEN.
This patch allows optionally retaining the AEN for the ctrl
under consideration when Retain Asynchronous Event (RAE) bit is set
as a part of "get log page" command.

This allows the host to read the log page and optionally retain the
AEN associated with this log page when using userspace tools like
nvme-cli.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: also use the new helper in the just merged ANA code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-27 19:14:31 +02:00
Christoph Hellwig 62ac0d32f7 nvmet: support configuring ANA groups
Allow creating non-default ANA groups (group ID > 1).  Groups are created
either by assigning the group ID to a namespace, or by creating a configfs
group object under a specific port.  All namespaces assigned to a group
that doesn't have a configfs object for a given port are marked as
inaccessible.

Allow changing the ANA state on a per-port basis by creating an
ana_groups directory under each port, and another directory with an
ana_state file in it.  The default ANA group 1 directory is created
automatically for each port.

For all changes in ANA configuration the ANA change AEN is sent.  We only
keep a global changecount instead of additional per-group changecounts to
keep the implementation as simple as possible.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:13:06 +02:00
Christoph Hellwig 72efd25dcf nvmet: add minimal ANA support
Add support for Asymmetric Namespace Access as specified in NVMe 1.3
TP 4004.

Just add a default ANA group 1 that is optimized on all ports.  This is
(and will remain) the default assignment for any namespace not explicitly
assigned to another ANA group.  The ANA state can be manually changed
through the configfs interface, including the change state.

Includes fixes and improvements from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:13:02 +02:00
Christoph Hellwig 793c7cfce0 nvmet: track and limit the number of namespaces per subsystem
TP 4004 introduces a new 'Maximum Number of Allocated Namespaces' field
in the Identify controller data to help the host size resources.  Put
an upper limit on the supported namespaces to be able to support this
value as supporting 32-bits worth of namespaces would lead to very
large buffers.  The limit is completely arbitrary at this point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:13:01 +02:00
Christoph Hellwig 4ee4328048 nvmet: keep a port pointer in nvmet_ctrl
This will be needed for the ANA AEN code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:52 +02:00
Hannes Reinecke 405a751960 nvmet: only check for filebacking on -ENOTBLK
We only need to check for a file-backed namespace if
nvmet_bdev_ns_enable() returns -ENOTBLK. For any other error
it's pointless as the open() error will remain the same.

Fixes: d5eff33e ("nvmet: add simple file backed ns support")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-25 13:14:04 +02:00
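The control flow that 405a751960 above describes can be sketched as follows. This is a user-space illustration, not the nvmet code: `ns_enable` and the stub backends are hypothetical, and only the rule "try the file backend solely when the block backend reports -ENOTBLK" comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*enable_fn)(void);

/* Hypothetical stubs standing in for the two backends. */
static int bdev_ok(void)       { return 0; }
static int bdev_not_blk(void)  { return -ENOTBLK; }
static int bdev_no_entry(void) { return -ENOENT; }
static int file_ok(void)       { return 0; }

/*
 * Try the block-device backend first. Fall back to the file backend only
 * when the path exists but is not a block device; any other error would
 * be repeated by the file open, so it is returned as-is.
 */
static int ns_enable(enable_fn bdev_enable, enable_fn file_enable)
{
	int ret;

	assert(bdev_enable != NULL && file_enable != NULL);
	ret = bdev_enable();
	if (ret == -ENOTBLK)
		ret = file_enable();
	return ret;
}
```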
Hannes Reinecke 5613d31214 nvmet: fixup crash on NULL device path
When writing an empty string into the device_path attribute the kernel
will crash with

nvmet: failed to open block device (null): (-22)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000

This patch sanitizes the error handling for invalid device path settings.

Fixes: a07b4970 ("nvmet: add a generic NVMe target")
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-25 13:14:03 +02:00
Bart Van Assche 23f96d1f15 nvmet-rdma: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 16:06:36 -06:00
Andy Shevchenko 1b0d274523 nvmet: don't use uuid_le type
Don't use sizeof(uuid_le) where none of the parameters is of type uuid_le.
Since both arguments are u8 [16], use size of destination there.

Moreover, uuid_le is a deprecated type, and nvmet is using uuid_t
already.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:51 +02:00
Sagi Grimberg 9c891c1398 nvmet: check fileio lba range access boundaries
Fail out-of-bounds with a proper status code.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:51 +02:00
Sagi Grimberg 1b72b71fac nvmet: fix file discard return status
If nvmet_copy_from_sgl failed, we falsely return successful
completion status.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:50 +02:00
James Smart 6cdefc6e2a nvme: if_ready checks to fail io to deleting controller
The revised if_ready checks skipped over the case of returning error when
the controller is being deleted.  Instead it was returning BUSY, which
caused the ios to retry, which caused the ns delete to hang waiting for
the ios to drain.

Stack trace of hang looks like:
 kworker/u64:2   D    0    74      2 0x80000000
 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
 Call Trace:
  ? __schedule+0x26d/0x820
  schedule+0x32/0x80
  blk_mq_freeze_queue_wait+0x36/0x80
  ? remove_wait_queue+0x60/0x60
  blk_cleanup_queue+0x72/0x160
  nvme_ns_remove+0x106/0x140 [nvme_core]
  nvme_remove_namespaces+0x7e/0xa0 [nvme_core]
  nvme_delete_ctrl_work+0x4d/0x80 [nvme_core]
  process_one_work+0x160/0x350
  worker_thread+0x1c3/0x3d0
  kthread+0xf5/0x130
  ? process_one_work+0x350/0x350
  ? kthread_bind+0x10/0x10
  ret_from_fork+0x1f/0x30

Extend nvmf_fail_nonready_command() to supply the controller pointer so
that the controller state can be looked at. Fail any io to a controller
that is deleting.

Fixes: 3bc32bb118 ("nvme-fabrics: refactor queue ready check")
Fixes: 35897b920c ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
2018-07-24 13:44:40 +02:00
James Smart d082dc1562 nvmet-fc: fix target sgl list on large transfers
The existing code to carve up the sg list expected an sg element-per-page
which can be very incorrect with iommu's remapping multiple memory pages
to fewer bus addresses. To hit this error required a large io payload
(greater than 256k) and a system that maps on a per-page basis. It's
possible that large ios could get by fine if the system condensed the
sgl list into the first 64 elements.

This patch corrects the sg list handling by specifically walking the
sg list element by element and attempting to divide the transfer up
on a per-sg element boundary. While doing so, it still tries to keep
sequences under 256k, but will exceed that rule if a single sg element
is larger than 256k.

Fixes: 48fa362b6c ("nvmet-fc: simplify sg list handling")
Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 13:44:09 +02:00
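The carving rule in d082dc1562 above (walk the sg list element by element, keep each sequence at or under 256k, but let a single oversized element form its own sequence) can be sketched in user space like this. The function `carve_sequences` and its array-based interface are hypothetical simplifications; the real code operates on struct scatterlist.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEQ_BYTES (256u * 1024u)

/*
 * Group sg element lengths into sequences on element boundaries.
 * A sequence is closed before it would exceed MAX_SEQ_BYTES; an element
 * that is larger than the limit on its own becomes a sequence by itself.
 * Writes each sequence length to seq_len[] and returns the count.
 */
static size_t carve_sequences(const unsigned int *sg_len, size_t nents,
			      unsigned int *seq_len, size_t max_seqs)
{
	size_t nseq = 0;
	unsigned int cur = 0;
	size_t i;

	for (i = 0; i < nents; i++) {
		if (cur > 0 && cur + sg_len[i] > MAX_SEQ_BYTES) {
			assert(nseq < max_seqs);
			seq_len[nseq++] = cur;
			cur = 0;
		}
		cur += sg_len[i];
	}
	if (cur > 0) {
		assert(nseq < max_seqs);
		seq_len[nseq++] = cur;
	}
	return nseq;
}
```

For element lengths of 128k, 128k, 512k and 64k this yields three sequences: 256k, 512k (over the limit because one element is) and 64k.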
Sagi Grimberg 59e29ce66b nvme: cache struct nvme_ctrl reference to struct nvme_request
We will need to reference the controller in the setup and completion
time for tracing and future traffic based keep alive support.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:18 +02:00
Max Gurtovoy 202093848c nvmet-rdma: add an error flow for post_recv failures
Posting receive buffer operation can fail, thus we should make
sure to have an error flow during initialization phase. While
we're here, add a debug print in case of a failure.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:17 +02:00
Max Gurtovoy 2fc464e216 nvmet-rdma: add unlikely check in the fast path
ib_post_send operation should succeed unless something unusual
happened to the ib device.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:16 +02:00
Steve Wise 0d5ee2b2ab nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
The patch enables inline data sizes using up to 4 recv sges, and capping
the size at 16KB or at least 1 page size.  So on a 4K page system, up to
16KB is supported, and for a 64K page system 1 page of 64KB is supported.

We avoid > 0 order page allocations for the inline buffers by using
multiple recv sges, one for each page.  If the device cannot support
the configured inline data size due to lack of enough recv sges, then
log a warning and reduce the inline size.

Add a new configfs port attribute, called param_inline_data_size,
to allow configuring the size of inline data for a given nvmf port.
The maximum size allowed is still enforced by nvmet-rdma with
NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is now max(16KB, PAGE_SIZE).
And the default size, if not specified via configfs, is still PAGE_SIZE.
This preserves the existing behavior, but allows larger inline sizes
for small page systems.  If the configured inline data size exceeds
NVMET_RDMA_MAX_INLINE_DATA_SIZE, a warning is logged and the size is
reduced.  If param_inline_data_size is set to 0, then inline data is
disabled for that nvmf port.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:16 +02:00
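The sizing rules in 0d5ee2b2ab above can be summarized in a small sketch. It is an approximation under stated assumptions, not the driver logic: the function names are hypothetical, a negative `configured` value stands in for "not set via configfs", and reserving one recv sge for the command itself (leaving the rest for one inline page each) is an assumption made for illustration.

```c
#include <assert.h>

#define SZ_16K 16384u

/* max(16KB, PAGE_SIZE), as stated in the commit message. */
static unsigned int max_inline_data_size(unsigned int page_size)
{
	return page_size > SZ_16K ? page_size : SZ_16K;
}

/*
 * configured < 0: not set, default to one page.
 * configured == 0: inline data disabled.
 * Otherwise clamp to the maximum, then to what the device's recv sges
 * can carry (assumption: one sge is reserved for the command).
 */
static unsigned int effective_inline_size(int configured,
					  unsigned int page_size,
					  unsigned int max_recv_sge)
{
	unsigned int size, limit, data_sges;

	assert(page_size > 0 && max_recv_sge >= 1);

	size = configured < 0 ? page_size : (unsigned int)configured;
	limit = max_inline_data_size(page_size);
	if (size > limit)
		size = limit;

	data_sges = max_recv_sge - 1;
	if (size > data_sges * page_size)
		size = data_sges * page_size;
	return size;
}
```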
Chaitanya Kulkarni 55eb942eda nvmet: add buffered I/O support for file backed ns
Add a new "buffered_io" attribute, which disables direct I/O and thus
enables page cache based caching when enabled.  The attribute can only
be changed when the namespace is disabled as the file has to be reopened
for the change to take effect.

The possibly blocking read/write are deferred to a newly introduced
global workqueue.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:14 +02:00
Chaitanya Kulkarni 0866bf0c37 nvmet: add commands supported and effects log page
This patch adds support for Commands Supported and Effects log page
(Log Identifier 05h) for NVMeOF. This also makes it easier to find
which commands are supported, e.g. :-

subnqn    : testnqn1
Admin Command Set
ACS2     [Get Log Page                    ] 00000001
ACS6     [Identify                        ] 00000001
ACS8     [Abort                           ] 00000001
ACS9     [Set Features                    ] 00000001
ACS10    [Get Features                    ] 00000001
ACS12    [Asynchronous Event Request      ] 00000001
ACS24    [Keep Alive                      ] 00000001

NVM Command Set
IOCS0    [Flush                           ] 00000001
IOCS1    [Write                           ] 00000001
IOCS2    [Read                            ] 00000001
IOCS8    [Write Zeroes                    ] 00000001
IOCS9    [Dataset Management              ] 00000001

This particular functionality can be used from the host side to examine
the NVMeOF ctrl commands supported.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:13 +02:00
Max Gurtuvoy d68a90e148 nvmet: reset keep alive timer in controller enable
Controllers that are not yet enabled should not really enforce keep alive
timeouts, but we still want to track a timeout and cleanup in case a host
died before it enabled the controller.  Hence, simply reset the keep
alive timer when the controller is enabled.

Suggested-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-20 14:20:51 +02:00
Steve Wise 33023fb85a IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths.  For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16.  Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.

Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:17:28 -06:00
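The arithmetic behind the example in 33023fb85a above is a ceiling division of the scatter-gather entry count by the per-WR sge limit. A minimal sketch, with the hypothetical helper name `wrs_needed`:

```c
#include <assert.h>

/* Number of work requests needed to carry nents sges, max_sge per WR. */
static unsigned int wrs_needed(unsigned int nents, unsigned int max_sge)
{
	assert(max_sge > 0);
	return (nents + max_sge - 1) / max_sge;
}
```

With 16 entries, a limit of 4 needs 4 WRs while a limit of 16 needs 1, matching the figures in the commit message.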
Christoph Hellwig 3bc32bb118 nvme-fabrics: refactor queue ready check
Move the is_connected check to the fibre channel transport, as it has no
meaning for other transports.  To facilitate this split out a new
nvmf_fail_nonready_command helper that is called by the transport when
it is asked to handle a command on a queue that is not ready.

Also avoid a function call for the queue live fast path by inlining
the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
2018-06-15 11:21:00 +02:00
Chaitanya Kulkarni c42d7a30ab nvmet: free smart-log buffer after use
Free smart-log buffer allocated in the function after use.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-11 16:18:05 +02:00
Sagi Grimberg 9ba2a5cb88 nvmet: filter newlines from user input
We should avoid consuming the newlines in traddr, trsvcid and
device_path. Add minimal processing to make sure they are gone.
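
As a rough userspace illustration of the idea (not the code from this
patch), a value written with a trailing newline can be cut at the first
newline before it is stored. strip_newline() is a hypothetical name.

```c
#include <string.h>

/*
 * Hypothetical helper: terminate the string at the first '\n', so that
 * a value such as "192.168.0.1\n" written by "echo" is stored without
 * the newline. strcspn() returns the length of the prefix that contains
 * no '\n'; if there is none, it returns strlen(s) and nothing changes.
 */
static void strip_newline(char *s)
{
	s[strcspn(s, "\n")] = '\0';
}
```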

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 12:51:10 -06:00
Christoph Hellwig f39ae4719b nvmet: return all zeroed buffer when we can't find an active namespace
Quote from Figure 106 in NVMe 1.3a:

  The Identify Namespace data structure is returned to the host for the
  namespace specified in the Namespace Identifier (CDW1.NSID) field if it
  is an active NSID. If the specified namespace is not an active NSID,
  then the controller returns a zero filled data structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 12:51:09 -06:00
Christoph Hellwig 55fdd6b613 nvmet: mask pending AENs
Per section 5.2 of the NVMe 1.3 spec:

  "When the controller posts a completion queue entry for an outstanding
  Asynchronous Event Request command and thus reports an asynchronous
  event, subsequent events of that event type are automatically masked by
  the controller until the host clears that event. An event is cleared by
  reading the log page associated with that event using the Get Log Page
  command (see section 5.14)."
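
One way to picture the masking rule quoted above is a per-controller
bitmask with one bit per event type. The sketch below is an assumption
for illustration only; struct demo_ctrl, demo_aen_post() and
demo_aen_clear() are hypothetical names and not the nvmet data
structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller state: one mask bit per event type. */
struct demo_ctrl {
	uint32_t aen_masked;
};

/*
 * Report an event of the given type unless that type is masked.
 * Reporting the event masks the type until the host clears it.
 */
static bool demo_aen_post(struct demo_ctrl *ctrl, unsigned int type)
{
	uint32_t bit = 1u << type;

	if (ctrl->aen_masked & bit)
		return false;	/* still masked, do not report again */
	ctrl->aen_masked |= bit;
	return true;
}

/* Called when the host reads the log page associated with the event. */
static void demo_aen_clear(struct demo_ctrl *ctrl, unsigned int type)
{
	ctrl->aen_masked &= ~(1u << type);
}
```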

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-06-01 14:37:35 +02:00
Christoph Hellwig c86b8f7b41 nvmet: add AEN configuration support
AEN configuration via the 'Get Features' and 'Set Features' admin
commands is mandatory, so we should be implementing handling for it.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[hch: use WRITE_ONCE, check for invalid values]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
2018-06-01 14:37:35 +02:00
Christoph Hellwig c16734ea98 nvmet: implement the changed namespaces log
Just keep a per-controller buffer of changed namespaces and copy it out
in the get log page implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
2018-06-01 14:37:35 +02:00
Christoph Hellwig 8ab0805f11 nvmet: split log page implementation
Remove the common code to allocate a buffer and copy it into the SGL.
Instead the two no-op implementations just zero the SGL directly, and
the smart log allocates a buffer on its own.  This prepares for the
more elaborate ANA log page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-06-01 14:37:35 +02:00
Christoph Hellwig c7759fff22 nvmet: add a new nvmet_zero_sgl helper
Zeroes the SGL in the payload.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-06-01 14:37:35 +02:00
Wei Yongjun 1367bc8285 nvmet: fix error return code in nvmet_file_ns_enable()
Fix to return error code -ENOMEM from the memory alloc fail error
handling case instead of 0, as done elsewhere in this function.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-31 18:46:46 +02:00
Wei Yongjun 81cf54e01a nvmet: fix a typo in nvmet_file_ns_enable()
Fix a typo in nvmet_file_ns_enable().

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-31 18:46:46 +02:00
Christoph Hellwig fe4a97918d nvme-loop: add support for multiple ports
This is useful at least for multipath testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-05-30 08:05:18 +02:00
Jens Axboe b7405176b5 Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/block
Pull NVMe changes from Christoph:

"Here is the current batch of nvme updates for 4.18, we have a few more
 patches in the queue, but I'd like to get this pile into your tree
 and linux-next ASAP.

 The biggest item is support for file-backed namespaces in the NVMe
 target from Chaitanya, in addition to that we mostly have small fixes
 from all the usual suspects."

* 'nvme-4.18-2' of git://git.infradead.org/nvme:
  nvme: fixup memory leak in nvme_init_identify()
  nvme: fix KASAN warning when parsing host nqn
  nvmet-loop: use nr_phys_segments when map rq to sgl
  nvmet-fc: increase LS buffer count per fc port
  nvmet: add simple file backed ns support
  nvmet: remove duplicate NULL initialization for req->ns
  nvmet: make a few error messages more generic
  nvme-fabrics: allow duplicate connections to the discovery controller
  nvme-fabrics: centralize discovery controller defaults
  nvme-fabrics: remove unnecessary controller subnqn validation
  nvme-fc: remove setting DNR on exception conditions
  nvme-rdma: stop admin queue before freeing it
  nvme-pci: Fix AER reset handling
  nvme-pci: set nvmeq->cq_vector after alloc cq/sq
  nvme: host: core: fix precedence of ternary operator
  nvme: fix lockdep warning in nvme_mpath_clear_current_path
2018-05-29 12:56:20 -06:00
Christoph Hellwig db8c48e4b2 nvme: return BLK_EH_DONE from ->timeout
NVMe always completes the request before returning from ->timeout, either
by polling for it, or by disabling the controller.  Return BLK_EH_DONE so
that the block layer doesn't even try to complete it again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Chaitanya Kulkarni eb464833a2 nvmet-loop: use nr_phys_segments when map rq to sgl
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
if a command contains data to be mapped.  This fixes the case where
a struct request contains LBAs, but no data will actually be sent,
e.g. the pending Write Zeroes support.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
James Smart 17d78252ee nvmet-fc: increase LS buffer count per fc port
Today's limit on concurrent LS's is very small - 4 buffers. With large
subsystem counts or large numbers of initiators connecting, the limit
may be exceeded.

Raise the LS buffer count to 256.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni d5eff33ee6 nvmet: add simple file backed ns support
This patch adds simple file backed namespace support for NVMeOF target.

The new file io-cmd-file.c is responsible for handling the code for I/O
commands when ns is file backed. Also, we introduce mempools based slow
path using sync I/Os for file backed ns to ensure forward progress under
reclaim.

The old block device based implementation is moved to io-cmd-bdev.c and
uses a "nvmet_bdev_" symbol prefix.  The enable/disable calls are also
moved into the respective files.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: updated changelog, fixed double req->ns lookup in bdev case]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni 618cff4285 nvmet: remove duplicate NULL initialization for req->ns
Remove the duplicate NULL initialization for req->ns.  req->ns is always
initialized to NULL in nvmet_req_init(), so there is no need to reset
it later on failures unless we have previously assigned a value to it.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Chaitanya Kulkarni b40b83e365 nvmet: make a few error messages more generic
"nvmet_check_ctrl_status()" is called from admin-cmd.c along
with io-cmd.c, make the error message more generic.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Linus Torvalds eb4f959b26 First pull request for 4.17-rc
- Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)
 - SPDX license tag cleanups (new tag Linux-OpenIB)
 - RoCE GID fixes related to default GIDs
 - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
   mlx4, mlx5, and hfi1 (medium batch)

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Doug Ledford:
 "This is our first pull request of the rc cycle. It's not that it's
  been overly quiet, we were just waiting on a few things before sending
  this off.

  For instance, the 6 patch series from Intel for the hfi1 driver had
  actually been pulled in on Tuesday for a Wednesday pull request, only
  to have Jason notice something I missed, so we held off for some
  testing, and then on Thursday had to respin the series because the
  very first patch needed a minor fix (unnecessary cast is all).

  There is a sizable hns patch series in here, as well as a reasonably
  largish hfi1 patch series, then all of the lines of uapi updates are
  just the change to the new official Linux-OpenIB SPDX tag (a bunch of
  our files had what amounts to a BSD-2-Clause + MIT Warranty statement
  as their license as a result of the initial code submission years ago,
  and the SPDX folks decided it was unique enough to warrant a unique
  tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4
  and core/cache/cma updates to round out the bunch.

  None of it was overly large by itself, but in the 2 1/2 weeks we've
  been collecting patches, it has added up :-/.

  As best I can tell, it's been through 0day (I got a notice about my
  last for-next push, but not for my for-rc push, but Jason seems to
  think that failure messages are prioritized and success messages not
  so much). It's also been through linux-next. And yes, we did notice in
  the context portion of the CMA query gid fix patch that there is a
  dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage
  and remove it anywhere we can.

  Summary:

   - Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off)

   - SPDX license tag cleanups (new tag Linux-OpenIB)

   - RoCE GID fixes related to default GIDs

   - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch),
     mlx4, mlx5, and hfi1 (medium batch)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits)
  RDMA/cma: Do not query GID during QP state transition to RTR
  IB/mlx4: Fix integer overflow when calculating optimal MTT size
  IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
  IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
  IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
  IB/hfi1: Fix loss of BECN with AHG
  IB/hfi1 Use correct type for num_user_context
  IB/hfi1: Fix handling of FECN marked multicast packet
  IB/core: Make ib_mad_client_id atomic
  iw_cxgb4: Atomically flush per QP HW CQEs
  IB/uverbs: Fix kernel crash during MR deregistration flow
  IB/uverbs: Prevent reregistration of DM_MR to regular MR
  RDMA/mlx4: Add missed RSS hash inner header flag
  RDMA/hns: Fix a couple misspellings
  RDMA/hns: Submit bad wr
  RDMA/hns: Update assignment method for owner field of send wqe
  RDMA/hns: Adjust the order of cleanup hem table
  RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set
  RDMA/hns: Remove some unnecessary attr_mask judgement
  RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set
  ...
2018-05-04 20:51:10 -10:00
Johannes Thumshirn 8bfc3b4c6f nvmet: switch loopback target state to connecting when resetting
After commit bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
resetting of the loopback nvme target failed as we forgot to switch
its state to NVME_CTRL_CONNECTING before we reconnect the admin
queues. Therefore the checks in nvmf_check_if_ready() choose to go to
the reject_io case and thus we couldn't send out an identify
controller command to reconnect.

Change the controller state to NVME_CTRL_CONNECTING after tearing down
the old connection and before re-establishing the connection.

Fixes: bb06ec3145 ("nvme: expand nvmf_check_if_ready checks")
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 09:37:50 -06:00
Greg Thelen d6fc6a22fc nvmet-rdma: depend on INFINIBAND_ADDR_TRANS
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 11:15:43 -04:00
James Smart bb06ec3145 nvme: expand nvmf_check_if_ready checks
The nvmf_check_if_ready() checks that were added are very simplistic.
As such, the routine allows a lot of cases to fail ios during windows
of reset or re-connection. In cases where there are no multi-path
options present, the error goes back to the caller - the filesystem
or application. Not good.

The common routine was rewritten and calling syntax slightly expanded
so that per-transport is_ready routines don't need to be present.
The transports now call the routine directly. The routine is now a
fabrics routine rather than an inline function.

The routine now looks at controller state to decide the action to
take. Some states mandate io failure. Others define the condition where
a command can be accepted.  When the decision is unclear, a generic
queue-or-reject check is made to look for failfast or multipath ios and
only fails the io if it is so marked. Otherwise, the io will be queued
and wait for the controller state to resolve.
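
The decision described in the paragraph above can be pictured roughly as
follows. This is a simplified sketch under stated assumptions: the enum
values and demo_check_ready() are hypothetical, and the mapping of
individual states to outcomes is chosen for illustration rather than
taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of controller states. */
enum demo_state { DEMO_LIVE, DEMO_CONNECTING, DEMO_RESETTING, DEMO_DELETING };

/* Possible outcomes for a command arriving on a queue. */
enum demo_action { DEMO_ACCEPT, DEMO_REQUEUE, DEMO_FAIL };

static enum demo_action demo_check_ready(enum demo_state state,
					 bool failfast_or_mpath)
{
	switch (state) {
	case DEMO_LIVE:
		return DEMO_ACCEPT;	/* command can be accepted */
	case DEMO_DELETING:
		return DEMO_FAIL;	/* state mandates io failure */
	default:
		/*
		 * Unclear cases: fail only ios marked failfast or
		 * multipath, queue everything else until the controller
		 * state resolves.
		 */
		return failfast_or_mpath ? DEMO_FAIL : DEMO_REQUEUE;
	}
}
```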

Admin commands issued via ioctl share a live admin queue with commands
from the transport for controller init. The ioctls could be intermixed
with the initialization commands. It's possible for the ioctl cmd to
be issued prior to the controller being enabled. To block this, the
ioctl admin commands need to be distinguished from admin commands used
for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to
reflect this division and set it on ioctls requests.  As the
nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(),
ensure that commands allocated by the ioctl path (actually anything
in core.c) preps the nvme_req(req) before starting the io. This will
preserve the USERCMD flag during execution and/or retry.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Daniel Verkamp c739969849 nvmet: fix space padding in serial number
Commit 42de82a8b5 previously attempted to fix this, and it did
correctly pad the MN and FR fields with spaces, but the SN field still
contains 0 bytes.  The current code fills out the first 16 bytes with
hex2bin, leaving the last 4 bytes zeroed.  Rather than adding a lot of
error-prone math to avoid overwriting SN twice, just set the whole thing
to spaces up front (it's only 20 bytes).
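
The approach can be sketched in userspace as below. This is not the
kernel code: fill_serial() and DEMO_SN_LEN are hypothetical names, and
snprintf() stands in for the kernel's hex conversion helper.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_SN_LEN 20	/* size of the SN field, per the message above */

/*
 * Hypothetical helper: set the whole field to spaces first, then
 * overwrite the first 16 bytes with the hex digits of the serial.
 * The trailing 4 bytes stay as spaces instead of zero bytes.
 */
static void fill_serial(char sn[DEMO_SN_LEN], unsigned long long serial)
{
	char hex[17];	/* 16 hex digits plus the terminating NUL */

	memset(sn, ' ', DEMO_SN_LEN);
	snprintf(hex, sizeof(hex), "%016llx", serial);
	memcpy(sn, hex, 16);
}
```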

Fixes: 42de82a8b5 ("nvmet: don't report 0-bytes in serial number")
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Rodrigo R. Galvao 543c09c89f nvmet: Fix nvmet_execute_write_zeroes sector count
We have to increment the number of logical blocks to a 1's based value
in the native format prior to converting to 512b units.
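
A minimal sketch of the conversion order, assuming a 0's based block
count and a block size given as a shift (write_zeroes_sectors() is a
hypothetical name, not the function touched by this patch):

```c
#include <stdint.h>

/*
 * Convert a 0's based count of logical blocks into 512-byte sectors.
 * The "+ 1" is applied in the native block size first, and only then
 * is the value scaled to 512b units.
 */
static uint64_t write_zeroes_sectors(uint16_t nlb_zero_based,
				     unsigned int blksize_shift)
{
	return ((uint64_t)nlb_zero_based + 1) << (blksize_shift - 9);
}
```

For 4096-byte blocks (shift 12) a count field of 0 means one block, that
is 8 sectors. Adding 1 after the shift would give 1 sector instead.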

Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Arnd Bergmann 6038aa532a nvme: target: fix buffer overflow
nvmet_execute_get_disc_log_page() passes a fixed-length string into
nvmet_format_discovery_entry(), which then does a longer memcpy() on
it, as pointed out by gcc-8:

In function 'nvmet_format_discovery_entry',
    inlined from 'nvmet_execute_get_disc_log_page' at drivers/nvme/target/discovery.c:126:4:
drivers/nvme/target/discovery.c:62:2: error: 'memcpy' forming offset [38, 223] is out of the bounds [0, 37] [-Werror=array-bounds]
  memcpy(e->subnqn, subsys_nqn, NVMF_NQN_SIZE);

Using strncpy() will make this well-defined, filling the rest of the
buffer with zeroes, under the assumption that the input is either
a NUL-terminated string, or a byte sequence containing no zeroes.
If the input is a string that is longer than NVMF_NQN_SIZE, we
continue to have no NUL-termination in the output.
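
The two strncpy() properties mentioned above can be seen in a small
userspace sketch. DEMO_NQN_SIZE is a deliberately tiny stand-in for
NVMF_NQN_SIZE and copy_nqn() is a hypothetical wrapper.

```c
#include <string.h>

#define DEMO_NQN_SIZE 8	/* small stand-in for NVMF_NQN_SIZE */

/*
 * Copy at most DEMO_NQN_SIZE bytes. A shorter source is padded with
 * zero bytes up to the full size; a source of DEMO_NQN_SIZE bytes or
 * more fills the buffer and leaves it without NUL termination.
 */
static void copy_nqn(char *dst, const char *src)
{
	strncpy(dst, src, DEMO_NQN_SIZE);
}
```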

Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Ming Lei 11d9ea6f2c nvme-loop: fix kernel oops in case of unhandled command
When nvmet_req_init() fails, __nvmet_req_complete() is called
to handle the target request via .queue_response(), so
nvme_loop_queue_response() shouldn't be called again for
handling the failure.

This patch fixes this case in the following way:

- move blk_mq_start_request() before nvmet_req_init(), so
nvme_loop_queue_response() may work well to complete this
host request

- don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq()

- don't call nvme_loop_queue_response() which is done via
.queue_response()

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[trimmed changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Christoph Hellwig e929f06d9e nvmet: constify struct nvmet_fabrics_ops
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Christoph Hellwig a5d1861229 nvmet: refactor configfs transport type handling
Have a common table of mappings from numerical transport ids to names, and
zero the transport specific area in common code in nvmet_addr_trtype_store.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy f871749a9f nvmet: move device_uuid configfs attr definition to suitable place
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy a3dd7d0022 nvmet-rdma: Don't flush system_wq by default during remove_one
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the system work queue.

Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Israel Rukshin e1a2ee249b nvmet-rdma: Fix use after free in nvmet_rdma_cm_handler()
We free nvmet rdma queues while handling rdma_cm events.
In order to avoid this we destroy the qp and the queue after destroying
the cm_id which guarantees that all rdma_cm events are done.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Israel Rukshin be9bddeb0a nvmet-rdma: Remove unused queue state
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
James Smart 9d625f7792 nvmet_fc: prevent new io rqsts in possible isr completions
When a bio completion calls back into the transport for a
back-end io device, the request completion path can free
the transport io job structure allowing it to be reused for
other operations. The transport has a defer_rcv queue which
holds temporary cmd rcv ops while waiting for io job structures.
When the job frees, if there's a cmd waiting, it is picked up
and submitted for processing, which can call back out to the
bio path if it's a read.  Unfortunately, what is unknown is the
context of the original bio done call, and it may be in a state
(softirq) that is not compatible with submitting the new bio in
the same calling sequence. This is especially true when using
scsi back-end devices as scsi is in softirq when it makes the
done call.

Correct by scheduling the io to be started via workq rather
than calling the start new io path inline to the original bio
done path.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Sagi Grimberg 4c65268588 nvmet: don't return "any" ip address in discovery log page
It's perfectly valid to assign an nvmet port to listen on "any"
IP address (traddr 0.0.0.0 for ipv4 address family) for IP based
transport ports. However, we must not return this address in
discovery log entries. Instead we need to return the address
where the request was accepted on (req->port address).

Since this is nvme transport specific, introduce an optional
.disc_traddr interface that is designed to check that a
port in question is bound to "any" IP address and if so, set
the traddr from the port where the request came from.
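
The selection logic can be sketched as follows for the ipv4 case. This
is only an illustration: demo_disc_traddr() is a hypothetical function,
not the .disc_traddr callback signature, and it checks just the
"0.0.0.0" wildcard named above.

```c
#include <string.h>

/*
 * Hypothetical helper: pick the address to report in a discovery log
 * entry. If the port listens on the ipv4 "any" address, report the
 * address on which the request was accepted instead.
 */
static const char *demo_disc_traddr(const char *port_traddr,
				    const char *req_traddr)
{
	if (strcmp(port_traddr, "0.0.0.0") == 0)
		return req_traddr;
	return port_traddr;
}
```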

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jens Axboe 468f098734 Merge branch 'for-jens' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Keith for 4.16-rc.

* 'for-jens' of git://git.infradead.org/nvme:
  nvmet: fix PSDT field check in command format
  nvme-multipath: fix sysfs dangerously created links
  nvme-pci: Fix nvme queue cleanup if IRQ setup fails
  nvmet-loop: use blk_rq_payload_bytes for sgl selection
  nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes
  nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
2018-02-28 12:18:58 -07:00
Max Gurtovoy bffd2b6167 nvmet: fix PSDT field check in command format
PSDT field section according to NVM_Express-1.3:
"This field specifies whether PRPs or SGLs are used for any data
transfer associated with the command. PRPs shall be used for all
Admin commands for NVMe over PCIe. SGLs shall be used for all Admin
and I/O commands for NVMe over Fabrics. This field shall be set to
01b for NVMe over Fabrics 1.0 implementations."

Suggested-by: Idan Burstein <idanb@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-02-28 10:10:22 -07:00
Christoph Hellwig 796b0b8d8d nvmet-loop: use blk_rq_payload_bytes for sgl selection
blk_rq_bytes does the wrong thing for special payloads like discards and
might cause the driver to not set up a SGL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-02-22 01:45:34 -07:00
Israel Rukshin 7756f72ccd nvmet: Change return code of discard command if not supported
Executing a discard command on a block device that doesn't support it
should return success.
Returning an internal error while using multi-path fails the path.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-14 15:38:59 +02:00
Linus Torvalds 0a4b6e2f80 Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "This is the main pull request for block IO related changes for the
  4.16 kernel. Nothing major in this pull request, but a good amount of
  improvements and fixes all over the map. This contains:

   - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
     Paolo.

   - Support for SMR zones for deadline and mq-deadline from Damien and
     Christoph.

   - Set of fixes for bcache by way of Michael Lyle, including fixes
     from himself, Kent, Rui, Tang, and Coly.

   - Series from Matias for lightnvm with fixes from Hans Holmberg,
     Javier, and Matias. Mostly centered around pblk, and the removing
     rrpc 1.2 in preparation for supporting 2.0.

   - A couple of NVMe pull requests from Christoph. Nothing major in
     here, just fixes and cleanups, and support for command tracing from
     Johannes.

   - Support for blk-throttle for tracking reads and writes separately.
     From Joseph Qi. A few cleanups/fixes also for blk-throttle from
     Weiping.

   - Series from Mike Snitzer that enables dm to register its queue more
     logically, something that's always been problematic on dm since
     it's a stacked device.

   - Series from Ming cleaning up some of the bio accessor use, in
     preparation for supporting multipage bvecs.

   - Various fixes from Ming closing up holes around queue mapping and
     quiescing.

   - BSD partition fix from Richard Narron, fixing a problem where we
     can't mount newer (10/11) FreeBSD partitions.

   - Series from Tejun reworking blk-mq timeout handling. The previous
     scheme relied on atomic bits, but it had races where we would think
     a request had timed out if it got reused at the wrong time.

   - null_blk now supports faking timeouts, to enable us to better
     exercise and test that functionality separately. From me.

   - Kill the separate atomic poll bit in the request struct. After
     this, we don't use the atomic bits on blk-mq anymore at all. From
     me.

   - sgl_alloc/free helpers from Bart.

   - Heavily contended tag case scalability improvement from me.

   - Various little fixes and cleanups from Arnd, Bart, Corentin,
     Douglas, Eryu, Goldwyn, and myself"

* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
  block: remove smart1,2.h
  nvme: add tracepoint for nvme_complete_rq
  nvme: add tracepoint for nvme_setup_cmd
  nvme-pci: introduce RECONNECTING state to mark initializing procedure
  nvme-rdma: remove redundant boolean for inline_data
  nvme: don't free uuid pointer before printing it
  nvme-pci: Suspend queues after deleting them
  bsg: use pr_debug instead of hand crafted macros
  blk-mq-debugfs: don't allow write on attributes with seq_operations set
  nvme-pci: Fix queue double allocations
  block: Set BIO_TRACE_COMPLETION on new bio during split
  blk-throttle: use queue_is_rq_based
  block: Remove kblockd_schedule_delayed_work{,_on}()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
  blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
  lib/scatterlist: Fix chaining support in sgl_alloc_order()
  blk-throttle: track read and write request individually
  block: add bdev_read_only() checks to common helpers
  block: fail op_is_write() requests to read-only partitions
  blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
  ...
2018-01-29 11:51:49 -08:00
Sagi Grimberg 423b4487fb nvmet: release a ns reference in nvmet_req_uninit if needed
nvmet_req_init looked up a namespace and took a reference on it (unless it
failed prior to that). If the request is uninitialized (in error cases) we
need to remove that reference in case it was taken, otherwise we leak a
namespace reference when calling nvmet_req_uninit.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15 17:09:32 +01:00
Roy Shterman b227c59b9b nvme: host delete_work and reset_work on separate workqueues
We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].

Also, given that delete_work flushes reset_work, host reset_work
on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
fix the flushing in the individual drivers to flush nvme_delete_wq
when draining queued deletes.

[1]:
[  178.491942] =============================================
[  178.492718] [ INFO: possible recursive locking detected ]
[  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
[  178.494382] ---------------------------------------------
[  178.495160] kworker/5:1/135 is trying to acquire lock:
[  178.495894]  (
[  178.496120] "nvme-wq"
[  178.496471] ){++++.+}
[  178.496599] , at:
[  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[  178.497670]
               but task is already holding lock:
[  178.498499]  (
[  178.498724] "nvme-wq"
[  178.499074] ){++++.+}
[  178.499202] , at:
[  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.500343]
               other info that might help us debug this:
[  178.501269]  Possible unsafe locking scenario:

[  178.502113]        CPU0
[  178.502472]        ----
[  178.502829]   lock(
[  178.503115] "nvme-wq"
[  178.503467] );
[  178.503716]   lock(
[  178.504001] "nvme-wq"
[  178.504353] );
[  178.504601]
                *** DEADLOCK ***

[  178.505441]  May be due to missing lock nesting notation

[  178.506453] 2 locks held by kworker/5:1/135:
[  178.507068]  #0:
[  178.507330]  (
[  178.507598] "nvme-wq"
[  178.507726] ){++++.+}
[  178.508079] , at:
[  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.509004]  #1:
[  178.509265]  (
[  178.509532] (&ctrl->delete_work)
[  178.509795] ){+.+.+.}
[  178.510145] , at:
[  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.511070]
               stack backtrace:
:
[  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
[  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[  178.518443] Call Trace:
[  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
[  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
[  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
[  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40

Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15 17:09:30 +01:00
Israel Rukshin 6b1943af3f nvmet: rearrange nvmet_ctrl_free()
Make it symmetric to nvmet_alloc_ctrl().

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:59 +01:00
Israel Rukshin eca19dc1d8 nvmet: fix error flow in nvmet_alloc_ctrl()
Remove the allocated id on error.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:58 +01:00
Roy Shterman 0de5cd367c nvme-fabrics: protect against module unload during create_ctrl
NVMe transport driver module unload may (and usually does) trigger
iteration over the active controllers and delete them all (sometimes
under a mutex).  However, a controller can be created concurrently with
module unload which can lead to leakage of resources (most importantly char
device node leakage) in case the controller creation occurred after the
unload delete and drain sequence.  To protect against this, we take a
module reference to guarantee that the nvme transport driver is not
unloaded while creating a controller.
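
The guard described above can be illustrated with a small userspace model.
This is only a sketch of the idea, not the patch itself: every model_* name
below is hypothetical, the struct is not a kernel type, and the -EBUSY return
value is an assumption made for the model. It mimics the behaviour of
try_module_get(), which refuses to take a reference once unload has begun.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a transport module; not a kernel type. */
struct model_module {
	int refcnt;		/* references currently held on the module */
	bool unloading;		/* set once unload has begun */
};

/* Mirrors the idea of try_module_get(): refuse once unload has started. */
static bool model_module_try_get(struct model_module *m)
{
	if (m->unloading)
		return false;
	m->refcnt++;
	return true;
}

static void model_module_put(struct model_module *m)
{
	m->refcnt--;
}

/* Unload may only proceed while nobody holds a reference. */
static bool model_module_may_unload(const struct model_module *m)
{
	return m->refcnt == 0;
}

/* Sample create callback: succeeds only if the module is pinned right now. */
static int model_create_sees_pin(struct model_module *m)
{
	return model_module_may_unload(m) ? -1 : 0;
}

/*
 * Controller creation pinned by a module reference. The callback stands in
 * for the transport's create_ctrl operation (an assumption of this model).
 */
static int model_create_ctrl(struct model_module *m,
			     int (*create)(struct model_module *m))
{
	int ret;

	if (!model_module_try_get(m))
		return -EBUSY;	/* transport is going away */
	ret = create(m);
	model_module_put(m);
	return ret;
}
```

In this model, calling model_create_ctrl() on an idle module runs the
callback with the reference held and drops it afterwards, while a call made
after unloading is set fails before any controller state is created.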

Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:56 +01:00
James Smart 9ce1f2e12e nvmet-fc: cleanup nvmet add_port/remove_port
The current fc transport add_port routine validates that there is a
matching port to the target port config. It then takes a reference
on the targetport. The del_port removes the reference.

Unfortunately, if the LLDD undergoes a hw reset or driver unload and
wants to unreg the targetport, due to the reference, the targetport
effectively can't be removed. It requires the admin to remove the
port from the nvmet config first, which calls the del_port.
Note: it appears nvmetcli clear skips over the del_port call (I'm
not attempting to change that).

There's no real reason to take the reference. With FC, there is nothing
to enable or disable as the presence of the FC targetport implicitly
means it's enabled, and removal of the targetport means it's disabled.

Change add_port to simply validate and change remove_port to a noop.
No references are taken on the targetport.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:56 +01:00
James Smart b6f807738b nvme_fcloop: refactor host/target io job access
The split between what the host accesses on its flows vs what the
target side accesses was flawed. Abort handling didn't properly
clear initiator vs target structure cross-reference and locks
weren't used for synchronization. Thus, there were issues of
freeing structures too soon and access after free.

A couple of these existed pre the IN_ISR mods, but when the
target upcalls were converted to work items, thus adding delays
between the 2 sides of accesses, the problems became pronounced.

Resolve by:
- tracking io state mainly in the tgt-side io structure.
- make the tgt-side io structure released by reference not by
  code flow.
- when changing initiator structures, use locks for
  synchronization
- aborts are clearly tracked for which side saw the abort, and
  after seeing the abort, cross-references are cleared under lock.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:55 +01:00
James Smart 24431d60d3 nvme_fcloop: rework to remove xxx_IN_ISR feature flags
The existing fcloop driver expects the target side upcalls to
the transport to context switch, thus the calls into the nvmet layer
are not done in the calling context of the host/initiator down calls.
The xxx_IN_ISR feature flags are used to select this logic.

The xxx_IN_ISR feature flags should go away in the nvmet_fc transport
as no other lldd utilizes them. Both Broadcom and Cavium lldds have their
own non-ISR deferred handlers thus the nvmet calls can be made directly.

This patch converts the paths that make the target upcalls (command
receive, abort receive) such that they schedule a work item rather
than expecting the transport to schedule the work item.

The patch also cleans up the following:
- The completion path from target to host scheduled a host work
  element called "work". Rename it "tio_done_work" for code clarity.
- The abort io path called an iniwork item to call the host side
  io done. This is no longer needed as the abort routine can make
  the same call.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:54 +01:00
James Smart 6fda20283e nvme_fcloop: disassociate local port structs
The current fcloop driver gets its lport structure from the private
area co-allocated with the fc_localport. All is fine except the
teardown path, which wants to wait on the completion, which is marked
complete by the delete_localport callback performed after
unregister_localport.  The issue is, the nvme_fc transport frees the
localport structure immediately after delete_localport is called,
meaning the original routine is trying to wait on a completion that
was just freed.

Change such that a lport struct is allocated coincident with the
addition and registration of a localport. The private area of the
localport now contains just a backpointer to the real lport struct.
Now, the completion can be waited for, and after completing, the
new structure can be kfree'd.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:54 +01:00
James Smart 278e096063 nvme_fcloop: fix abort race condition
A test case revealed a race condition of an i/o completing on a thread
parallel to the delete_association generating the aborts for the
outstanding ios on the controller.  The i/o completion was freeing the
target fcloop context, thus the abort task referenced the just-freed
memory.

Correct by clearing the target/initiator cross pointers in the io
completion and abort tasks before calling the callbacks. On aborts
that detect already finished io's, ensure the complete context is
called.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:53 +01:00
Sagi Grimberg 6a1c57acab nvmet: lower log level for each queue creation
It is a bit chatty to report on each queue, log it only for debug
purposes.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:52 +01:00
Sagi Grimberg 424125a09d nvmet-rdma: lowering log level for chatty debug messages
It is a bit chatty to report on every deleted queue, so keep it for debug
purposes only.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:52 +01:00
Sagi Grimberg cb4876e8ce nvmet-rdma: removed queue cleanup from module exit
We already do that when we are notified in device removal
which is triggered when unregistering as an ib client.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:01:51 +01:00
Bart Van Assche 68c6e9cd2f nvmet/rdma: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Bart Van Assche 4442b56fb5 nvmet/fc: Use sgl_alloc() and sgl_free()
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
James Smart 254beb84fa nvme-fcloop: avoid possible uninitialized variable warning
The kbuild test robot sent mail about a potential use of an uninitialized
variable - "tport" in fcloop_delete_targetport(), which then calls
__targetport_unreg(), which uses the variable. The variable can never
actually be uninitialized, as the call to __targetport_unreg() only
occurs if there is a valid nport pointer, and at the time the nport
pointer is assigned, the tport variable is set.

Remove the warning by assigning a NULL value initially.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29 10:37:21 +01:00
James Smart 619c62dcc6 nvmet-fc: correct ref counting error when deferred rcv used
Whenever a cmd is received a reference is taken while looking up the
queue. The reference is removed after the cmd is done as the iod is
returned for reuse. The fod may be reused for a deferred (received but
no job context) cmd.  Existing code removes the reference only if the
fod is not reused for another command. Given the fod may be used for
one or more ios, although a reference was taken per io, it won't be
matched on the frees.

Remove the reference on every fod free. This pairs the references to
each io.
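
The imbalance can be seen in a small userspace model. This is a sketch under
stated assumptions rather than the driver code: the model_* names and the
put_on_every_free switch are hypothetical and exist only to contrast the old
and new behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queue; not the real nvmet-fc structure. */
struct model_queue {
	int refs;		/* references currently held on the queue */
};

/* Each received command takes one reference during queue lookup. */
static void model_cmd_received(struct model_queue *q)
{
	q->refs++;
}

/*
 * Free the fod after a command completes.
 *
 * reused:            the fod is handed straight to a deferred command
 * put_on_every_free: true models the fixed behaviour (always drop the
 *                    reference), false models the old behaviour (drop it
 *                    only when the fod is not reused)
 */
static void model_fod_free(struct model_queue *q, bool reused,
			   bool put_on_every_free)
{
	if (put_on_every_free || !reused)
		q->refs--;
}
```

With two received commands, where the first completion reuses the fod for a
deferred command, the old behaviour leaves one reference outstanding, while
dropping a reference on every free brings the count back to zero.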

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20 08:38:12 +01:00
Sagi Grimberg 9d7fab04b9 nvme-loop: check if queue is ready in queue_rq
In case the queue is not LIVE (fully functional and connected at the nvmf
level), we cannot allow any commands other than connect to pass through.

Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect
and cleared in queue teardown.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20 08:28:36 +01:00
James Smart cce75291ff nvmet_fc: fix better length checking
Reorganize nvmet_fc_handle_fcp_rqst() so that the nvmet req.transfer_len
field is set after the call to nvmet_req_init(). An update to nvmet now
has nvmet_req_init() clearing the field, thus the fc transport was losing
the value.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-16 11:27:04 -07:00
Linus Torvalds e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the hole on blk/scsi-mq queue
     quiescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tuple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00