Commit Graph

2346 Commits

Author SHA1 Message Date
Christophe JAILLET cf3d008401 nvme-fc: improve memory usage in nvme_fc_rcv_ls_req()
sizeof( struct nvmefc_ls_rcv_op ) = 64
sizeof( union nvmefc_ls_requests ) = 1024
sizeof( union nvmefc_ls_responses ) = 128

So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
kzalloc() is called.

Because of the way memory allocations are performed, 2048 bytes are
allocated. So about 800 bytes are wasted for each request.

Switch to 3 distinct memory allocations, in order to:
   - save these 800 bytes
   - avoid zeroing this extra memory
   - make sure that memory is properly aligned in case of DMA access
    ("fc_dma_map_single(lsop->rspbuf)" just a few lines below)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:50:30 +01:00
Bean Huo d5ceb4d1c5 nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro
Added a quirk to fix Micron Nitro NVMe reporting duplicate NGUIDs.

Cc: <stable@vger.kernel.org>
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15 10:48:59 +01:00
Keith Busch d7ac8dca93 nvme: quiet user passthrough command errors
The driver is spamming the kernel logs for entirely harmless errors from
user space submitting unsupported commands. Just silence the errors.
The application has direct access to command status, so there's no need
to log these.

And since every passthrough command now uses the quiet flag, move the
setting to the common initializer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09 14:28:27 +01:00
Chao Leng 98d81f0df7 nvme: use blk_mq_[un]quiesce_tagset
All controller namespaces share the same tagset, so we can use this
interface which does the optimal operation for parallel quiesce based on
the tagset type(e.g. blocking tagsets and non-blocking tagsets).

nvme connect_q should not be quiesced when quiesce tagset, so set the
QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.

Currently we use NVME_NS_STOPPED to ensure pairing quiescing and
unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
In addition, we never really quiesce a single namespace. It is a better
choice to move the flag from ns to ctrl.

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 483239c75b blk-mq: pass a tagset to blk_mq_wait_quiesce_done
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 2b4c2355c5 nvme-apple: don't unquiesce the I/O queues in apple_nvme_reset_work
apple_nvme_reset_work schedules apple_nvme_remove, to be called, which
will call apple_nvme_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig bad3e021ae nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl
nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
call nvme_dev_disable and unquiesce the I/O queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig cd50f9b247 nvme: split nvme_kill_queues
nvme_kill_queues does two things:

 1) mark the gendisk of all namespaces dead
 2) unquiesce all I/O queues

These used to be be intertwined due to block layer issues, but aren't
any more.  So move the unquiscing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 6bcd5089ee nvme: don't unquiesce the admin queue in nvme_kill_queues
None of the callers of nvme_kill_queues needs it to unquiesce the
admin queues, as all of them already do it themselves:

 1) nvme_reset_work explicit call nvme_start_admin_queue toward the
    beginning of the function.  The extra call to nvme_start_admin_queue
    in nvme_reset_work this won't do anything as
    NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at
    the very beginning of the function if the PCIe device was not present,
    which is the precondition for the call to nvme_kill_queues.
    nvme_dev_disable already calls nvme_start_admin_queue toward the
    end of the function when the shutdown flag is set to true, so the
    admin queue is already enabled at this point.
 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
    which will end up in nvme_remove, which calls nvme_dev_disable with
    the shutdown flag.  This case will call nvme_start_admin_queue a bit
    later than before.
 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
    above.
 5) nvme_remove_namespaces only calls nvme_kill_queues when the
    controller is in the DEAD state.  That can only happen in the PCIe
    driver, and only from nvme_remove. See item 2) above for the
    conditions there.

So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig fde776afdd nvme: remove the NVME_NS_DEAD check in nvme_validate_ns
At the point where namespaces are marked dead, the controller is in a
non-live state and we won't get pass the identify commands.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 4f17344e9d nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthrough_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 23a908647e nvme: don't remove namespaces in nvme_passthru_end
The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more.  Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:34 -06:00
Christoph Hellwig 0ffc7e98bf nvme-pci: refactor the tagset handling in nvme_reset_work
The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted.  Refactor it with a two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de
[axboe: fix whitespace issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:35:19 -06:00
Christoph Hellwig 71b26083d5 block: set the disk capacity to 0 in blk_mark_disk_dead
nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later.  Move it to the
common code.

This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02 08:32:31 -06:00
Keith Busch fe8714b04f nvme-multipath: set queue dma alignment to 3
NVMe spec requires all transports support dword aligned addresses, which
is already set in the namespace request_queue. Set the same limit in the
multipath device's request_queue as well.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:53 -07:00
Sagi Grimberg 83e1226b0e nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
When destroying a queue, when calling sock_release, the network stack
might need to allocate an skb to send a FIN/RST. When that happens
during memory pressure, there is a need to reclaim memory, which
in turn may ask the nvme-tcp device to write out dirty pages, however
this is not possible due to a ctrl teardown that is going on.

Set PF_MEMALLOC to the task that releases the socket to grant access
to PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp
thread as this may also originate from the swap itself and should
be more resilient to memory pressure situations.

This fixes the following lockdep complaint:
--
======================================================
 WARNING: possible circular locking dependency detected
 6.0.0-rc2+ #25 Tainted: G        W
 ------------------------------------------------------
 kswapd0/92 is trying to acquire lock:
 ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0

 but task is already holding lock:
 ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (fs_reclaim){+.+.}-{0:0}:
        fs_reclaim_acquire+0x11e/0x160
        kmem_cache_alloc_node+0x44/0x530
        __alloc_skb+0x158/0x230
        tcp_send_active_reset+0x7e/0x730
        tcp_disconnect+0x1272/0x1ae0
        __tcp_close+0x707/0xd90
        tcp_close+0x26/0x80
        inet_release+0xfa/0x220
        sock_release+0x85/0x1a0
        nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
        nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
        nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
        kernfs_fop_write_iter+0x356/0x530
        vfs_write+0x4e8/0xce0
        ksys_write+0xfd/0x1d0
        do_syscall_64+0x58/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd

 -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
        __lock_acquire+0x2a0c/0x5690
        lock_acquire+0x18e/0x4f0
        lock_sock_nested+0x37/0xc0
        tcp_sendpage+0x23/0xa0
        inet_sendpage+0xad/0x120
        kernel_sendpage+0x156/0x440
        nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
        nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
        __blk_mq_try_issue_directly+0x452/0x660
        blk_mq_plug_issue_direct.constprop.0+0x207/0x700
        blk_mq_flush_plug_list+0x6f5/0xc70
        __blk_flush_plug+0x264/0x410
        blk_finish_plug+0x4b/0xa0
        shrink_lruvec+0x1263/0x1ea0
        shrink_node+0x736/0x1a80
        balance_pgdat+0x740/0x10d0
        kswapd+0x5f2/0xaf0
        kthread+0x256/0x2f0
        ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sk_lock-AF_INET-NVME);
                               lock(fs_reclaim);
  lock(sk_lock-AF_INET-NVME);

 *** DEADLOCK ***

3 locks held by kswapd0/92:
 #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
 #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
 #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:50 -07:00
Nam Cao 5fa9add66b nvme-tcp: replace sg_init_marker() with sg_init_table()
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an
uninitialized scatterlist. This is probably fine, but gcc complains:

  CC [M]  drivers/nvme/host/tcp.o
In file included from ./include/linux/dma-mapping.h:10,
                 from ./include/linux/skbuff.h:31,
                 from ./include/net/net_namespace.h:43,
                 from ./include/linux/netdevice.h:38,
                 from ./include/net/sock.h:46,
                 from drivers/nvme/host/tcp.c:12:
In function ‘sg_mark_end’,
    inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2,
    inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2:
./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized]
  234 |         sg->page_link |= SG_END;
      |         ~~^~~~~~~~~~~
drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’:
drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here
  388 |         struct scatterlist sg;
      |                            ^~
cc1: all warnings being treated as errors

Use sg_init_table() instead, which basically memset the scatterlist to
zero first before calling sg_init_marker().

Signed-off-by: Nam Cao <namcaov@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25 08:07:50 -07:00
Christoph Hellwig 941f7298c7 nvme-apple: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the
apple_nvme structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:35 -06:00
Christoph Hellwig 7dcebef90d nvme-pci: remove an extra queue reference
Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the nvme_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:35 -06:00
Christoph Hellwig 2b3f056f72 blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue
The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference.  Move the call to
blk_put_queue into the callers to allow removing the extra references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25 08:25:10 -06:00
Serge Semin c94b7f9bab nvme-hwmon: kmalloc the NVME SMART log buffer
Recent commit 52fde2c07d ("nvme: set dma alignment to dword") has
caused a regression on our platform.

It turned out that the nvme_get_log() method invocation caused the
nvme_hwmon_data structure instance corruption.  In particular the
nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with
garbage.  After some research we discovered that the problem happened
even before the actual NVME DMA execution, but during the buffer mapping.
Since our platform is DMA-noncoherent, the mapping implied the cache-line
invalidations or write-backs depending on the DMA-direction parameter.
In case of the NVME SMART log getting the DMA was performed
from-device-to-memory, thus the cache-invalidation was activated during
the buffer mapping.  Since the log-buffer isn't cache-line aligned, the
cache-invalidation caused the neighbour data to be discarded.  The
neighbouring data turned to be the data surrounding the buffer in the
framework of the nvme_hwmon_data structure.

In order to fix that we need to make sure that the whole log-buffer is
defined within the cache-line-aligned memory region so the
cache-invalidation procedure wouldn't involve the adjacent data. One of
the option to guarantee that is to kmalloc the DMA-buffer [1]. Seeing the
rest of the NVME core driver prefer that method it has been chosen to fix
this problem too.

Note after a deeper researches we found out that the denoted commit wasn't
a root cause of the problem. It just revealed the invalidity by activating
the DMA-based NVME SMART log getting performed in the framework of the
NVME hwmon driver. The problem was here since the initial commit of the
driver.

[1] Documentation/core-api/dma-api-howto.rst

Fixes: 400b6a7b13 ("nvme: Add hardware monitoring support")
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 12:43:13 +02:00
Christoph Hellwig 6b8cf94005 nvme-hwmon: consistently ignore errors from nvme_hwmon_init
An NVMe controller works perfectly fine even when the hwmon
initialization fails.  Stop returning errors that do not come from a
controller reset from nvme_hwmon_init to handle this case consistently.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
2022-10-19 12:42:58 +02:00
Russell King (Oracle) d622f8477a nvme-apple: don't limit DMA segement size
NVMe uses PRPs for data transfers and has no specific limit for a single
DMA segement.  Limiting the size will cause problems because the block
layer assumes PRP-ish devices using a virt boundary mask don't have a
segment limit.  And while this is true, we also really need to tell the
DMA mapping layer about it, otherwise dma-debug will trip over it.

Fixes: 5bd2927ace ("nvme-apple: Add initial Apple SoC NVMe driver")
Suggested-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[hch: rewrote the commit message based on the PCIe commit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
2022-10-19 10:36:39 +02:00
Xander Li ac9b57d4e1 nvme-pci: disable write zeroes on various Kingston SSD
Kingston SSDs do support NVMe Write_Zeroes cmd but take long time to
process.  The firmware version is locked by these SSDs, we can not expect
firmware improvement, so disable Write_Zeroes cmd.

Signed-off-by: Xander Li <xander_li@kingston.com.tw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 10:36:39 +02:00
Dan Carpenter 4739824e2d nvme: fix error pointer dereference in error handling
There is typo here so it releases the wrong variable.  "ctrl->admin_q"
was intended instead of "ctrl->fabrics_q".

Fixes: fe60e8c534 ("nvme: add common helpers to allocate and free tagsets")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-19 10:36:39 +02:00
Sagi Grimberg 72e3b8883a nvme-multipath: fix possible hang in live ns resize with ANA access
When we revalidate paths as part of ns size change (as of commit
e7d65803e2), it is possible that during the path revalidation, the
only paths that is IO capable (i.e. optimized/non-optimized) are the
ones that ns resize was not yet informed to the host, which will cause
inflight requests to be requeued (as we have available paths but none
are IO capable). These requests on the requeue list are waiting for
someone to resubmit them at some point.

The IO capable paths will eventually notify the ns resize change to the
host, but there is nothing that will kick the requeue list to resubmit
the queued requests.

Fix this by always kicking the requeue list, and if no IO capable path
exists, these requests will be queued again.

A typical log that indicates that IOs are requeued:
--
nvme nvme1: creating 4 I/O queues.
nvme nvme1: new ctrl: "testnqn1"
nvme nvme2: creating 4 I/O queues.
nvme nvme2: mapped 4/0/0 default/read/poll queues.
nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
nvme nvme1: rescanning namespaces.
nvme1n1: detected capacity change from 2097152 to 4194304
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
nvme nvme2: rescanning namespaces.
--

Reported-by: Yogev Cohen <yogev@lightbitslabs.com>
Fixes: e7d65803e2 ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:58 +02:00
Xi Ruoyao d5d3c100ac nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs
ZHITAI TiPro5000 SSDs has the same APST sleep problem as its cousin,
TiPro7000.  The quirk for TiPro7000 has been added in
commit 6b961bce50 ("nvme-pci: avoid the deepest sleep state on
ZHITAI TiPro7000 SSDs"), use the same quirk for TiPro5000.

The ASPT data from "nvme id-ctrl /dev/nvme1":

vid       : 0x1e49
ssvid     : 0x1e49
sn        : ZTA21T0KA2227304LM
mn        : ZHITAI TiPlus5000 1TB
fr        : ZTA09139
[...]
ps    0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
         rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
         rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
         rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
         rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0025W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4
         rwt:4 rwl:4 idle_power:- active_power:-

Reported-and-tested-by: Chang Feng <flukehn@gmail.com>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:58 +02:00
Abhijit 80b2624094 nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760
Add a quirk to fix Lexar NM760 SSD drives reporting duplicate nsids.

Signed-off-by: Abhijit <abhijit@abhijittomar.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:42:45 +02:00
Sagi Grimberg c4abd87571 nvme-tcp: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
	inflight or scheduled (specifically also .stop_ctrl
	which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
	to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing (see
stack trace [1]).

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

[1]:
--
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? scan_shadow_nodes+0x40/0x40
 schedule+0x55/0xe0
 io_schedule+0x16/0x40
 do_read_cache_page+0x55d/0x850
 ? __page_cache_alloc+0x90/0x90
 read_cache_page+0x12/0x20
 read_part_sector+0x3f/0x110
 amiga_partition+0x3d/0x3e0
 ? osf_partition+0x33/0x220
 ? put_partition+0x90/0x90
 bdev_disk_changed+0x1fe/0x4d0
 blkdev_get_whole+0x7b/0x90
 blkdev_get_by_dev+0xda/0x2d0
 device_add_disk+0x356/0x3b0
 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
 ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
 nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
 nvme_mpath_add_disk+0x120/0x160 [nvme_core]
 nvme_alloc_ns+0x594/0xa00 [nvme_core]
 nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
 ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
 nvme_scan_work+0x281/0x410 [nvme_core]
 process_one_work+0x1be/0x380
 worker_thread+0x37/0x3b0
 ? process_one_work+0x380/0x380
 kthread+0x12d/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
      Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D
 stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? sched_clock+0x9/0x10
 schedule+0x55/0xe0
 schedule_timeout+0x24b/0x2e0
 ? try_to_wake_up+0x358/0x510
 ? finish_task_switch+0x88/0x2c0
 wait_for_completion+0xa5/0x110
 __flush_work+0x144/0x210
 ? worker_attach_to_pool+0xc0/0xc0
 flush_work+0x10/0x20
 nvme_remove_namespaces+0x41/0xf0 [nvme_core]
 nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
 nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
 dev_attr_store+0x14/0x30
 sysfs_kf_write+0x38/0x50
 kernfs_fop_write_iter+0x146/0x1d0
 new_sync_write+0x114/0x1b0
 ? intel_pmu_handle_irq+0xe0/0x420
 vfs_write+0x18d/0x270
 ksys_write+0x61/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
--

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
Sagi Grimberg a1ae8d4d9b nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
        inflight or scheduled (specifically also .stop_ctrl
        which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
        to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing.

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

Fixes: b435ecea2a ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432cb ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-12 11:35:46 +02:00
Linus Torvalds 7c989b1da3 for-6.1/passthrough-2022-10-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98
 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe
 STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H
 WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM
 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM
 bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A
 PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU
 lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY
 R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ
 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc
 V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg
 3ag39nZQVQ==
 =nwLk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux

Pull passthrough updates from Jens Axboe:
 "With these changes, passthrough NVMe support over io_uring now
  performs at the same level as block device O_DIRECT, and in many cases
  6-8% better.

  This contains:

   - Add support for fixed buffers for passthrough (Anuj, Kanchan)

   - Enable batched allocations and freeing on passthrough, similarly to
     what we support on the normal storage path (me)

   - Fix from Geert fixing an issue with !CONFIG_IO_URING"

* tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux:
  io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy
  nvme: wire up fixed buffer support for nvme passthrough
  nvme: pass ubuffer as an integer
  block: extend functionality to map bvec iterator
  block: factor out blk_rq_map_bio_alloc helper
  block: rename bio_map_put to blk_mq_map_bio_put
  nvme: refactor nvme_alloc_request
  nvme: refactor nvme_add_user_metadata
  nvme: Use blk_rq_map_user_io helper
  scsi: Use blk_rq_map_user_io helper
  block: add blk_rq_map_user_io
  io_uring: introduce fixed buffer support for io_uring_cmd
  io_uring: add io_uring_cmd_import_fixed
  nvme: enable batched completions of passthrough IO
  nvme: split out metadata vs non metadata end_io uring_cmd completions
  block: allow end_io based requests in the completion batch handling
  block: change request end_io handler to pass back a return value
  block: enable batched allocation for blk_mq_alloc_request()
  block: kill deprecated BUG_ON() in the flush handling
2022-10-07 09:35:50 -07:00
Linus Torvalds 513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Linus Torvalds 0a78a376ef for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
 +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
 A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
 hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
 PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
 fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
 OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
 oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
 W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
 9bWG6BM7Yw==
 =m9kX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Add supported for more directly managed task_work running.

   This is beneficial for real world applications that end up issuing
   lots of system calls as part of handling work. Normal task_work will
   always execute as we transition in and out of the kernel, even for
   "unrelated" system calls. It's more efficient to defer the handling
   of io_uring's deferred work until the application wants it to be run,
   generally in batches.

   As part of ongoing work to write an io_uring network backend for
   Thrift, this has been shown to greatly improve performance. (Dylan)

 - Add IOPOLL support for passthrough (Kanchan)

 - Improvements and fixes to the send zero-copy support (Pavel)

 - Partial IO handling fixes (Pavel)

 - CQE ordering fixes around CQ ring overflow (Pavel)

 - Support sendto() for non-zc as well (Pavel)

 - Support sendmsg for zerocopy (Pavel)

 - Networking iov_iter fix (Stefan)

 - Misc fixes and cleanups (Pavel, me)

* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-10-07 08:52:43 -07:00
Linus Torvalds 7bc6e90d7a block-6.0-2022-09-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM2W7wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplm+EACplcKQA0fDsHOAyhqi6Ms7VK38kYAItlK2
 ztSeIQONBUtEuzSD6HA6W5+3gPPWRwXhNrIRMQp6rbKY8WS7jtTP/XqNISxjtvPx
 SMCDnDj8ANE8ykgLxiNECh87Q8D04wu3mY7Rs45aIowvXYxv8ALxGZ3QjFL/vDPT
 BHY8+EBwHbLRcpmI+J/eg1Qa9EfAixCTo+rimAucGe4qHVHmPawEjGY2hgmIcrdo
 k/WTelymDVkbrlRRcERWTFxjRkmS/f6lMQlEj3sPPsFkl2ka/l6BT135FOCJI/v4
 AEOpGjSoLTr0E3b4BCzu0MlU9WsTflinSgWKqJt2J4jyhuWjyCALv4ZrEIetKOZI
 NO+iEiTs2oO/dlGB3x0YFEieDACFRC6/leBAYgJIsG01jRBnQfU9/evxr25pK7kd
 iax+y0f/ilL0/aWR61YhJ67OH+aPLcgwJ1TUHD9N9BCcvq85rk46bcvmRf+ar9JZ
 4srCnybHN9yhkPkxeZKuTiNqJ3xZSesglNxxDMOfU9iWY1CjQX/aOoBuydJkJz4m
 gH8tyfEcvJxdhuLI3mvo7Mv2WUAyS2ohG4k1EAFZbPUa0pAHkyM9b1ZvSnbHWa6Z
 k6yKWmv8WLFKD2VY28HrsMHYBBVDCcJNsDzLqydaKFTU2g8YEQdk3h70Ga3tbhwS
 k23Pfs5bcw==
 =SZZ1
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "A single NVMe pull request via Christoph with a few fixes that should
  go into the 6.0 release:

   - Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
     (Michael Kelley)

   - Disable Write Zeroes on Phison E3C/E4C (Tina Hsu)"

* tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linux:
  nvme-pci: disable Write Zeroes on Phison E3C/E4C
  nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
2022-09-30 09:33:33 -07:00
Kanchan Joshi 23fd22e55b nvme: wire up fixed buffer support for nvme passthrough
if io_uring sends passthrough command with IORING_URING_CMD_FIXED flag,
use the pre-registered buffer for IO (non-vectored variant). Pass the
buffer/length to io_uring and get the bvec iterator for the range. Next,
pass this bvec to block-layer and obtain a bio/request for subsequent
processing.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi 4d17448682 nvme: pass ubuffer as an integer
This is a prep patch. Modify nvme_submit_user_cmd and
nvme_map_user_request to take ubuffer as plain integer
argument, and do away with nvme_to_user_ptr conversion in callers.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi 470e900c80 nvme: refactor nvme_alloc_request
nvme_alloc_request expects a large number of parameters.
Split this out into two functions to reduce number of parameters.
First one retains the name nvme_alloc_request, while second one is
named nvme_map_user_request.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Kanchan Joshi 38c0ddab7b nvme: refactor nvme_add_user_metadata
Pass struct request rather than bio. It helps to kill a parameter, and
some processing clean-up too.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Anuj Gupta 7f05635764 nvme: Use blk_rq_map_user_io helper
User blk_rq_map_user_io instead of duplicating the same code at
different places

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:51:13 -06:00
Jens Axboe 851eb780de nvme: enable batched completions of passthrough IO
Now that the normal passthrough end_io path doesn't need the request
anymore, we can kill the explicit blk_mq_free_request() and just pass
back RQ_END_IO_FREE instead. This enables the batched completion from
freeing batches of requests at the time.

This brings passthrough IO performance at least on par with bdev based
O_DIRECT with io_uring. With this and batche allocations, peak performance
goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also
about 10% faster than previously, going from ~61M to ~67M IOPS.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe c0a7ba77e8 nvme: split out metadata vs non metadata end_io uring_cmd completions
By splitting up the metadata and non-metadata end_io handling, we can
remove any request dependencies on the normal non-metadata IO path. This
is in preparation for enabling the normal IO passthrough path to pass
the ownership of the request back to the block layer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe de671d6116 block: change request end_io handler to pass back a return value
Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.

In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:09 -06:00
Jens Axboe 5853a7b551 Merge branch 'for-6.1/io_uring' into for-6.1/passthrough
* for-6.1/io_uring: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-09-30 07:47:42 -06:00
Jens Axboe 736feaa3a0 Merge branch 'for-6.1/block' into for-6.1/passthrough
* for-6.1/block: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:47:38 -06:00
Christoph Hellwig fe6f04c079 nvme: remove nvme_ctrl_init_connect_q
Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig 6dfba1c09c nvme-fc: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_fc_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig 1864ea4615 nvme-fc: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code and use the chance the cleanup
the init_hctx methods a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:17 +02:00
Christoph Hellwig 18ecd97506 nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
Also update the sqsize field when capping the queue size, and remove the
check a queue size that is larger than sqsize given that sqsize is only
initialized from opts->queue_size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig cefa1032f1 nvme-rdma: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_rdma_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig 2d60738c8f nvme-rdma: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig de777825e4 nvme-tcp: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_tcp_ctrl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig 06427ca09b nvme-tcp: store the generic nvme_ctrl in set->driver_data
Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig fb8745d040 nvme-tcp: remove the unused queue_size member in nvme_tcp_queue
->nvme_tcp_queue is not used anywhere, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:16 +02:00
Christoph Hellwig fe60e8c534 nvme: add common helpers to allocate and free tagsets
Add common helpers to allocate and tear down the admin and I/O tag sets,
including the special queues allocated with them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27 14:44:15 +02:00
Keith Busch 6ee742fa8e nvme-pci: report the actual number of tagset maps
We've been reporting 2 maps regardless of whether the module parameter
asked for anything beyond the default queues. A consequence of this
means that blk-mq will reinitialize the all the hardware contexts and io
schedulers on every controller reset when the mapping is exactly the
same as before. This unnecessary overhead is adding several milliseconds
on a reset for environments that don't need it. Report the actual number
of mappings in use.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Rishabh Bhatnagar 61ce339f19 nvme-pci: set min_align_mask before calculating max_hw_sectors
If swiotlb is force enabled dma_max_mapping_size ends up calling
swiotlb_max_mapping_size which takes into account the min align mask for
the device.  Set the min align mask for nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.

Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Sagi Grimberg f46ef9e87c nvme: send a rediscover uevent when a persistent discovery controller reconnects
When a discovery controller is disconnected, no AENs will arrive to
notify the host about discovery log change events.

In order to solve this, send a uevent notification when a
persistent discovery controller reconnects. We add a new ctrl
flag NVME_CTRL_STARTED_ONCE that will be set on the first
start, and consecutive calls will find it set, and send the
event to userspace if the controller is a discovery controller.

Upon the event reception, userspace will re-read the discovery
log page and will act upon changes as it sees fit.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Sagi Grimberg bf093d9716 nvme: enumerate controller flags
We expect to grow a few of these flags for various purposes
so make them a proper enumeration.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:22:07 +02:00
Tina Hsu d14c273132 nvme-pci: disable Write Zeroes on Phison E3C/E4C
E3C/E4C SSDs do support the Write Zeroes command in theory, but have very
bad performance when using it.  As the firmware has been frozen for these
products we can not expect firmware improvements for it, so disable
Write Zeroes.

Signed-off-by: Tina Hsu <tina_hsu@phison.corp-partner.google.com>
[hch: update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:20:30 +02:00
Michael Kelley c292a337d0 nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
The IOC_PR_CLEAR and IOC_PR_RELEASE ioctls are
non-functional on NVMe devices because the nvme_pr_clear()
and nvme_pr_release() functions set the IEKEY field incorrectly.
The IEKEY field should be set only when the key is zero (i.e,
not specified).  The current code does it backwards.

Furthermore, the NVMe spec describes the persistent
reservation "clear" function as an option on the reservation
release command. The current implementation of nvme_pr_clear()
erroneously uses the reservation register command.

Fix these errors. Note that NVMe version 1.3 and later specify
that setting the IEKEY field will return an error of Invalid
Field in Command.  The fix will set IEKEY when the key is zero,
which is appropriate as these ioctls consider a zero key to
be "unspecified", and the intention of the spec change is
to require a valid key.

Tested on a version 1.4 PCI NVMe device in an Azure VM.

Fixes: 1673f1f08c ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Fixes: 1d277a637a ("NVMe: Add persistent reservation ops")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:20:29 +02:00
Keith Busch 1e866afd4b nvme: ensure subsystem reset is single threaded
The subsystem reset writes to a register, so we have to ensure the
device state is capable of handling that otherwise the driver may access
unmapped registers. Use the state machine to ensure the subsystem reset
doesn't try to write registers on a device already undergoing this type
of reset.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch 23e085b2de nvme: restrict management ioctls to admin
The passthrough commands already have this restriction, but the other
operations do not. Require the same capabilities for all users as all of
these operations, which include resets and rescans, can be disruptive.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch a8eb6c1ba4 nvme: copy firmware_rev on each init
The firmware revision can change on after a reset so copy the most
recent info each time instead of just the first time, otherwise the
sysfs firmware_rev entry may contain stale data.

Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Keith Busch bc8fb906b0 nvme: handle effects after freeing the request
If a reset occurs after the scan work attempts to issue a command, the
reset may quisce the admin queue, which blocks the scan work's command
from dispatching. The scan work will not be able to complete while the
queue is quiesced.

Meanwhile, the reset work will cancel all outstanding admin tags and
wait until all requests have transitioned to idle, which includes the
passthrough request. But the passthrough request won't be set to idle
until after the scan_work flushes, so we're deadlocked.

Fix this by handling the end effects after the request has been freed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354
Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27 09:15:56 +02:00
Jens Axboe de97fcb303 fs: add batch and poll flags to the uring_cmd_iopoll() handler
We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Kanchan Joshi 585079b6e4 nvme: wire up async polling for io passthrough commands
Store a cookie during submission, and use that to implement
completion-polling inside the ->uring_cmd_iopoll handler.
This handler makes use of existing bio poll facility.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-5-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Martin Belanger 02c57a82c0 nvme-tcp: print actual source IP address through sysfs "address" attr
TCP transport relies on the routing table to determine which source
address and interface to use when making a connection. Currently, there
is no way to tell from userspace where a connection was made. This
patch exposes the actual source address using a new field named
"src_addr=" in the "address" attribute.

This is needed to diagnose and identify connectivity issues. With the
source address we can infer the interface associated with each
connection.

This was tested with nvme-cli 2.0 to verify it does not have any
adverse effect. The new "src_addr=" field will simply be displayed in
the output of the "list-subsys" or "list -v" commands as shown here.

$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery
\
 +- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:28 +02:00
Keith Busch c4c22c5208 nvme-pci: move iod dma_len fill gaps
The 32-bit field, dma_len, packs better in the iod struct above the
dma_addr_t on 64-bit systems.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch c372cdd1ef nvme-pci: iod npages fits in s8
The largest allowed transfer is 4MB, which can use at most 1025 PRPs.
Each PRP is 8 bytes, so the maximum number of 4k nvme pages needed for
the iod_list is 3, which fits in an 's8' type.

While modifying this field, change the name to "nr_allocations" to
better represent that this is referring to the number of units allocated
from a dma_pool.

Also introduce a BUILD_BUG_ON to ensure we never accidently increase the
largest transfer limit beyond 127 chained prp lists.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch 52da4f3f5c nvme-pci: iod's 'aborted' is a bool
It's only true or false, so make this a bool to reflect that and save
some space in nvme_iod.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch a53232cb3a nvme-pci: remove nvme_queue from nvme_iod
We can get the nvme_queue from the req just as easily, so remove the
duplicate path to the same structure to save some space.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Daniel Wagner 4cde03d82e nvme: consider also host_iface when checking ip options
It's perfectly fine to use the same traddr and trsvcid more than once
as long we use different host interface. This is used in setups where
the host has more than one interface but the target exposes only one
traddr/trsvcid combination.

Use the same acceptance rules for host_iface as we have for
host_traddr.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:24 +02:00
Daniel Wagner 1c467e2595 nvme-rdma: handle number of queue changes
On reconnect, the number of queues might have changed.

In the case where we have more queues available than previously we try
to access queues which are not initialized yet.

The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.

Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:21 +02:00
Daniel Wagner 09035f8649 nvme-tcp: handle number of queue changes
On reconnect, the number of queues might have changed.

In the case where we have more queues available than previously we try
to access queues which are not initialized yet.

The other case where we have less queues than previously, the
connection attempt will fail because the target doesn't support the
old number of queues and we end up in a reconnect loop.

Thus, only start queues which are currently present in the tagset
limited by the number of available queues. Then we update the tagset
and we can start any new queue.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:19 +02:00
Wolfram Sang a8817cc09d nvme: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:53 +02:00
Linjun Bao 1abc696174 nvme: add comment for unaligned "fake" nqn
Current "fake" nqn field is "nqn.2014.08.org.nvmexpress:", it is
not aligned with the canonical version for history reasons.

Signed-off-by: Linjun Bao <meljbao@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 11:33:53 +02:00
Linus Torvalds 9ebc0ecb21 block-6.0-2022-09-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMbdksQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqm8EACPqE5eFwFtZZnYV8eFjtTQCA5aZ+u3HyF8
 Xby5FEZxwpRaaGPwUZ6T7A45ZVH5ka+JPvSQYeSTKxdjBVeqmAUVnlhqLjxeTvJt
 WhfSCrPq553eYIKoeUVGxLfQ5GhK4f/2noiEMUIEOMSsDyh4txm46MRNdfi5doj1
 2JlbJyoTIK0ZRoraAuZp07bZPFnbxtjSB9YJlUqyKpiSMbM+xocPRaB3Tpr42So8
 vUdgt2/tDZ2xNOf4gd1KqUlosIlaG0RWeZeXTU9knrQjPYtB77iM6MVDUGhgiYT/
 w25XYCbZiGt1YwRTOYFX9bE6jGuPDc54isRqCmPSjYwTAbeEd7sWytn5D1W0IkaF
 dH93mDaW+8/GrjdGVCJ4jLPRCfQvZZ8i5XlwXaXQfnTQ8okUelx5BJ6kEfKRUkhV
 nZD0r5MskH4dQv4ESluaeblrDWphMaznZLNjp+UXfiycsbE9l2TX1FOjyK9E7f7o
 4zGYiertjc7wj0OnnoDTf0wdpXI13xDqQGFhe1Lnz4tOrpTfzZqwyv4IkFo3VghR
 OocJFk4GSuKLwDPASDY/B+nynhQw5fft/fp458Q5pFY/2zItvmg2rxloFetUAz9i
 0kcrO4NuO1l6E20KoBED2HMrQOow68dqcCJTrRcXhDfUIRIk0XSg4Kd/nGJBhZZS
 5mv/hhkjjQ==
 =3RZd
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull via Christoph:
      - fix a use after free in nvmet (Bart Van Assche)
      - fix a use after free when detecting digest errors
        (Sagi Grimberg)
      - fix regression that causes sporadic TCP requests to time out
        (Sagi Grimberg)
      - fix two off by ones errors in the nvmet ZNS support
        (Dennis Maisenbacher)
      - requeue aen after firmware activation (Keith Busch)

 - Fix missing request flags in debugfs code (me)

 - Partition scan fix (Ming)

* tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block:
  block: add missing request flags to debugfs code
  nvme: requeue aen after firmware activation
  nvmet: fix mar and mor off-by-one errors
  nvme-tcp: fix regression that causes sporadic requests to time out
  nvme-tcp: fix UAF when detecting digest errors
  nvmet: fix a use-after-free
  block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
2022-09-09 15:03:08 -04:00
Keith Busch 371a982cd2 nvme: requeue aen after firmware activation
The driver prevents async event work while handling a processing paused
event, but someone needs to restart it after the controller returns to a
live state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216400
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-07 08:38:25 +02:00
Sagi Grimberg 3770a42bb8 nvme-tcp: fix regression that causes sporadic requests to time out
When we queue requests, we strive to batch as much as possible and also
signal the network stack that more data is about to be sent over a socket
with MSG_SENDPAGE_NOTLAST. This flag looks at the pending requests queued
as well as queue->more_requests that is derived from the block layer
last-in-batch indication.

We set more_request=true when we flush the request directly from
.queue_rq submission context (in nvme_tcp_send_all), however this is
wrongly assuming that no other requests may be queued during the
execution of nvme_tcp_send_all.

Due to this, a race condition may happen where:

 1. request X is queued as !last-in-batch
 2. request X submission context calls nvme_tcp_send_all directly
 3. nvme_tcp_send_all is preempted and schedules to a different cpu
 4. request Y is queued as last-in-batch
 5. nvme_tcp_send_all context sends request X+Y, however signals for
    both MSG_SENDPAGE_NOTLAST because queue->more_requests=true.

==> none of the requests is pushed down to the wire as the network
stack is waiting for more data, both requests timeout.

To fix this, we eliminate queue->more_requests and only rely on
the queue req_list and send_list to be not-empty.

Fixes: 122e5b9f3d ("nvme-tcp: optimize network stack with setting msg flags according to batch size")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06 06:40:44 +02:00
Sagi Grimberg 160f3549a9 nvme-tcp: fix UAF when detecting digest errors
We should also bail from the io_work loop when we set rd_enabled to true,
so we don't attempt to read data from the socket when the TCP stream is
already out-of-sync or corrupted.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06 06:40:40 +02:00
Linus Torvalds d895ec7938 block-6.0-2022-09-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMSIeAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR0EADSxpD2LFb6yyqirZscKr+i0NA3HWNzw8+m
 aZRq8Leb3iozOCtnUPdrIQRvUI5FhW7Q8DUa/vB314xoIbregawJxQi5Q2o8akr7
 LhlyRp9ElKSM4QnCiKBWob6YIUdjVlx0LWhZPdncgc2fAGSxGJNWVHHM9X0UHkpV
 Sr3eWFKyl0LJxV0wu42dT2VcgRMFrr8+9VuKkgBXA5u8p4CpEbJtIlo2b4XgKmwp
 8oeJkQFaSnOSI6Zt4p2oi75PIa27pOGm6MfNH366tgbikH261F28LMCTE3lInzgm
 Rh1U0pRhh36Q+Qf9X9m4AVd8qk31m/YaDgIN4rQXlnc+N2i2dweW2DBZfasQKHJX
 ofxheTOfnqogT9ZJ4T10rHCQmWAC2mVeOgMafQDV6lUuyTqvcaBGkCxBkh9ASIXi
 lpcnWRVzVsiN5YTYxXiXheKiuklcIUnfgTLSu7BMjz32smb/ZqrkPs3ZNAKr8d+q
 HmUd+9THZ560eGMcQgSod2B6DD0NbWvGzIWFthADP9xVA6y68rN+e43MG2TrzkcP
 h4tgnA63lZ8eSaBCfsr0yOGfNkY55OYh4BPH8ZxP7tyBgYwJO2+/09dQgn6EjJnB
 qHorVqd6OoLKWkb88yz9uGtObj00ll4Ks4HcXwW0eF8idjyCEph57BJoocrIIqmP
 IitYAkqtAg==
 =9Eyh
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - error handling fix for the new auth code (Hannes Reinecke)
     - fix unhandled tcp states in nvmet_tcp_state_change (Maurizio
       Lombardi)
     - add NVME_QUIRK_BOGUS_NID for Lexar NM610 (Shyamin Ayesh)

 - Add documentation for the ublk driver merged in this merge window
   (Ming)

* tag 'block-6.0-2022-09-02' of git://git.kernel.dk/linux-block:
  Documentation: document ublk
  nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change()
  nvmet-auth: add missing goto in nvmet_setup_auth()
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610
2022-09-02 16:44:30 -07:00
Shyamin Ayesh 200dccd07d nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610
Lexar NM610 reports bogus eui64 values that appear to be the same across
all drives. Quirk them out so they are not marked as "non globally unique"
duplicates.

Signed-off-by: Shyamin Ayesh <me@shyamin.com>
[patch formatting]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-31 07:57:28 +03:00
Bart Van Assche a4e1d0b76e block: Change the return type of blk_mq_map_queues() into void
Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22 10:07:53 -06:00
Linus Torvalds abe7a481aa block-6.0-2022-08-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL2SxQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl24D/4nxPDuhvDxLHpMUR/C2lCtAnqVhHtGLhIX
 QSw+GVA5LuJvuB0L5Zr7ODDILPY5ZyWI/F1FPVJUOJE1NJ3tFiH4WzFIkqtFtVCE
 2jFTXH63A/o/fyo9nscsZ1g6eEswSAbvenHEa9HNpjgFxz0lnXjrniP5VFPo5HNl
 F8/MO1CBkKmhsGazZn7o1J3Ws6RvApq59YzxHmVz1hFHPgJFN2KwIAQjY2+GGoOD
 ifpRBbZBCTzj2dEEFZHeK1aCYhTNP4VqbNnBDQPZHwEB3qkml5R9GhTlUe7Ej17t
 7o4A05efcm/24TXcODMHP5YaGA14otPUr8wQiJjuOIFLw8sMC61OyS9qDdu1IvyW
 JCnTtDkzwZZEkhXlraU1HmiLSaBjMEvd/2puxbcS9kISdO7baCLd4Oj7+8ThVhIf
 JHIt2x3vzKaCzWI93IMrw5iJFK0+NS+SLAD6eyuEgC71Rj5ooemxrBYxKBQ7jb3o
 GCC3SaU8lFmB1Z/zKo63gGS1b7eaCNGauNm9/gSe1jM8Sor4hlT0yeNRpHf7egAu
 1dUQdSwgon6EH6JOGX3CFSC9lnIEAew733QZLaYBqar2WHcq5Wpq6LcWUHhhujgB
 dSTeLY1Tnhs3GWvMFe4JH/YZilFpbMKzxuFPCV7sQScxzlaGusliX8kVkha/VPLK
 9rtT8uyJXg==
 =1/w+
 -----END PGP SIGNATURE-----

Merge tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request
     - print nvme connect Linux error codes properly (Amit Engel)
     - fix the fc_appid_store return value (Christoph Hellwig)
     - fix a typo in an error message (Christophe JAILLET)
     - add another non-unique identifier quirk (Dennis P. Kliem)
     - check if the queue is allocated before stopping it in nvme-tcp
       (Maurizio Lombardi)
     - restart admin queue if the caller needs to restart queue in
       nvme-fc (Ming Lei)
     - use kmemdup instead of kmalloc + memcpy in nvme-auth (Zhang
       Xiaoxu)

 - __alloc_disk_node() error handling fix (Rafael)

* tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block:
  block: Do not call blk_put_queue() if gendisk allocation fails
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
  nvme-tcp: check if the queue is allocated before stopping it
  nvme-fabrics: Fix a typo in an error message
  nvme-fabrics: parse nvme connect Linux error codes
  nvmet-auth: use kmemdup instead of kmalloc + memcpy
  nvme-fc: fix the fc_appid_store return value
  nvme-fc: restart admin queue if the caller needs to restart queue
2022-08-13 13:37:36 -07:00
Dennis P. Kliem f37527a09d nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
ADATA XPG GAMMIX S70 reports bogus eui64 values that appear to be the same
across all drives. Quirk them out so they are not marked as "non globally
unique" duplicates.

Signed-off-by: Dennis P. Kliem <dpkliem@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-11 14:10:16 +02:00
Maurizio Lombardi 2bff487f9a nvme-tcp: check if the queue is allocated before stopping it
When an error is detected and the host reconnects, the
nvme_tcp_error_recovery_work() function is called and starts
tearing down the io queues and de-allocating them;
If at the same time the "nvme" process deletes the controller via sysfs,
the nvme_tcp_delete_ctrl() gets called and waits until the
nvme_tcp_error_recovery_work() finishes its job; then starts
tearing down the io queues, but at this point they have already
been freed and the mutexes are destroyed.

Calling mutex_lock() against a destroyed mutex triggers a warning:

[ 1299.025575] nvme nvme1: Reconnecting in 10 seconds...
[ 1299.636449] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[ 1299.645262] ------------[ cut here ]------------
[ 1299.649949] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 1299.649971] WARNING: CPU: 4 PID: 104150 at kernel/locking/mutex.c:579 __mutex_lock+0x2d0/0x7dc

[ 1299.717934] CPU: 4 PID: 104150 Comm: nvme
[ 1299.828075] Call trace:
[ 1299.830526]  __mutex_lock+0x2d0/0x7dc
[ 1299.834203]  mutex_lock_nested+0x64/0xd4
[ 1299.838139]  nvme_tcp_stop_queue+0x54/0xe0 [nvme_tcp]
[ 1299.843211]  nvme_tcp_teardown_io_queues.part.0+0x90/0x280 [nvme_tcp]
[ 1299.849672]  nvme_tcp_delete_ctrl+0x6c/0xf0 [nvme_tcp]
[ 1299.854831]  nvme_do_delete_ctrl+0x108/0x120 [nvme_core]
[ 1299.860181]  nvme_sysfs_delete+0xec/0xf0 [nvme_core]
[ 1299.865179]  dev_attr_store+0x40/0x70

Fix the warning by checking if the queues are allocated
in the nvme_tcp_stop_queue(). If they are not, it makes no
sense to try to stop them.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:21:31 +02:00
Christophe JAILLET c50cd03dbe nvme-fabrics: Fix a typo in an error message
A 'c' is missing.
s/fabris/fabrics/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:21:31 +02:00
Amit Engel ec9e96b523 nvme-fabrics: parse nvme connect Linux error codes
This fixes the assumption that errval is an unsigned nvme error

Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:05:10 +02:00
Christoph Hellwig 9317d00144 nvme-fc: fix the fc_appid_store return value
"nvme-fc: fold t fc_update_appid into fc_appid_store" accidentally
changed the userspace interface for the appid attribute, because the code
that decrements "count" to remove a trailing '\n' in the parsing results
in the decremented value being incorrectly be returned from the sysfs
write.  Fix this by keeping an orig_count variable for the full length
of the write.

Fixes: c814153c83 ("nvme-fc: fold t fc_update_appid into fc_appid_store")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by:  Muneendra Kumar M <muneendra.kumar@broadcom.com>
2022-08-10 16:05:08 +02:00
Ming Lei 6fb271f1bc nvme-fc: restart admin queue if the caller needs to restart queue
Without restarting admin queue in __nvme_fc_abort_outstanding_ios(),
it leaves controller not capable of handling admin pt request, and
causes io hang.

Fixes it by restarting admin queue if the caller of __nvme_fc_abort_outstanding_ios
requires to restart queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-08-10 16:05:07 +02:00
Linus Torvalds c993e07be0 dma-mapping updates
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy,
    Christoph Hellwig)
  - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
  - allow the IOMMU code to communicate an optional DMA mapping length
    and use that in scsi and libata (John Garry)
  - split the global swiotlb lock (Tianyu Lan)
  - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
    Lukas Bulwahn, Robin Murphy)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y
 Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV
 k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw
 lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY
 3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj
 BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS
 THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV
 VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT
 23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e
 8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX
 VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1
 uek=
 =Bv19
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
   Murphy, Christoph Hellwig)

 - restructure the PCIe peer to peer mapping support (Logan Gunthorpe)

 - allow the IOMMU code to communicate an optional DMA mapping length
   and use that in scsi and libata (John Garry)

 - split the global swiotlb lock (Tianyu Lan)

 - various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
   Lukas Bulwahn, Robin Murphy)

* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
  swiotlb: fix passing local variable to debugfs_create_ulong()
  dma-mapping: reformat comment to suppress htmldoc warning
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
  RDMA/rw: drop pci_p2pdma_[un]map_sg()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  nvme-pci: convert to using dma_map_sgtable()
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  swiotlb: clean up some coding style and minor issues
  dma-mapping: update comment after dmabounce removal
  scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
  ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
  scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
  ...
2022-08-06 10:56:45 -07:00
Christoph Hellwig 5a97806f7d block: change the blk_queue_split calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.

Also give it and the helpers used to implement it more descriptive names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Joel Granados eb867ee995 nvme: enable generic interface (/dev/ngXnY) for unknown command sets
Extend nvme_alloc_ns() and nvme_validate_ns() for unknown command-set as
well. Both are made to use a new helper (nvme_update_ns_info_cs_indep)
which is similar to nvme_update_ns_info but performs fewer operations
to get the generic interface up.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
[hch: rebased on other refactoring patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig 1e4ea66af1 nvme: factor out a nvme_ns_is_readonly helper
Add a little helper to check if a namespace should be marked read-only
that uses a new is_readonly flag in the nvme_ns_info structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig 1a893c2bfe nvme: refactor namespace probing
Change nvme_ns_scan to gather all information needed for generic
namespace setup into a nvme_ns_info structure.  This structure is filled
from the Command Set Idependent Identify Namespace data structure if
it is available or else the legacy Identify namespace structure.

With that everything related to the NVM command set (and the ZNS command
set derived from it) can be encapsulated in the nvme_update_ns_info_block
function while keeping the rest of the namespace probing generic.

The downside is that we now always issue two Identify Namespace calls for
each probed namespace instead of usually just a single one previously.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:53 -06:00
Christoph Hellwig 71882e7d23 nvme: generalize the nvme_multi_css check in nvme_scan_ns
Check for multiple command set support early on an error out if is
not supported when a !NVM command set namespace is found.  This
prepares for adding command set independent passthrough support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig 04c170f669 nvme: rename nvme_validate_or_alloc_ns to nvme_scan_ns
This shorter name much better fits what this function does in
the scanning process.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig e06b425bc8 nvme: catch -ENODEV from nvme_revalidate_zones again
nvme_revalidate_zones can also return -ENODEV if e.g. zone sizes aren't
constant or not a power of two.  In that case we should jump to marking
the gendisk hidden and only support pass through.

Fixes: 602e57c979 ("nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info")
Reported-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig 2fec1dfc28 nvme-apple: stop casting function pointer signatures
Casting function pointers breaks control flow enforcement and is
generally a horrible coding style.

Add two wrappers to get rid of these casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:51 -06:00
Christoph Hellwig 2f7a7e5d85 nvme-tcp: split nvme_tcp_alloc_tagset
Split nvme_tcp_alloc_tagset into one helper for the admin tag_set and
one for the I/O tag set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Christoph Hellwig a7f7b7116c nvme-rdma: split nvme_rdma_alloc_tagset
Split nvme_rdma_alloc_tagset into one helper for the admin tag_set and
one for the I/O tag set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Christoph Hellwig 2455a4b778 nvme-pci: split nvme_dev_add
Split nvme_dev_add into a helper to actually allocate the tag set, and
one that just update the number of queues.  Add a local variable for
the tag_set to clean up the code a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Christoph Hellwig f91b727ccf nvme-pci: split nvme_alloc_admin_tags
Split nvme_alloc_admin_tags into a helper to actually allocate the
tag set, and one that just restarts the admin queue.  Add a local
variable for the tag_set to clean up the code a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Christoph Hellwig 8614144002 nvme-pci: print the command name of aborted commands
To allow for slightly better debugging, print the command name when
aborting an command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Liu Song 33b6debd61 nvme-pci: remove useless assignment in nvme_pci_setup_prps
If prp_list is NULL, nvme_unmap_sg will be performed, and the assignment
to first_dma is meaningless, so remove it.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Nick Bowler a25d426158 nvme: define compat_ioctl again to unbreak 32-bit userspace.
Commit 89b3d6e605 ("nvme: simplify the compat ioctl handling") removed
the initialization of compat_ioctl from the nvme block_device_operations
structures.

Presumably the expectation was that 32-bit ioctls would be directed
through the regular handler but this is not the case: failing to assign
.compat_ioctl actually means that the compat case is disabled entirely,
and any attempt to submit nvme ioctls from 32-bit userspace fails
outright with -ENOTTY.

For example:

  % smartctl -x /dev/nvme0n1
  [...]
  Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Inappropriate ioctl for device

The blkdev_compat_ptr_ioctl helper can be used to direct compat calls
through the main ioctl handler and makes things work again.

Fixes: 89b3d6e605 ("nvme: simplify the compat ioctl handling")
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Christoph Hellwig eb7e2d9258 nvme: don't always build constants.o
The entire content of constants.c if guarded by an ifdef, so switch to
just building the file conditionally instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:48 -06:00
Bean Huo 679c54f2de nvme: use command_id instead of req->tag in trace_nvme_complete_rq()
Use command_id instead of req->tag in trace_nvme_complete_rq(),
because of commit e7006de6c2 ("nvme: code command_id with a genctr
for use authentication after release"), cmd->common.command_id is set to
((genctl & 0xf)< 12 | req->tag), no longer req->tag, which makes cid in
trace_nvme_complete_rq and trace_nvme_setup_cmd are not the same.

Fixes: e7006de6c2 ("nvme: code command_id with a genctr for use authentication after release")
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:46 -06:00
Joel Granados c13cf14f44 nvme-multipath: refactor nvme_mpath_add_disk
Pass anagrpid as second argument. This is prep patch that allows reusing
this function for supporting unknown command sets.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
Guixin Liu 0f89f0ece5 nvme-apple: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
Guixin Liu 1fcfca7812 nvme-pci: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ruozhu Li <liruozhu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
Caleb Sander 53ee9e2937 nvme-tcp: use in-capsule data for I/O connect
Currently, command data is only sent in-capsule on the for admin or I/O
commands on queues that indicate support for it.  Send fabrics command
data in-capsule for I/O queues as well to avoid needing a separate
H2CData PDU for the connect command.

This is optimization. Without this change, we send the connect command
capsule and data in separate PDUs (CapsuleCmd and H2CData), and must wait
for the controller to respond with an R2T PDU before sending the H2CData.

With the change, we send a single CapsuleCmd PDU that includes the data.
This reduces the number of bytes (and likely packets) sent across the network,
and simplifies the send state machine handling in the driver.

Signed-off-by: Caleb Sander <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
Israel Rukshin 0525af711b nvme-rdma: remove timeout for getting RDMA-CM established event
In case many controllers start error recovery at the same time (i.e.,
when port is down and up), they may never succeed to reconnect again.
This is because the target can't handle all the connect requests at
three seconds (the arbitrary value set today). Even if some of the
connections are established, when a single queue fails to connect,
all the controller's queues are destroyed as well. So, on the
following reconnection attempts the number of connect requests may
remain the same. To fix this, remove the timeout and wait for RDMA-CM
event to abort/complete the connect request. RDMA-CM sends unreachable
event when a timeout of ~90 seconds is expired. This approach is used
at other RDMA-CM users like SRP and iSER at blocking mode. The commit
also renames NVME_RDMA_CONNECT_TIMEOUT_MS to NVME_RDMA_CM_TIMEOUT_MS.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:22:41 -06:00
Hannes Reinecke b61775d185 nvme-auth: Diffie-Hellman key exchange support
Implement Diffie-Hellman key exchange using FFDHE groups
for NVMe In-Band Authentication.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:49 -06:00
Hannes Reinecke f50fff73d6 nvme: implement In-Band authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds two new fabric options 'dhchap_secret' to specify the
pre-shared key (in ASCII respresentation according to NVMe 2.0 section
8.13.5.8 'Secret representation') and 'dhchap_ctrl_secret' to specify
the pre-shared controller key for bi-directional authentication of both
the host and the controller.
Re-authentication can be triggered by writing the PSK into the new
controller sysfs attribute 'dhchap_secret' or 'dhchap_ctrl_secret'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: fold in clang build fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:49 -06:00
Hannes Reinecke 3bf2fde6fc nvme-fabrics: decode 'authentication required' connect error
The 'connect' command might fail with NVME_SC_AUTH_REQUIRED, so we
should be decoding this error, too.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:47 -06:00
Chaitanya Kulkarni b10907b815 nvme: fix qid param blk_mq_alloc_request_hctx
Only caller of the __nvme_submit_sync_cmd() with qid value not equal to
NVME_QID_ANY is nvmf_connect_io_queues(), where qid value is alway set
to > 0.

[1] __nvme_submit_sync_cmd() callers with  qid parameter from :-

        Caller                  |   qid parameter
------------------------------------------------------
* nvme_fc_connect_io_queues()   |
   nvmf_connect_io_queue()      |      qid > 0
* nvme_rdma_start_io_queues()   |
   nvme_rdma_start_queue()      |
    nvmf_connect_io_queues()    |      qid > 0
* nvme_tcp_start_io_queues()    |
   nvme_tcp_start_queue()       |
    nvmf_connect_io_queues()    |      qid > 0
* nvme_loop_connect_io_queues() |
   nvmf_connect_io_queues()     |      qid > 0

When qid value of the function parameter __nvme_submit_sync_cmd() is > 0
from above callers, we use blk_mq_alloc_request_hctx(), where we pass
last parameter as 0 if qid functional parameter value is set to 0 with
conditional operators, see 1002 :-

991 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 992                 union nvme_result *result, void *buffer, unsigned bufflen,
 993                 int qid, int at_head, blk_mq_req_flags_t flags)
 994 {
 995         struct request *req;
 996         int ret;
 997
 998         if (qid == NVME_QID_ANY)
 999                 req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags);
1000         else
1001                 req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), flags,
1002                                                 qid ? qid - 1 : 0);
1003

But qid function parameter value of the __nvme_submit_sync_cmd() will
never be 0 from above caller list see [1], and all the other callers of
__nvme_submit_sync_cmd() use NVME_QID_ANY as qid value :-
1. nvme_submit_sync_cmd()
2. nvme_features()
3. nvme_sec_submit()
4. nvmf_reg_read32()
5. nvmf_reg_read64()
6. nvmf_ref_write32()
7. nvmf_connect_admin_queue()

Remove the conditional operator to pass the qid as 0 in the call to
blk_mq_alloc_requst_hctx().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:47 -06:00
Chaitanya Kulkarni 6b46fa024a nvme: remove unused timeout parameter
The function __nvme_submit_sync_cmd() has following list of callers
that sets the timeout value to 0 :-

        Callers               |   Timeout value
------------------------------------------------
nvme_submit_sync_cmd()        |        0
nvme_features()               |        0
nvme_sec_submit()             |        0
nvmf_reg_read32()             |        0
nvmf_reg_read64()             |        0
nvmf_reg_write32()            |        0
nvmf_connect_admin_queue()    |        0
nvmf_connect_io_queue()       |        0

Remove the timeout function parameter from __nvme_submit_sync_cmd() and
adjust the rest of code accordingly.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:47 -06:00
Michael Kelley 2c61c97fb1 nvme: handle the persistent internal error AER
In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.

Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.

This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:47 -06:00
Xiang wangx b7df575f8a nvme: remove a double word in a comment
Delete the redundant word 'be'.

Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 17:14:47 -06:00
Linus Torvalds c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Logan Gunthorpe 91fb2b6052 nvme-pci: convert to using dma_map_sgtable()
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-26 07:28:07 -04:00
Logan Gunthorpe 2f8594412b nvme-pci: check DMA ops when indicating support for PCI P2PDMA
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-26 07:28:07 -04:00
Tobias Gruetzmacher d6c52fa3e9 nvme-pci: Crucial P2 has bogus namespace ids
This adds a quirk for the Crucial P2.

Signed-off-by: Tobias Gruetzmacher <tobias-git@23.gs>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-25 07:34:07 +02:00
Bart Van Assche f9ed86dc1d nvme/host: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-38-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:32 -06:00
Israel Rukshin 6961b5e028 nvme: fix block device naming collision
The issue exists when multipath is enabled and the namespace is
shared, but all the other controller checks at nvme_is_unique_nsid()
are false. The reason for this issue is that nvme_is_unique_nsid()
returns false when is called from nvme_mpath_alloc_disk() due to an
uninitialized value of head->shared. The patch fixes it by setting
head->shared before nvme_mpath_alloc_disk() is called.

Fixes: 5974ea7ce0 ("nvme: allow duplicate NSIDs for private namespaces")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-14 16:35:25 +02:00
Keith Busch 081f5e753c nvme-pci: fix freeze accounting for error handling
A reset on a live device experiencing a link error still needs to have
the queue freeze state started for the subsequent reinitialization. Skip
only the register read if the device is not present instead of bypassing
the freeze checks.

Fixes: b98235d3a4 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
Reported-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-14 16:35:25 +02:00
Keith Busch 5c629dc960 nvme: use struct group for generic command dwords
This will allow the trace event to know the full size of the data
intended to be copied and silence read overflow checks.

Reported-by: John Garry <john.garry@huawei.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-06 19:12:56 +02:00
Keith Busch 73029c9b23 nvme-pci: phison e16 has bogus namespace ids
Add the quirk.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216049
Reported-by: Chris Egolf <cegolf@ugholf.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-06 18:18:41 +02:00
Christoph Hellwig d86e716aa4 block: move zone related fields to struct gendisk
Move the zone related fields that are currently stored in
struct request_queue to struct gendisk as these are part of the highlevel
block layer API and are only used for non-passthrough I/O that requires
the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig 982977df48 block: pass a gendisk to blk_queue_max_open_zones and blk_queue_max_active_zones
Switch to a gendisk based API in preparation for moving all zone related
fields from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
Christoph Hellwig 6b2bd27474 block: pass a gendisk to blk_queue_set_zoned
Prepare for storing the zone related field in struct gendisk instead
of struct request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:46:26 -06:00
John Garry 2dd6532e95 blk-mq: Drop 'reserved' arg of busy_tag_iter_fn
We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter
function so it may be dropped.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
John Garry 9bdb4833dd blk-mq: Drop blk_mq_ops.timeout 'reserved' arg
With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.

This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06 06:33:53 -06:00
Lamarque Vieira Souza e1c70d7934 nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA IM2P33F8ABR1
ADATA IM2P33F8ABR1 reports bogus eui64 values that appear to be the same
across all drives. Quirk them out so they are not marked as "non globally
unique" duplicates.

Co-developed-by: Felipe de Jesus Araujo da Conceição <felipe.conceicao@petrosoftdesign.com>
Signed-off-by: Felipe de Jesus Araujo da Conceição <felipe.conceicao@petrosoftdesign.com>
Signed-off-by: Lamarque V. Souza <lamarque.souza@petrosoftdesign.com>

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-30 08:24:33 +02:00
Ruozhu Li f7f70f4aa0 nvme: fix regression when disconnect a recovering ctrl
We encountered a problem that the disconnect command hangs.
After analyzing the log and stack, we found that the triggering
process is as follows:
CPU0                          CPU1
                                nvme_rdma_error_recovery_work
                                  nvme_rdma_teardown_io_queues
nvme_do_delete_ctrl                 nvme_stop_queues
  nvme_remove_namespaces
  --clear ctrl->namespaces
                                    nvme_start_queues
                                    --no ns in ctrl->namespaces
    nvme_ns_remove                  return(because ctrl is deleting)
      blk_freeze_queue
        blk_mq_freeze_queue_wait
        --wait for ns to unquiesce to clean infligt IO, hang forever

This problem was not found in older kernels because we will flush
err work in nvme_stop_ctrl before nvme_remove_namespaces.It does not
seem to be modified for functional reasons, the patch can be revert
to solve the problem.

Revert commit 794a4cb3d2 ("nvme: remove the .stop_ctrl callout")

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-29 16:13:45 +02:00
Pablo Greco 1629de0e03 nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG SX6000LNP (AKA SPECTRIX S40G)
ADATA XPG SPECTRIX S40G drives report bogus eui64 values that appear to
be the same across drives in one system. Quirk them out so they are
not marked as "non globally unique" duplicates.

Before:
[    2.258919] nvme nvme1: pci function 0000:06:00.0
[    2.264898] nvme nvme2: pci function 0000:05:00.0
[    2.323235] nvme nvme1: failed to set APST feature (2)
[    2.326153] nvme nvme2: failed to set APST feature (2)
[    2.333935] nvme nvme1: allocated 64 MiB host memory buffer.
[    2.336492] nvme nvme2: allocated 64 MiB host memory buffer.
[    2.339611] nvme nvme1: 7/0/0 default/read/poll queues
[    2.341805] nvme nvme2: 7/0/0 default/read/poll queues
[    2.346114]  nvme1n1: p1
[    2.347197] nvme nvme2: globally duplicate IDs for nsid 1
After:
[    2.427715] nvme nvme1: pci function 0000:06:00.0
[    2.427771] nvme nvme2: pci function 0000:05:00.0
[    2.488154] nvme nvme2: failed to set APST feature (2)
[    2.489895] nvme nvme1: failed to set APST feature (2)
[    2.498773] nvme nvme2: allocated 64 MiB host memory buffer.
[    2.500587] nvme nvme1: allocated 64 MiB host memory buffer.
[    2.504113] nvme nvme2: 7/0/0 default/read/poll queues
[    2.507026] nvme nvme1: 7/0/0 default/read/poll queues
[    2.509467] nvme nvme2: Ignoring bogus Namespace Identifiers
[    2.512804] nvme nvme1: Ignoring bogus Namespace Identifiers
[    2.513698]  nvme1n1: p1

Signed-off-by: Pablo Greco <pgreco@centosproject.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-29 16:13:45 +02:00
Sagi Grimberg 41d07df7de nvme-tcp: always fail a request when sending it failed
queue stoppage and inflight requests cancellation is fully fenced from
io_work and thus failing a request from this context. Hence we don't
need to try to guess from the socket retcode if this failure is because
the queue is about to be torn down or not.

We are perfectly safe to just fail it, the request will not be cancelled
later on.

This solves possible very long shutdown delays when the users issues a
'nvme disconnect-all'

Reported-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-29 16:13:45 +02:00
Christoph Hellwig 8b9ab62662 block: remove blk_cleanup_disk
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:33:15 -06:00
Christoph Hellwig 6f8191fdf4 block: simplify disk shutdown
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.

Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.

This saves an extra queue freeze for devices without a separately
allocated queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:30:26 -06:00
Christoph Hellwig e648783318 nvme: move the Samsung X5 quirk entry to the core quirks
This device shares the PCI ID with the Samsung 970 Evo Plus that
does not need or want the quirks.  Move the the quirk entry to the
core table based on the model number instead.

Fixes: bc360b0b16 ("nvme-pci: add quirks for Samsung X5 SSDs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
2022-06-23 15:22:22 +02:00
Leo Savernik 41f38043f8 nvme: add a bogus subsystem NQN quirk for Micron MTFDKBA2T0TFH
The Micron MTFDKBA2T0TFH device reports the same subsysem NQN for
all devices.  Add a quick to ignore it.

Signed-off-by: Leo Savernik <l.savernik@aon.at>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-23 15:22:18 +02:00
rasheed.hsueh 43047e082b nvme-pci: disable write zeros support on UMIC and Samsung SSDs
Like commit 5611ec2b98 ("nvme-pci: prevent SK hynix PC400 from using
Write Zeroes command"), UMIS and Samsung has the same issue:
[ 6305.633887] blk_update_request: operation not supported error,
dev nvme0n1, sector 340812032 op 0x9:(WRITE_ZEROES) flags 0x0
phys_seg 0 prio class 0

So also disable Write Zeroes command on UMIS and Samsung.

Signed-off-by: rasheed.hsueh <rasheed.hsueh@lcfc.corp-partner.google.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:57 +02:00
Ning Wang 6b961bce50 nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs
When ZHITAI TiPro7000 SSDs entered deepest power state(ps4)
it has the same APST sleep problem as Kingston A2000.
by chance the system crashes and displays the same dmesg info:

https://bugzilla.kernel.org/show_bug.cgi?id=195039#c65

As the Archlinux wiki suggest (enlat + exlat) < 25000 is fine
and my testing shows no system crashes ever since.
Therefore disabling the deepest power state will fix the APST sleep issue.

https://wiki.archlinux.org/title/Solid_state_drive/NVMe

This is the APST data from 'nvme id-ctrl /dev/nvme1'

NVME Identify Controller:
vid       : 0x1e49
ssvid     : 0x1e49
sn        : [...]
mn        : ZHITAI TiPro7000 1TB
fr        : ZTA32F3Y
[...]
ps    0 : mp:3.50W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:3.30W operational enlat:50 exlat:100 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.80W operational enlat:50 exlat:200 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.1500W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0200W non-operational enlat:2000 exlat:60000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Signed-off-by: Ning Wang <ningwang35@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Keith Busch c4f01a776b nvme-pci: sk hynix p31 has bogus namespace ids
Add the quirk.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216049
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Keith Busch c98a879312 nvme-pci: smi has bogus namespace ids
Add the quirk.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216096
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Keith Busch 2cf7a77ed5 nvme-pci: phison e12 has bogus namespace ids
Add the quirk.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216049
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Stefan Reiter 3765fad508 nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50
ADATA XPG GAMMIX S50 drives report bogus eui64 values that appear to
be the same across drives in one system. Quirk them out so they are
not marked as "non globally unique" duplicates.

Signed-off-by: Stefan Reiter <stefan@pimaker.at>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Keith Busch 4641a8e6e1 nvme-pci: add trouble shooting steps for timeouts
Many users have encountered IO timeouts with a CSTS value of 0xffffffff,
which indicates a failure to read the register. While there are various
potential causes for this observation, faulty NVMe APST has been the
culprit quite frequently. Add the recommended troubleshooting steps in
the error output when this condition occurs.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:56:56 +02:00
Keith Busch 2f0dad1719 nvme: add bug report info for global duplicate id
The recent global id check is finding poorly implemented devices in the
wild. Include relavant device information in the output to help quicken
an appropriate quirk patch.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-06-13 19:54:14 +02:00